SETGEN {SETGEN} | R Documentation |
The function searches for sets of correlated variables (genes), or takes the sets as a parameter, and tests them for correlation with a given quantitative variable or a two-class indicator variable. The familywise error rate (FWER) is controlled in the strong sense.
SETGEN(x, grpdata = NULL, y,
       resp.type = c("Quantitative","Two class unpaired"),
       r2min = 0.5, rposonly = TRUE,
       direction = c("both","positive","negative"),
       pivdom = TRUE,
       transform.beta = c("none","binary","truncated"),
       minbeta = 0.0, samedirect = FALSE, maxgene = NULL,
       adj.method = c("Single step", "Step down"),
       nres = 100, rot = FALSE, rand = FALSE, quiet = TRUE,
       thresholdfun = NULL, threshold = list(NULL,NULL),
       ranks = FALSE, file = "", p.details = 0.05,
       crit.vals = sort(unique(c(0.1,0.05,0.01,0.001,p.details)), decreasing=TRUE))
x |
An n by p matrix or a data.frame with columns representing variables
(features, genes) and rows representing samples. The variables (columns) have to be named. |
grpdata |
If NULL then sets of correlated variables (genes) based on x
will be created and tested for correlation with y. Otherwise, sets which are pre-defined
in grpdata will be tested. Two formats of grpdata are possible.
One option is a data.frame with 2 columns as written to a file with extension ".details" by this function.
The first column named "groups" contains the names of the source variables of the sets and
the second column named "members" contains the names of the members of the pre-defined sets.
The second option is a list in the format returned by SETGEN.generate.sets
or in the component grpdata in the output of this function. |
y |
A vector representing the response variable. It should be numeric for a quantitative response or it should be a vector which can be converted to a factor with two levels, for a two class response. |
resp.type |
Type of the response variable. "Quantitative" for a continuous variable, "Two class unpaired" for a variable with two categories. |
r2min |
Lower bound for a squared correlation between a source variable (center) of a set and any of its other members. The parameter is used for creating sets of variables. |
rposonly |
A logical variable. If TRUE then a variable is included in a set
only if the correlation between the variable and the source variable of the set is positive.
If FALSE then this correlation can be positive or negative. |
direction |
A variable with 3 possible values indicating whether the test should be
one-sided or two-sided. If "positive" then the statistics of those sets whose
source variables correlate negatively with the response are set to zero (p-value equal to 1).
This amounts to a one-sided test against an alternative in the positive direction
of the correlation. "negative" is treated by analogy. If "both" then the test is two-sided.
In that case, a significant set whose source variable correlates positively (negatively) with the response
is interpreted as significant in the positive (negative) direction.
If direction is "positive" or "negative" then rposonly has to be TRUE,
because it would make no sense to do a one-sided test with a set of variables some of which
are anti-correlated. |
pivdom |
A logical variable. If TRUE then sets of variables are created
such that their source variables dominate other members of the sets with respect
to their total sum of squares (i.e., also with respect to the sample variance).
This setting reduces the number of the created sets.
In the current implementation, this setting has no effect if ranks==TRUE because
rank-transformed variables all have the same variance. |
transform.beta |
A parameter with three possible values. If "none" then the individual
Beta statistics of the variables will not be transformed. If "binary" then the statistics
will be set to zero or 1 depending on whether they are smaller or greater than minbeta,
respectively. If "truncated" then the statistics smaller than minbeta are set to 0.
These transformations are carried out in the original and in the permuted data. They amount to
a modification in the definition of the statistics of the individual variables. |
minbeta |
A value between 0 and 1 giving the threshold for the transformation of the individual
Beta statistics. If transform.beta=="none" then minbeta has no effect. |
samedirect |
A logical variable. If TRUE then the individual statistics
of the member variables of a set which correlate with the response with a different sign
than the source variable of the set are given the value of 0. For example, if a source variable of
a set has a positive correlation with the response then the statistics of the variables
of the set which correlate negatively with the response are zeroed and they do not contribute
to the summary statistic of the set. This redefinition of the individual statistics is applied
to the original and to the permuted data. In this way, the number of false positive results
is not inflated. |
maxgene |
The maximal number of the variables of a set contributing to the summary
statistic of the set. These variables are: the source variable and maxgene-1 variables
with the largest univariate Beta statistics. The statistics of the other variables are set to 0.
This redefinition of the summary statistic of a set is applied to the original
and to the permuted data. If NULL then this redefinition is not applied. |
adj.method |
If "Single step" then the corresponding method by Westfall and Young (1993) is used for adjusting the p-values of the sets. "Step down" refers to the less conservative version of the method by Westfall and Young. Please note: only the "Single step" method guarantees the strong control of the familywise error rate if the sets of variables are created and tested based on the same data set. However, if the sets of variables are pre-defined based on information independent from the data in which they are tested then the "Step down" method controls the familywise error rate too. |
nres |
The number of the random resamplings (permutations) which are used to determine
the adjusted p-values.
If nres is equal to 0 then all possible permutations are carried out.
However, if nres==0 and the number of the samples in the data set is greater than 12
then only 1000 random permutations are computed. |
rot |
If TRUE then rotations are used instead of permutations
(see Laeuter et al. (2005)). Rotations are not possible if the data are rank-transformed
with ranks = TRUE. |
rand |
rand determines whether the random number generator should be randomly initiated.
If FALSE (default) repetitions of the same analysis setup lead to exactly the same results. |
quiet |
A logical variable. Default value is TRUE. In this case, no summary information
about the chosen analysis procedure is printed to the screen or to a file (if file is specified). |
thresholdfun |
A function or a list of functions for pre-filtering the variables
before creating sets of variables and testing them. The function is applied to each variable.
Whether a variable is excluded is determined based on the value returned by the
function and based on the value of the threshold. If more than one filtering function is
to be used then they should be given as a list of functions. For a variable to remain in the
analysis it should pass all thresholds for all filtering functions. If grpdata != NULL
then the pre-filtering is applied to the variables from the given sets of variables. Two functions
for pre-filtering of variables -
sqsum (sum of the squares) and varcoefficient (variation coefficient) are included
in the package. |
threshold |
If there is only one function in thresholdfun then threshold is a list
with 2 elements. If thresholdfun is a list of functions, then threshold is a list
of lists, each with 2 elements. The two elements of each list correspond to
the lower and the upper limit for the values of the corresponding function in thresholdfun.
If an element of a list is NULL then there is no prespecified upper or lower limit.
All limits for all functions in thresholdfun have to be met in order for the variable
to remain in the analysis after the pre-filtering. The pre-filtering takes place before a possible
rank-transformation so that setting ranks=TRUE does not disturb the pre-filtering. |
ranks |
A logical variable. If TRUE the original data values are replaced
by their ranks. |
file |
The name of the output files. If given, three text files are created: file.summary,
file.overview and file.details, corresponding to the components of the returned list.
If file equals "" (default) the results are not saved in text files. |
p.details |
Threshold on p-values of the sets which are to be returned in the details
component. If no set has a smaller p-value then the ten sets with the smallest p-values are returned. |
crit.vals |
Significance levels for which critical values should be computed. If a statistic of a set of variables exceeds a critical value then the set is significant on the significance level for which the critical value was computed. |
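The pre-filtering described for thresholdfun and threshold can be sketched in base R as follows. This is only an illustration under stated assumptions: my_sqsum and my_varcoef are local stand-ins written for this example (the package's own sqsum and varcoefficient may be defined differently), and the limits are assumed to be inclusive.

```r
# Local stand-ins for the package's filter functions (assumed definitions,
# not copied from the package).
my_sqsum   <- function(v) sum((v - mean(v))^2)  # sum of squared deviations
my_varcoef <- function(v) sd(v) / mean(v)       # coefficient of variation

set.seed(1)
x <- matrix(rnorm(20 * 5, mean = 10), nrow = 20,
            dimnames = list(NULL, paste0("v", 1:5)))
x[, "v3"] <- 10                                 # a constant column

# One filter function per entry, one list(lower, upper) per function.
funs   <- list(my_sqsum, my_varcoef)
limits <- list(list(1, NULL),                   # my_sqsum: lower limit 1, no upper limit
               list(NULL, 0.5))                 # my_varcoef: no lower limit, upper limit 0.5

# TRUE if value v respects both limits; NULL means "no limit".
# Inclusive comparison is an assumption of this sketch.
within_limits <- function(v, lim) {
  (is.null(lim[[1]]) || v >= lim[[1]]) && (is.null(lim[[2]]) || v <= lim[[2]])
}

# A variable stays only if it passes every filter.
keep <- apply(x, 2, function(col) {
  all(mapply(function(f, lim) within_limits(f(col), lim), funs, limits))
})
x_filtered <- x[, keep, drop = FALSE]
colnames(x_filtered)
```

In this toy data the constant column v3 has a sum of squared deviations of zero and is therefore dropped by the first filter. With SETGEN itself, the analogous arguments would presumably be passed as thresholdfun = funs and threshold = limits.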
The function generates sets of correlated variables and tests them for correlation with a response. The significance of the generated sets is determined by permuting (see Westfall and Young (1993) and
Laeuter et al. (2009) in Section 2) or rotating (Laeuter et al. 2005) the samples.
Every variable in the data is treated as a center of a set of variables. Other variables are added
to the set based on their correlation with the center (source variable).
Then, the created sets of variables are evaluated with respect to their correlation with a response.
Currently a quantitative (continuous) or a two-class response is allowed. The null hypothesis for a set
is that it contains only null variables (i.e., variables which do not correlate with the response).
The method controls the familywise error rate (FWER) exactly and in the strong sense, i.e.,
the probability that at least one set containing only null variables is among the significant sets is at most alpha.
The novelty of the method lies in the fact that it allows the sets of variables to be created and tested
based on the same data while keeping the FWER.
Please note that in this case, adj.method must be set to "Single step" to control the FWER.
A detailed description of the method is given in Laeuter et al. (2009).
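The set-creation step described above (every variable acts as a center, and other variables join its set based on their correlation with it) can be sketched in base R. The helper make_sets is hypothetical and written only for this illustration; it is not the package's implementation, it ignores pivdom, and it assumes the r2min bound is inclusive.

```r
# Sketch of correlation-based set creation (not the package's code).
make_sets <- function(x, r2min = 0.5, rposonly = TRUE) {
  r <- cor(x)                                   # p x p correlation matrix
  sets <- lapply(seq_len(ncol(x)), function(j) {
    ok <- r[, j]^2 >= r2min                     # squared correlation with center j
    if (rposonly) ok <- ok & r[, j] > 0         # keep positively correlated members only
    colnames(x)[ok]                             # the center correlates perfectly with itself
  })
  names(sets) <- colnames(x)                    # sets are named by their source variable
  sets
}

set.seed(100)
f <- rnorm(20)                                  # shared factor driving v1..v3
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(NULL, paste0("v", 1:6)))
x[, 1:3] <- x[, 1:3] * 0.2 + f                  # v1..v3 become strongly correlated
sets <- make_sets(x, r2min = 0.5)
sets$v1
```

Each resulting set contains its own center, and the three variables built from the shared factor end up in one another's sets. Raising r2min towards 1 shrinks every set towards its center alone.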
A second use of the function is testing pre-defined sets of variables with respect to a response variable.
This is done by specifying the sets of variables in the grpdata parameter.
Then, the parameters for creating sets of variables, i.e., r2min, rposonly and pivdom, have no effect.
In the case of the pre-defined sets of variables, adj.method can be set to "Step down" without inflating the FWER.
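As a rough illustration of the two grpdata formats, the sketch below builds a small two-column data.frame and converts it into a named list with base R. The variable names are invented, and the list layout (one character vector of members per source variable, with the source listed among its own members) is an assumption based on the description of the grpdata component of the return value.

```r
# Hypothetical pre-defined sets; the variable names are invented.
grp_df <- data.frame(
  groups  = c("v1", "v1", "v1", "v7", "v7"),    # source variable of each set
  members = c("v1", "v2", "v5", "v7", "v9"),    # members of the corresponding set
  stringsAsFactors = FALSE
)

# Assumed list format: components named by source variable, each holding
# the names of that set's members.
grp_list <- split(grp_df$members, grp_df$groups)
str(grp_list)
```

Either object would then presumably be passed as grpdata, provided the member names match the column names of x.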
A list with components
summary |
A list with the parameters of the analysis.
This list can be printed in a readable format by the function SETGEN.print.parameters. |
overview |
A data.frame with three columns "variable", "tval" and "pval".
"variable" is the name of the source variable of a set. "tval" is the test statistic of the set.
"pval" is the adjusted p-value. The number of rows of the data.frame is equal to the
number of the tested sets of variables (genes). The sets are ordered by decreasing statistics. |
grpdata |
A list of the names of the variables belonging to each set. The names of the components of the list are the names of the respective source variables. |
details |
A list of data.frames with four columns: "groups", "members", "correlation", "sqsum".
"groups" are the names of the source variables repeated as many times as there are variables
in the sets. "members" are the names of the members of the corresponding sets. "correlation"
is the correlation of each variable with the response variable. "sqsum" is the sum of squared
deviations from the mean of each variable. The list is ordered as in overview, i.e. by decreasing
statistics. Note that in the current implementation "sqsum" will be identical
for all variables if ranks==TRUE since rank-transformed variables have the same variance. |
crit.vals |
Critical values corresponding to the significance levels specified in
the parameter crit.vals. If a statistic of a set of variables exceeds a critical value
then the set is significant according to the "Single step" procedure on the significance level
for which the critical value was computed. |
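The relation between adjusted p-values and critical values in a single-step procedure can be sketched with a generic max-statistic permutation scheme in the spirit of Westfall and Young (1993). This is a minimal sketch under assumptions: set_stats is a hypothetical helper, the set statistic (sum of squared correlations with y) is a stand-in chosen for the example, and nothing here is claimed to reproduce the statistic or the resampling details used by SETGEN.

```r
# Stand-in set statistic: sum of squared correlations of the members with y.
set_stats <- function(x, y, sets) {
  r2 <- as.vector(cor(x, y))^2
  names(r2) <- colnames(x)
  vapply(sets, function(m) sum(r2[m]), numeric(1))
}

set.seed(1)
n <- 20
y <- rep(c(0, 1), each = n / 2)
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("v", 1:6)))
x[, 1:2] <- x[, 1:2] + 2 * y                    # v1 and v2 carry signal
sets <- list(v1 = c("v1", "v2"), v3 = c("v3", "v4"), v5 = c("v5", "v6"))

t_obs <- set_stats(x, y, sets)
nres  <- 1000
# Null distribution of the maximum statistic over all sets (y is permuted)
max_null <- replicate(nres, max(set_stats(x, sample(y), sets)))

# Single-step adjusted p-value: share of permutations whose maximum reaches t_obs
p_adj <- vapply(t_obs, function(t) mean(max_null >= t), numeric(1))

# Critical values: upper quantiles of the permutation maxima
crit_levels <- c(0.1, 0.05, 0.01)
crit <- quantile(max_null, probs = 1 - crit_levels, names = FALSE)
names(crit) <- crit_levels
p_adj
crit
```

In this scheme a set whose observed statistic exceeds the critical value for a given level also has an adjusted p-value at or below that level, which is the sense in which the critical values summarize the single-step procedure.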
If a file is specified, the function writes the results to several files: the summary information with the parameters of the analysis is written to "file.summary" and the results are stored in "file.overview" and "file.details".
Rosolowski, M., Laeuter, J., Beck, M.; <maciej.rosolowski@imise.uni-leipzig.de>
Laeuter, J., Glimm, E., Eszlinger, M. 2005, Search for relevant sets of variables in a high-dimensional setup keeping the familywise error rate. Statistica Neerlandica Vol. 59, No. 3, pp. 298-312.
Laeuter, J. 2007, Hochdimensionale Statistik, Anwendung in der Genexpressionsanalyse (in German; English title: High Dimensional Statistics, Application to Gene Expression Analysis). Leipzig Bioinformatics Working Paper, No. 15. http://www.izbi.uni-leipzig.de/izbi/Working%20Paper/2007/WP_15_Statistik.pdf.
Laeuter, J., Horn, F., Rosolowski, M., Glimm, E. 2009, High-dimensional data analysis: Selection of variables and representation of results - Application to gene expression. Biometrical Journal
Westfall, P.H. and Young, S.S. 1993, Resampling-based Multiple Testing. New York: Wiley & Sons, Inc.
# generate data with two subsets coming
# from one-factor models
set.seed(100)
y <- c(rep(0, 10), rep(1,10))
mu2 <- 0.5 # difference between the means of the two groups of samples
f1 <- matrix(rep(c(rnorm(10), rnorm(10, mean = mu2)), 10), ncol = 10)
f2 <- matrix(
  rep(c(rnorm(5), rnorm(5, mean = mu2), rnorm(5), rnorm(5, mean = mu2)), 10),
  ncol = 10
)
x <- matrix(rnorm(20*100), nrow=20)
x[,1:10] <- x[,1:10] + f1
x[,11:20] <- x[,11:20] + f2
colnames(x) <- paste("v",1:100, sep="")

# looking for significant single variables (e.g., differentially expressed genes)
res1 <- SETGEN(x = x, y = y, resp.type = "Two class unpaired",
               pivdom = FALSE, r2min = 1, nres = 1000)
res1$overview[1:10,]

# looking for significant sets of correlated variables (genes)
res2 <- SETGEN(x = x, y = y, resp.type = "Two class unpaired",
               pivdom = FALSE, r2min = 0.5, nres = 1000)
res2$overview[1:10,]

# the same using SETGEN.generate.sets first
genesets <- SETGEN.generate.sets(X1 = x, r2min1 = 0.5, pivdom = FALSE)
res3 <- SETGEN(x = x, y = y, grpdata = genesets,
               resp.type = "Two class unpaired", nres = 1000)
res3$overview[1:10,]