SETGEN {SETGEN}    R Documentation

Function to test sets of variables for correlation with a response

Description

The function searches for sets of correlated variables (genes), or takes the sets as a parameter, and tests them for correlation with a given quantitative variable or a two-class indicator variable. The familywise error rate (FWER) is controlled in the strong sense.

Usage

SETGEN(x, grpdata = NULL, y, resp.type = c("Quantitative","Two class unpaired"),
r2min = 0.5, rposonly = TRUE, direction = c("both","positive","negative"), 
pivdom = TRUE, transform.beta = c("none","binary","truncated"), 
minbeta = 0.0, samedirect = FALSE, maxgene = NULL, 
adj.method = c("Single step", "Step down"), nres = 100, rot = FALSE,
rand = FALSE, quiet = TRUE, thresholdfun = NULL, 
threshold = list(NULL,NULL), ranks = FALSE, file = "", p.details = 0.05,
crit.vals = sort(unique(c(0.1,0.05,0.01,0.001,p.details)),
               decreasing=TRUE)) 

Arguments

x An n by p matrix or a data.frame with columns representing variables (features, genes) and rows representing samples. The variables (columns) have to be named.
grpdata If NULL then sets of correlated variables (genes) based on x will be created and tested for correlation with y. Otherwise, sets which are pre-defined in grpdata will be tested. Two formats of grpdata are possible. One option is a data.frame with 2 columns as written to a file with extension ".details" by this function. The first column named "groups" contains the names of the source variables of the sets and the second column named "members" contains the names of the members of the pre-defined sets. The second option is a list in the format returned by SETGEN.generate.sets or in the component grpdata in the output of this function.
y A vector representing the response variable. It should be numeric for a quantitative response or it should be a vector which can be converted to a factor with two levels, for a two class response.
resp.type Type of the response variable. "Quantitative" for a continuous variable, "Two class unpaired" for a variable with two categories.
r2min Lower bound for a squared correlation between a source variable (center) of a set and any of its other members. The parameter is used for creating sets of variables.
rposonly A logical variable. If TRUE then a variable is included in a set only if the correlation between the variable and the source variable of the set is positive. If FALSE then this correlation can be positive or negative.
direction A variable with 3 possible values indicating whether the test should be one-sided or two-sided. If "positive" then the statistics of the sets whose source variables correlate negatively with the response are set to zero (p-value equal to 1). This amounts to a one-sided test against an alternative in the positive direction of the correlation. "negative" is treated analogously. If "both" then the test is two-sided; a significant set whose source variable correlates positively (negatively) with the response is then interpreted as significant in the positive (negative) direction. If direction is "positive" or "negative" then rposonly has to be TRUE, because a one-sided test makes no sense for a set of variables some of which are anti-correlated.
pivdom A logical variable. If TRUE then sets of variables are created such that their source variables dominate other members of the sets with respect to their total sum of squares (i.e., also with respect to the sample variance). This setting reduces the number of the created sets. In the current implementation, this setting has no effect if ranks==TRUE because rank-transformed variables all have the same variance.
transform.beta A parameter with three possible values. If "none" then the individual Beta statistics of the variables will not be transformed. If "binary" then the statistics will be set to 0 or 1 depending on whether they are smaller or greater than minbeta, respectively. If "truncated" then the statistics smaller than minbeta are set to 0. These transformations are carried out in the original and in the permuted data. They amount to a modification in the definition of the statistics of the individual variables.
minbeta A value between 0 and 1 giving the threshold for the transformation of the individual Beta statistics. If transform.beta=="none" then minbeta has no effect.
samedirect A logical variable. If TRUE then the individual statistics of the member variables of a set which correlate with the response with a different sign than the source variable of the set are given the value of 0. For example, if a source variable of a set has a positive correlation with the response then the statistics of the variables of the set which correlate negatively with the response are zeroed and they do not contribute to the summary statistic of the set. This redefinition of the individual statistics is applied to the original and to the permuted data. In this way, the number of false positive results is not inflated.
maxgene The maximal number of the variables of a set contributing to the summary statistic of the set. These variables are: the source variable and maxgene-1 variables with the largest univariate Beta statistics. The statistics of the other variables are set to 0. This redefinition of the summary statistic of a set is applied to the original and to the permuted data. If NULL then this redefinition is not applied.
adj.method If "Single step" then the corresponding method by Westfall and Young (1993) is used for adjusting the p-values of the sets. "Step down" refers to the less conservative version of the method by Westfall and Young. Please note: only the "Single step" method guarantees the strong control of the familywise error rate if the sets of variables are created and tested based on the same data set. However, if the sets of variables are pre-defined based on information independent from the data in which they are tested then the "Step down" method controls the familywise error rate too.
nres The number of random resamplings (permutations) used to determine the adjusted p-values. If nres is equal to 0 then all possible permutations are carried out. However, if nres==0 and the number of samples in the data set is greater than 12 then only 1000 random permutations are computed.
rot If TRUE then rotations are used instead of permutations (see Laeuter et al. (2005)). Rotations are not possible if the data are rank-transformed with ranks = TRUE.
rand Determines whether the random number generator should be randomly initialized. If FALSE (default), repetitions of the same analysis setup lead to exactly the same results.
quiet A logical variable. Default value is TRUE. In this case, no summary information about the chosen analysis procedure is printed to the screen or to a file (if file is specified).
thresholdfun A function or a list of functions for pre-filtering the variables before creating sets of variables and testing them. The function is applied to each variable. Whether a variable is excluded is determined based on the value returned by the function and on the value of threshold. If more than one filtering function is to be used then they should be given as a list of functions. For a variable to remain in the analysis it has to pass all thresholds for all filtering functions. If grpdata is not NULL then the pre-filtering is applied to the variables from the given sets of variables. Two functions for pre-filtering of variables, sqsum (sum of the squares) and varcoefficient (variation coefficient), are included in the package.
threshold If there is only one function in thresholdfun then it is a list with 2 elements. If thresholdfun is a list of functions, then threshold is a list of lists, each with 2 elements. The two elements of each list correspond to the lower and the upper limit for the values of the corresponding function in thresholdfun. If an element of a list is NULL then there is no prespecified upper or lower limit. All limits for all functions in thresholdfun have to be met in order for the variable to remain in the analysis after the pre-filtering. The pre-filtering takes place before a possible rank-transformation so that setting ranks=TRUE does not disturb the pre-filtering.
ranks A logical variable. If TRUE the original data values are replaced by their ranks.
file The name of the output files. If given, three text files are created: file.summary, file.overview and file.details, corresponding to the components of the returned list. If file equals "" (default) the results are not saved in text files.
p.details Threshold on p-values of the sets which are to be returned in the details component. If no set has a smaller p-value then the ten sets with the smallest p-values are returned.
crit.vals Significance levels for which critical values should be computed. If a statistic of a set of variables exceeds a critical value then the set is significant on the significance level for which the critical value was computed.
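The interplay of thresholdfun and threshold described above can be sketched in base R. This is only an illustration of the documented semantics (one pair of lower/upper limits per filtering function, NULL meaning no limit), not the package's internal code. The names sq_sum_demo, passes_limits and prefilter_demo are hypothetical, and treating the limits as inclusive is an assumption.

```r
# Hypothetical stand-in for a filtering function: sum of squared deviations
# from the mean of one variable (not the package's own sqsum).
sq_sum_demo <- function(v) sum((v - mean(v))^2)

# A value passes if it lies within [lower, upper]; a NULL limit means
# "no limit on that side". Inclusive limits are an assumption.
passes_limits <- function(value, limits) {
  lower <- limits[[1]]
  upper <- limits[[2]]
  (is.null(lower) || value >= lower) && (is.null(upper) || value <= upper)
}

# Keep only the columns of x that pass the limits of every filtering function.
prefilter_demo <- function(x, funs, thresholds) {
  keep <- rep(TRUE, ncol(x))
  for (k in seq_along(funs)) {
    values <- apply(x, 2, funs[[k]])
    ok <- vapply(values, passes_limits, logical(1), limits = thresholds[[k]])
    keep <- keep & ok
  }
  x[, keep, drop = FALSE]
}

set.seed(1)
x_demo <- cbind(a = rnorm(10), b = rnorm(10, sd = 5), c = rep(1, 10))

# Two filters: a lower limit on the sum of squares, an upper limit on the sd.
filtered <- prefilter_demo(
  x_demo,
  funs = list(sq_sum_demo, stats::sd),
  thresholds = list(list(0.5, NULL), list(NULL, 10))
)
colnames(filtered)  # the constant column "c" fails the lower limit
```

With a single filtering function, the documented form of threshold is one two-element list such as list(0.5, NULL). The sketch always uses the list-of-lists form for simplicity.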

Details

The function generates sets of correlated variables and tests them for correlation with a response. The significance of the generated sets is determined by permuting (see Westfall and Young (1993) and Laeuter et al. (2009), Section 2) or rotating (Laeuter et al. 2005) the samples. Every variable in the data is treated as the center of a set of variables. Other variables are added to the set based on their correlation with the center (source variable). Then, the created sets of variables are evaluated with respect to their correlation with a response. Currently a quantitative (continuous) or a two-class response is allowed. The null hypothesis for a set is that it contains only null variables (i.e., variables which do not correlate with the response). The method controls the familywise error rate (FWER) exactly and in the strong sense, i.e., the probability that the significant sets include at least one null set is at most alpha. The novelty of the method lies in the fact that it allows the sets of variables to be created and tested based on the same data while keeping the FWER. Please note that in this case adj.method must be set to "Single step" to control the FWER. A detailed description of the method is given in Laeuter et al. (2009).
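The set-generation step described above can be sketched in base R as follows. This is a simplified illustration, not the package's algorithm: the function generate_sets_demo is hypothetical, it only models r2min and rposonly, and the pivdom rule is deliberately left out.

```r
# Illustrative only: every variable is a center; another variable joins the
# set if its squared correlation with the center is at least r2min
# (and, with rposonly = TRUE, the correlation itself is positive).
generate_sets_demo <- function(x, r2min = 0.5, rposonly = TRUE) {
  r <- stats::cor(x)
  sets <- lapply(seq_len(ncol(x)), function(j) {
    member <- r[, j]^2 >= r2min
    if (rposonly) member <- member & r[, j] > 0
    member[j] <- TRUE  # the center always belongs to its own set
    colnames(x)[member]
  })
  names(sets) <- colnames(x)
  sets
}

set.seed(100)
f <- rnorm(20)                              # one common factor
x_demo <- matrix(rnorm(20 * 6), nrow = 20)
x_demo[, 1:3] <- x_demo[, 1:3] * 0.3 + f    # v1..v3 share the factor
colnames(x_demo) <- paste0("v", 1:6)

sets <- generate_sets_demo(x_demo, r2min = 0.5)
sets[["v1"]]
```

Note that the sets are built from x alone. The response y is not used, so permuting y leaves the sets unchanged, which is what makes it possible to create and test the sets on the same data.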

A second use of the function is testing pre-defined sets of variables with respect to a response variable. This is done by specifying the sets of variables in the grpdata parameter. Then, the parameters for creating sets of variables, i.e., r2min, rposonly and pivdom have no effect. In the case of the pre-defined sets of variables, adj.method can be set to "Step down" without inflating the FWER.
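The two documented formats of grpdata can be illustrated with base R objects. The variable names below are made up, and including the source variable among its own members is an assumption. The sketch only shows how a named list of member names maps to a two-column data.frame with columns "groups" and "members" and back.

```r
# Pre-defined sets as a named list: the names are the source variables.
sets_list <- list(v1 = c("v1", "v2", "v3"), v11 = c("v11", "v12"))

# The same sets as a data.frame with the columns "groups" and "members".
sets_df <- data.frame(
  groups  = rep(names(sets_list), lengths(sets_list)),
  members = unlist(sets_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Back from the data.frame to the list form, keeping the original set order.
sets_back <- split(
  sets_df$members,
  factor(sets_df$groups, levels = unique(sets_df$groups))
)
```

Either object could then be passed as grpdata, assuming all listed names are column names of x.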

Value

A list with components

summary A list with the parameters of the analysis. This list can be printed in a readable format by the function SETGEN.print.parameters.
overview A data.frame with three columns "variable", "tval" and "pval". "variable" is the name of the source variable of a set. "tval" is the test statistic of the set. "pval" is the adjusted p-value. The number of rows of the data.frame is equal to the number of the tested sets of variables (genes). The sets are ordered by decreasing statistics.
grpdata A list of the names of the variables belonging to each set. The names of the components of the list are the names of the respective source variables.
details A list of data.frames with four columns: "groups", "members", "correlation", "sqsum". "groups" are the names of the source variables repeated as many times as there are variables in the sets. "members" are the names of the members of the corresponding sets. "correlation" is the correlation of each variable with the response variable. "sqsum" is the sum of squared deviations from the mean of each variable. The list is ordered as in overview, i.e., by decreasing statistics. Note that in the current implementation "sqsum" will be identical for all variables if ranks==TRUE since rank-transformed variables have the same variance.
crit.vals Critical values corresponding to the significance levels specified in the parameter crit.vals. If a statistic of a set of variables exceeds a critical value then the set is significant according to the "Single step" procedure on the significance level for which the critical value was computed.
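How single-step adjusted p-values and critical values relate to the permutation distribution can be sketched in base R. This is a generic max-statistic illustration in the spirit of Westfall and Young (1993), not the package's code. The summary statistic set_stat (sum of squared correlations with the response) is a hypothetical stand-in for the Beta-based statistic, and the simulated data and set definitions are made up.

```r
set.seed(1)
n <- 20
y_demo <- rep(c(0, 1), each = n / 2)
x_demo <- matrix(rnorm(n * 5), nrow = n,
                 dimnames = list(NULL, paste0("v", 1:5)))
x_demo[y_demo == 1, 1] <- x_demo[y_demo == 1, 1] + 3  # v1 carries an effect
sets <- list(v1 = c("v1", "v2"), v3 = c("v3", "v4", "v5"))

# Hypothetical summary statistic of a set: sum of squared correlations
# of its members with the response.
set_stat <- function(x, y, sets) {
  r2 <- stats::cor(x, y)[, 1]^2
  vapply(sets, function(m) sum(r2[m]), numeric(1))
}

observed <- set_stat(x_demo, y_demo, sets)

# Null distribution of the maximum statistic over all sets,
# obtained by permuting the response.
nres <- 500
max_null <- replicate(nres, max(set_stat(x_demo, sample(y_demo), sets)))

# Single-step adjusted p-value: share of permutations whose maximum
# statistic reaches the observed statistic of the set.
p_adj <- vapply(observed, function(t) mean(max_null >= t), numeric(1))

# Critical value for level alpha: the (1 - alpha) quantile of the maxima.
crit_levels <- c(0.1, 0.05, 0.01)
crit_vals <- stats::quantile(max_null, probs = 1 - crit_levels, names = FALSE)
names(crit_vals) <- crit_levels
```

In this sketch a set is significant at level alpha when its observed statistic exceeds the corresponding entry of crit_vals, mirroring the description of the crit.vals component above.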


If a file is specified the function writes the results to several files: The summary information with the parameters of the analysis is written to "file.summary" and the results are stored in "file.overview" and "file.details".

Author(s)

Rosolowski, M., Laeuter, J., Beck, M.; <maciej.rosolowski@imise.uni-leipzig.de>

References

Laeuter, J., Glimm, E., Eszlinger, M. 2005, Search for relevant sets of variables in a high-dimensional setup keeping the familywise error rate. Statistica Neerlandica Vol. 59, No. 3, pp. 298-312.

Laeuter, J. 2007, Hochdimensionale Statistik, Anwendung in der Genexpressionsanalyse (in German; English title: High Dimensional Statistics, Application to Gene Expression Analysis). Leipzig Bioinformatics Working Paper, No. 15. http://www.izbi.uni-leipzig.de/izbi/Working%20Paper/2007/WP_15_Statistik.pdf.

Laeuter, J., Horn, F., Rosolowski, M., Glimm, E. 2009, High-dimensional data analysis: Selection of variables and representation of results - Application to gene expression. Biometrical Journal.

Westfall, P.H. and Young, S.S. 1993, Resampling-based Multiple Testing. New York: Wiley & Sons, Inc.

See Also

SETGEN.generate.sets

Examples

# generate data with two subsets coming
# from one-factor models
set.seed(100)
y <- c(rep(0, 10), rep(1,10))
mu2 <- 0.5  # difference between the means of the two groups of samples
f1 <- matrix(rep(c(rnorm(10), rnorm(10, mean = mu2)), 10), ncol = 10)
f2 <- matrix( rep(c(rnorm(5), rnorm(5, mean = mu2), rnorm(5), rnorm(5, mean = mu2)), 10), ncol = 10 )
x <- matrix(rnorm(20*100), nrow=20)
x[,1:10] <- x[,1:10] + f1
x[,11:20] <- x[,11:20] + f2  
colnames(x) <- paste("v",1:100, sep="")

# looking for significant single variables (e.g., differentially expressed genes)
res1 <- SETGEN(x = x, y = y, resp.type = "Two class unpaired", pivdom = FALSE, r2min = 1, nres = 1000)
res1$overview[1:10,]

# looking for significant sets of correlated variables (genes)
res2 <- SETGEN(x = x, y = y, resp.type = "Two class unpaired", pivdom = FALSE, r2min = 0.5, nres = 1000)
res2$overview[1:10,]

# the same using SETGEN.generate.sets first
genesets <- SETGEN.generate.sets(X1 = x, r2min1 = 0.5, pivdom = FALSE)
res3 <- SETGEN(x = x, y = y, grpdata = genesets, resp.type = "Two class unpaired", nres = 1000)
res3$overview[1:10,]

[Package SETGEN version 0.1 Index]