A general modular framework for gene set enrichment analysis
Latest revision as of 15:40, 25 February 2020
1 Citation
M Ackermann and K Strimmer, A general modular framework for gene set enrichment analysis, 2009, BMC Bioinformatics, 10:47.
2 Summary
Gene set analyses have a modular structure, i.e. they consist of
- gene level statistics
- gene level significance assessment
- gene set statistics
- gene set significance assessment
- statistical conclusion
Alternatively, the first three steps might be replaced by a single global test.
In this paper, 261 different variants of gene set enrichment procedures were evaluated based on simulated and experimental data.
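The modular structure can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes an ordinary pooled-variance t statistic at the gene level, a square transformation, the mean as gene set statistic, and gene resampling for significance. The function names and the data layout (one list of expression values per gene, one 0/1 label per sample) are hypothetical.

```python
import random
import statistics


def t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / (pooled * (1 / na + 1 / nb)) ** 0.5


def gene_set_p_value(expr, labels, gene_set, n_resamples=1000, seed=0):
    """expr: one list of values per gene; labels: 0/1 per sample;
    gene_set: indices of the genes that form the set of interest."""
    rng = random.Random(seed)

    # Module 1: gene-level statistic (ordinary t, an assumption of this sketch)
    stats = []
    for values in expr:
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        stats.append(t_statistic(a, b))

    # Module 2: transformation of the gene-level statistic (square)
    transformed = [s * s for s in stats]

    # Module 3: gene set statistic (mean over the genes in the set)
    observed = statistics.mean(transformed[g] for g in gene_set)

    # Module 4: significance assessment by gene resampling:
    # compare against random gene sets of the same size
    hits = 0
    for _ in range(n_resamples):
        random_set = rng.sample(range(len(expr)), len(gene_set))
        if statistics.mean(transformed[g] for g in random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)
```

Swapping any single module (for example the median instead of the mean in module 3) leaves the other steps untouched, which is the sense in which the variants evaluated in the paper are combinations of modules.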
3 Study outcomes
3.1 Outcome O1: Gene level statistics
- The choice of the gene-level statistics (t, moderated t, or correlation) does NOT have a great impact
- t statistic, moderated t, and correlation fail to find gene sets that contain up- and downregulated genes
Outcomes O1 and O2 are presented as Table 2 in the original publication.
3.2 Outcome O2: Transformation of the gene level statistics
- The transformation of the gene level statistic has a substantial impact
- Transformations help to find gene sets that contain up- and downregulated genes
- Combination of square transformation and rank transformation shows the best overall performance
- Binary transformation (i.e. using a cutpoint) and FDRs decrease the performance
Outcomes O1 and O2 are presented as Table 2 in the original publication.
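Why a transformation helps for sets with up- and downregulated genes can be seen in a toy calculation. The values below are made-up gene-level statistics, and the helper names are hypothetical; the rank function is a simple sketch that breaks ties by position.

```python
def mean_statistic(values):
    """Mean of the (possibly transformed) gene-level statistics."""
    return sum(values) / len(values)


def rank_transform(values):
    """Rank from 1 (smallest) to n (largest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical gene-level statistics of a set with two up- and two downregulated genes
mixed_set = [3.0, 2.5, -3.0, -2.5]

untransformed = mean_statistic(mixed_set)               # positive and negative values cancel
squared = mean_statistic([v * v for v in mixed_set])    # signs are removed before averaging

# Square followed by rank transformation, applied to a small list of genes
squared_ranks = rank_transform([v * v for v in [0.5, -3.0, 1.0]])
```

Here `untransformed` is 0.0 although every gene in the set is strongly regulated, whereas `squared` is 7.625.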
3.3 Outcome O3: Gene set statistics
- "mean and the maxmean statistic produce ... overall very good results"
- "median and the Wilcoxon test are primarily advantageous if the competitive null hypothesis is tested, or if there are many outliers in the data"
- "conditional FDR ... vary strongly with the choice of the gene-level statistic, transformation and permutation approach."
- The ES score showed a rather weak performance
Outcome O3 is presented as Table 3 in the original publication.
3.4 Outcome O4: Significance assessment
- The parametric approach has the best power but is overoptimistic if the assumption of statistical independence is violated
- Permutation seems to slightly outperform resampling
- "restandardization procedure performs very similar to resampling"
Outcome O4 is presented as Table 4 in the original publication.
3.5 Outcome O5: Global approaches
- The performance of the globaltest procedure "is not better than that of the less sophisticated univariate methods" but "is computationally a little bit faster".
- For Hotelling's T2-test:
- an "overall poor" performance was obtained
- "the uncorrelated sets are found with the same reliability as with univariate approaches. However, ... the sets with correlation ... are hardly detected."
- shows "improved performance with sample label permutation as opposed to gene sampling."
Outcome O5 is presented in Table 5 for the global test and in Table 6 for Hotelling's T2-test in the original publication.
3.6 Further outcomes
4 Study design and evidence level
4.1 General aspects
- 100 data sets were simulated
- The simulated data sets have 600 features (genes) and 20 samples (10 vs. 10)
- The data were simulated with normally distributed noise with variance equal to one
- 520 genes were considered uninformative (delta=0, rho=0)
- Altogether, nine different simulation data sets were generated that consist of the following combinations:
- Gene sets with different levels of differential expression (delta \in {0, 0.75, 1, -1}) were simulated
- Gene sets with varying levels of intra-group correlation (rho \in {0, 0.6, -0.6}) were simulated
- Gene sets that contain regulated and unregulated genes (half/half) were generated as well as gene sets that contain up- and downregulated genes.
- "The gene set statistic ES was not combined with a binary transformation since the latter does not allow a sensible ranking of the genes."
- In total
- 3 gene level statistics ×
- 5 transformations ×
- 6 gene set statistics ×
- 3 significance assessments
- minus 9 non-sensible combinations
- = 261 (in total) variants of gene set analyses were considered
- The authors count how frequently the p-values that assess significance at the gene-set level fall below a significance level of 0.05
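The design above can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the shared-factor construction only supports rho >= 0 (the negative correlation rho = -0.6 would need a different construction), the gene set size in the usage note is arbitrary, and the labels "T1" to "T4" and "S1" to "S3" are placeholders for transformations and significance assessments. Reading the 9 excluded combinations as ES with binary transformation (3 gene level statistics × 3 significance assessments) is an inference from the quoted sentence that happens to reproduce the total of 261.

```python
import itertools
import random


def simulate_gene_set(n_genes, delta, rho, n_per_group=10, rng=None):
    """Return n_genes rows with 2 * n_per_group values each (unit variance).
    The second group of samples is shifted by delta; genes share a common
    factor so that the pairwise correlation between genes is rho (rho >= 0)."""
    if not 0 <= rho < 1:
        raise ValueError("this sketch only supports 0 <= rho < 1")
    rng = rng or random.Random()
    rows = [[] for _ in range(n_genes)]
    for sample in range(2 * n_per_group):
        shared = rng.gauss(0, 1)  # factor common to all genes in this sample
        shift = delta if sample >= n_per_group else 0.0
        for row in rows:
            noise = rng.gauss(0, 1)
            row.append(shift + rho ** 0.5 * shared + (1 - rho) ** 0.5 * noise)
    return rows


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated data sets in which the gene set p-value is below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Enumerate the analysis variants: 3 x 5 x 6 x 3 combinations minus the excluded ones
gene_level = ["t", "moderated t", "correlation"]
transformations = ["T1", "T2", "T3", "T4", "binary"]   # placeholder labels
gene_set_stats = ["mean", "maxmean", "median", "ES", "conditional FDR", "Wilcoxon"]
significance = ["S1", "S2", "S3"]                       # placeholder labels

variants = [
    combo
    for combo in itertools.product(gene_level, transformations, gene_set_stats, significance)
    if not (combo[1] == "binary" and combo[2] == "ES")  # assumed exclusion rule
]
n_variants = len(variants)  # 270 - 9 = 261
```

For example, `simulate_gene_set(520, delta=0.0, rho=0.0)` would generate the uninformative genes, and `simulate_gene_set(20, delta=1.0, rho=0.6)` one correlated, differentially expressed set of a hypothetical size of 20 genes.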
4.2 Design for Outcome O1: Gene level statistics
- The authors consider the impact of the selected approach for module 1 (see summary above)
- Three approaches were considered: t, moderated t and correlation
- These approaches were evaluated for five different transformations (see O2)
- Multiple other approaches
- The authors already provide the important hint that the dependency on the gene level test statistic might be more relevant for smaller sample size (e.g. 3 vs 3)
4.3 Design for Outcome O2: Transformation of the gene level statistics
- The outcome was generated for five different transformations (and three gene level statistics)
4.4 Design for Outcome O3: Gene set statistics
- Six gene set statistics were investigated:
- mean
- maxmean
- median
- ES
- conditional FDR
- Wilcoxon
- These analyses were performed for the moderated t statistic (gene level) and by using the quadratic transformation. For significance assessment, resampling was applied.
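Three of the listed gene set statistics are simple enough to write down directly. The sketch below uses the common definition of maxmean (the larger of the averaged positive parts and the averaged absolute negative parts, both divided by the full set size); whether this matches the paper's implementation in every detail is an assumption. The input values are made up.

```python
import statistics


def mean_stat(z):
    """Mean of the gene-level statistics in the set."""
    return statistics.mean(z)


def median_stat(z):
    """Median of the gene-level statistics in the set (robust to outliers)."""
    return statistics.median(z)


def maxmean_stat(z):
    """Average the positive and the negative parts separately over all genes
    in the set and return the larger of the two magnitudes."""
    positive_part = sum(v for v in z if v > 0) / len(z)
    negative_part = -sum(v for v in z if v < 0) / len(z)
    return max(positive_part, negative_part)


z = [2.0, 1.0, -0.5, -4.0]          # hypothetical gene-level statistics
z_squared = [v * v for v in z]      # after the quadratic transformation
```

For `z`, the mean is -0.375, the median 0.25 and maxmean 1.125. After a quadratic transformation all values are non-negative, so maxmean and mean coincide (5.3125 for `z_squared`).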
4.5 Design for Outcome O4: Significance assessment
- Four different approaches for assessing significance at the gene set level were evaluated:
- parametric
- resampling
- permutation
- restandardization
- This analysis was performed by using the moderated t as the gene level statistic in combination with a quadratic transformation and the mean as the gene set statistic
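In contrast to gene resampling, which compares the set against random gene sets of the same size, permutation shuffles the sample labels and recomputes the statistic, which keeps the correlation between genes intact. A minimal sketch follows; it substitutes the ordinary pooled-variance t statistic for the moderated t, and all function names are hypothetical.

```python
import random
import statistics


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (stand-in for the moderated t)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def set_score(expr, labels, gene_set):
    """Mean of the squared t statistics over the genes in the set."""
    scores = []
    for g in gene_set:
        a = [v for v, lab in zip(expr[g], labels) if lab == 0]
        b = [v for v, lab in zip(expr[g], labels) if lab == 1]
        scores.append(pooled_t(a, b) ** 2)
    return statistics.mean(scores)


def permutation_p_value(expr, labels, gene_set, n_perm=500, seed=0):
    """Sample label permutation: shuffle labels, recompute the set score."""
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_score(expr, shuffled, gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because only the labels are shuffled, the null distribution is built from the genes of the set itself rather than from other genes in the data.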
4.6 Design for Outcome O5: Global approaches
- globaltest and Hotelling's T2-test with a shrinkage covariance matrix were considered
5 Further comments and aspects
- Simulation is NOT based on characteristics or gene sets derived from real data
- The paper provides very comprehensive outcomes in terms of combinations of approaches
- After the paper was published, another type of gene set statistic appeared that is based on the Kolmogorov-Smirnov test. This approach is applied, e.g., in GSEA.