A general modular framework for gene set enrichment analysis

Latest revision as of 15:40, 25 February 2020

1 Citation

M Ackermann and K Strimmer, A general modular framework for gene set enrichment analysis, 2009, BMC Bioinformatics, 10:47.

Permanent link to the paper: https://doi:10.1186/1471-2105-10-4

2 Summary

Gene set analyses have a modular structure, i.e. they consist of

  1. gene level statistics
  2. gene level significance assessment
  3. gene set statistics
  4. gene set significance assessment
  5. statistical conclusion

Alternatively, steps 1.-3. might be replaced by a single global test.

In this paper, 261 different variants of gene set enrichment procedures were evaluated based on simulated and experimental data.
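
The modular structure described above can be made concrete with a small sketch. The following Python code is not taken from the paper; it is a minimal illustration that assumes a Welch t statistic at the gene level, a square transformation, the mean as the gene set statistic and gene resampling for the significance assessment. The function names (`welch_t`, `run_pipeline`) and the toy data are hypothetical.

```python
import random
import statistics


def welch_t(x, y):
    """Gene-level statistic: two-sample t statistic (unequal variances)."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def run_pipeline(group_a, group_b, gene_set, gene_stat, transform, set_stat,
                 n_resamples=500, seed=0):
    """group_a and group_b hold one list of expression values per gene."""
    rng = random.Random(seed)
    # Gene level: one transformed statistic per gene.
    scores = [transform(gene_stat(a, b)) for a, b in zip(group_a, group_b)]
    # Gene set level: summarize the scores of the genes in the set.
    observed = set_stat([scores[i] for i in gene_set])
    # Significance: compare against random gene sets of the same size.
    null = [set_stat(rng.sample(scores, len(gene_set))) for _ in range(n_resamples)]
    p_value = (1 + sum(s >= observed for s in null)) / (n_resamples + 1)
    return observed, p_value


# Toy data (made up): 60 genes, 10 vs. 10 samples, the first 10 genes shifted by 1.
rng = random.Random(1)
n_genes, n_per_group = 60, 10
group_a = [[rng.gauss(1.0 if g < 10 else 0.0, 1.0) for _ in range(n_per_group)]
           for g in range(n_genes)]
group_b = [[rng.gauss(0.0, 1.0) for _ in range(n_per_group)] for _ in range(n_genes)]

observed, p_value = run_pipeline(group_a, group_b, gene_set=range(10),
                                 gene_stat=welch_t, transform=lambda t: t * t,
                                 set_stat=statistics.mean)
# Statistical conclusion: compare the p-value with the chosen significance level.
print(f"set statistic = {observed:.2f}, p = {p_value:.3f}")
```

Because each module is passed in as a function, swapping one choice (for example the median instead of the mean) does not require touching the other steps, which is the property that makes it possible to enumerate many variants.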

3 Study outcomes

3.1 Outcome O1: Gene level statistics

  • The choice of the gene-level statistics (t, moderated t, or correlation) does NOT have a great impact
  • t statistic, moderated t, and correlation fail to find gene sets that contain up- and downregulated genes

Outcomes O1 and O2 are presented as Table 2 in the original publication.

3.2 Outcome O2: Transformation of the gene level statistics

  • The transformation of the gene level statistic has a substantial impact
  • Transformations help to find gene sets that contain up- and downregulated genes
  • Combination of square transformation and rank transformation shows the best overall performance
  • Binary transformation (i.e. using a cutpoint) and FDRs decrease the performance
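
A small numerical illustration may help to see why a transformation matters for gene sets that contain both up- and downregulated genes: signed statistics cancel when averaged, whereas squared statistics do not. The values below are made up, and applying the binary cutpoint to the absolute value of the statistic is an assumption of this sketch, not a detail taken from the paper.

```python
import statistics

# Made-up gene-level statistics for one gene set: two genes up, two genes down.
gene_stats = [2.5, -2.5, 3.0, -3.0]


def identity(t):
    return t


def square(t):
    return t * t


def binary(t, cutpoint=2.0):
    # Assumption: the cutpoint is applied to the absolute value of the statistic.
    return 1.0 if abs(t) > cutpoint else 0.0


# Mean of the (transformed) statistics over the gene set, per transformation.
results = {
    transform.__name__: statistics.mean([transform(t) for t in gene_stats])
    for transform in (identity, square, binary)
}
for name, value in results.items():
    print(f"{name:>8}: mean over the set = {value}")
```

Without a transformation the set mean is zero although every gene in the set is strongly regulated. The binary version keeps only the information that a gene passed the cutpoint and discards how strongly it is regulated.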

Outcomes O1 and O2 are presented as Table 2 in the original publication.

3.3 Outcome O3: Gene set statistics

  • "mean and the maxmean statistic produce ... overall very good results"
  • "median and the Wilcoxon test are primarily advantageous if the competitive null hypothesis is tested, or if there are many outliers in the data"
  • "conditional FDR ... vary strongly with the choice of the gene-level statistic, transformation and permutation approach."
  • The ES score showed a rather weak performance

Outcome O3 is presented as Table 3 in the original publication.

3.4 Outcome O4: Significance assessment

  • The parametric approach has the best power but is overoptimistic if the assumption of statistical independence is violated
  • Permutation seems to slightly outperform resampling
  • "restandardization procedure performs very similar to resampling"

Outcome O4 is presented as Table 4 in the original publication.

3.5 Outcome O5: Global approaches

  • The performance of the globaltest procedure "is not better than that of the less sophisticated univariate methods" but "is computationally a little bit faster".
  • For Hotelling's T2-test:
    • an "overall poor" performance was obtained
    • "the uncorrelated sets are found with the same reliability as with univariate approaches. However, ... the sets with correlation ... are hardly detected."
    • shows "improved performance with sample label permutation as opposed to gene sampling."

Outcome O5 is presented in Table 5 for the global test and in Table 6 for Hotelling's T2 in the original publication.

3.6 Further outcomes

4 Study design and evidence level

4.1 General aspects

  • 100 data sets were simulated
  • The simulated data sets have 600 features (genes) and 20 samples (10 vs. 10)
  • The data were simulated with normally distributed noise with variance equal to one
  • 520 genes were considered uninformative (delta=0, rho=0)
  • Altogether, nine different simulation data sets were generated that consist of the following combinations:
    • Gene sets with different levels of differential expression (delta ∈ {0, 0.75, 1, -1}) were simulated
    • Gene sets with varying levels of intra-group correlation (rho ∈ {0, 0.6, -0.6}) were simulated
    • Gene sets that contain regulated and unregulated genes (half/half) were generated as well as gene sets that contain up- and downregulated genes.
  • "The gene set statistic ES was not combined with a binary transformation since the latter does not allow a sensible ranking of the genes."
  • In total
    • 3 gene level statistics ×
    • 5 transformations ×
    • 6 gene set statistics ×
    • 3 significance assessments
    • minus 9 non-sensible combinations
    • = 261 variants of gene set analyses were considered
  • The authors count how frequently the p-values that assess significance at the gene-set level are below a significance level of 0.05
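
The count of 261 variants can be reproduced by enumerating the combinations and dropping those that pair the ES statistic with the binary transformation. In the sketch below only the labels named on this page are meaningful; `other_1`, `other_2` and the three `assessment_*` labels are placeholders, since only their number enters the count.

```python
from itertools import product

gene_level = ["t", "moderated t", "correlation"]
# "binary", "square" and "rank" are named on this page; the rest are placeholders.
transformations = ["binary", "square", "rank", "other_1", "other_2"]
gene_set = ["mean", "maxmean", "median", "ES", "conditional FDR", "Wilcoxon"]
# Placeholder labels: the page does not say which three assessments enter the count.
significance = ["assessment_1", "assessment_2", "assessment_3"]

all_combinations = list(product(gene_level, transformations, gene_set, significance))

# ES is not combined with the binary transformation.
variants = [
    combo for combo in all_combinations
    if not (combo[1] == "binary" and combo[2] == "ES")
]

excluded = len(all_combinations) - len(variants)
print(len(all_combinations), excluded, len(variants))
```

The 9 excluded combinations are the 3 gene level statistics times the 3 significance assessments for the single pair (binary, ES).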


4.2 Design for Outcome O1: Gene level statistics

  • The authors consider the impact of the approach selected for module 1 (see summary above)
  • Three approaches were considered: t, moderated t and correlation
  • These approaches were evaluated for five different transformations (see O2)
  • Multiple other approaches
  • The authors note that the choice of the gene level test statistic might matter more for smaller sample sizes (e.g. 3 vs. 3)
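
One observation that may help to interpret the small impact of the gene level statistic: for a two-group comparison, the pooled-variance t statistic and the Pearson correlation r between expression and group label are monotone functions of each other, t = r · sqrt(df / (1 − r²)) with df = n − 2. The sketch below checks this identity numerically. It assumes a pooled-variance t statistic (this page does not state which t variant the paper used), it leaves out the moderated t because that requires a variance estimate shared across genes, and the data values and function names are made up.

```python
import math
import statistics


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        pooled_var * (1 / nx + 1 / ny))


def label_correlation(x, y):
    """Pearson correlation between expression values and a 1/0 group label."""
    values = list(x) + list(y)
    labels = [1.0] * len(x) + [0.0] * len(y)
    mean_v, mean_l = statistics.mean(values), statistics.mean(labels)
    cov = sum((v - mean_v) * (lab - mean_l) for v, lab in zip(values, labels))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_l = sum((lab - mean_l) ** 2 for lab in labels)
    return cov / math.sqrt(var_v * var_l)


# Made-up expression values of one gene in two groups of four samples.
x = [1.2, 0.4, 2.1, 1.7]
y = [0.3, -0.5, 0.9, 0.1]

t = pooled_t(x, y)
r = label_correlation(x, y)
df = len(x) + len(y) - 2
t_from_r = r * math.sqrt(df / (1 - r * r))
print(f"t = {t:.4f}, r = {r:.4f}, t reconstructed from r = {t_from_r:.4f}")
```

Since the mapping between t and r is monotone, both statistics rank genes identically within one data set; differences between them can only arise through later steps that depend on the scale of the statistic.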

4.3 Design for Outcome O2: Transformation of the gene level statistics

  • The outcome was generated for five different transformations (and three gene level statistics)

4.4 Design for Outcome O3: Gene set statistics

  • Six gene set statistics were investigated:
    • mean
    • maxmean
    • median
    • ES
    • conditional FDR
    • Wilcoxon
  • These analyses were performed using the moderated t statistic (gene level) and the quadratic transformation. For significance assessment, resampling was applied.
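
Three of the listed gene set statistics are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: `maxmean` follows the commonly used definition (the larger of the mean positive part and the mean negative part, both averaged over all genes in the set), the scores are made up, and ES, conditional FDR and the Wilcoxon statistic are omitted.

```python
import statistics


def set_mean(scores):
    return statistics.mean(scores)


def set_median(scores):
    return statistics.median(scores)


def maxmean(scores):
    """Larger of the mean positive part and the mean negative part.

    Both parts are averaged over all genes in the set, not only over the
    genes with the respective sign.
    """
    positive_part = statistics.mean([max(s, 0.0) for s in scores])
    negative_part = statistics.mean([max(-s, 0.0) for s in scores])
    return max(positive_part, negative_part)


# Made-up signed gene-level scores for one gene set.
scores = [2.0, -1.0, 3.0, -0.5]
print(set_mean(scores), set_median(scores), maxmean(scores))

# After a quadratic transformation all scores are non-negative,
# so maxmean and mean coincide.
squared = [s * s for s in scores]
print(set_mean(squared), maxmean(squared))
```

The second call shows that with squared scores the negative part is always zero, so differences between mean and maxmean can only show up for signed scores.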

4.5 Design for Outcome O4: Significance assessment

  • Four different approaches for assessing significance at the gene set level were evaluated:
    • parametric
    • resampling
    • permutation
    • restandardization
  • This analysis was performed by using the moderated t as the gene level statistic in combination with a quadratic transformation and the mean as the gene set statistic
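
The difference between sample label permutation and gene resampling lies in what is randomized to build the null distribution. The following sketch contrasts the two; it is a simplified illustration in which the squared difference of group means stands in for the squared moderated t, and all function names and the simulated data are hypothetical.

```python
import random
import statistics


def gene_score(values, labels):
    """Squared difference of group means (stand-in for a squared moderated t)."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return (statistics.mean(a) - statistics.mean(b)) ** 2


def set_statistic(rows, labels):
    """Mean of the gene scores over the rows (genes) of a set."""
    return statistics.mean([gene_score(r, labels) for r in rows])


def permutation_p(rows, set_idx, labels, n_perm=200, seed=0):
    """Null distribution from shuffled sample labels; the gene set stays fixed."""
    rng = random.Random(seed)
    set_rows = [rows[i] for i in set_idx]
    observed = set_statistic(set_rows, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += set_statistic(set_rows, shuffled) >= observed
    return (1 + hits) / (n_perm + 1)


def resampling_p(rows, set_idx, labels, n_resamples=200, seed=0):
    """Null distribution from random gene sets; the sample labels stay fixed."""
    rng = random.Random(seed)
    scores = [gene_score(r, labels) for r in rows]
    observed = statistics.mean([scores[i] for i in set_idx])
    hits = 0
    for _ in range(n_resamples):
        hits += statistics.mean(rng.sample(scores, len(set_idx))) >= observed
    return (1 + hits) / (n_resamples + 1)


# Simulated toy data: 40 genes, 10 vs. 10 samples, first 8 genes shifted in group 1.
rng = random.Random(42)
labels = [1] * 10 + [0] * 10
rows = [[rng.gauss(1.0 if (g < 8 and lab == 1) else 0.0, 1.0) for lab in labels]
        for g in range(40)]
p_perm = permutation_p(rows, range(8), labels)
p_resample = resampling_p(rows, range(8), labels)
print(f"permutation p = {p_perm:.3f}, resampling p = {p_resample:.3f}")
```

Shuffling sample labels keeps the correlation between genes intact, whereas drawing random gene sets compares the set with the remaining genes on the array; this is one reason why the two approaches can disagree for correlated gene sets.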

4.6 Design for Outcome O5: Global approaches

  • globaltest and Hotelling's T2-test with a shrinkage covariance matrix were considered

5 Further comments and aspects

  • Simulation is NOT based on characteristics or gene sets derived from real data
  • The paper provides very comprehensive outcomes in terms of combinations of approaches
  • After the paper was published, another type of gene set statistic appeared that is based on the Kolmogorov-Smirnov test. This approach is applied, e.g., in GSEA.

6 References