Difference between revisions of "Gene set analysis methods: a systematic comparison"

Revision as of 13:20, 25 February 2020

1 Citation

Mathur, R., Rotroff, D., Ma, J., Shojaie, A., & Motsinger-Reif, A., Gene set analysis methods: a systematic comparison, 2018, BioData Mining, 11(1), 8.

Permanent link to the paper

2 Summary

Approaches for gene set analyses were assessed by using simulated data that were generated based on a real experimental data set.

Competitive tests (COMP) use the distribution of a reference gene set (e.g. all genes that are not in the gene set) as reference, whereas self-contained (SELF) approaches do not rely on a reference.
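The distinction between the two families can be illustrated with a minimal permutation sketch. It is not the implementation of any of the compared methods: the function names, the mean absolute group difference as gene set statistic, the toy data, and the use of randomly drawn gene sets as the competitive reference are all assumptions made for illustration.

```python
import random
from statistics import mean

def group_diff(values, labels):
    """Absolute difference of group means for one gene (label 1 = case, 0 = control)."""
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return abs(mean(cases) - mean(controls))

def self_contained_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """SELF sketch: permute sample labels and look only at genes inside the set."""
    rng = random.Random(seed)
    observed = mean(group_diff(expr[g], labels) for g in gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean(group_diff(expr[g], shuffled) for g in gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def competitive_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """COMP sketch: compare the set with randomly drawn gene sets of equal size."""
    rng = random.Random(seed)
    scores = {g: group_diff(v, labels) for g, v in expr.items()}
    genes = sorted(scores)
    observed = mean(scores[g] for g in gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        if mean(scores[g] for g in random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 20 genes, 6 cases and 6 controls,
# genes g0..g4 form the gene set and are shifted in the cases.
rng = random.Random(42)
labels = [1] * 6 + [0] * 6
gene_set = [f"g{i}" for i in range(5)]
expr = {}
for i in range(20):
    shift = 3.0 if i < 5 else 0.0
    expr[f"g{i}"] = [rng.gauss(0, 1) + shift * l for l in labels]

p_self = self_contained_p(expr, labels, gene_set)
p_comp = competitive_p(expr, labels, gene_set)
```

The self-contained variant asks whether the genes in the set are associated with the class labels at all, while the competitive variant asks whether they are more strongly associated than other genes. These are different null hypotheses.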

  • The authors compared four different methods:
    • Gene Set Enrichment Analysis (GSEA-SELF and GSEA-COMP)
    • Significance Analysis of Function and Expression (SAFE), which is based on the t-test as gene-wise test and offers the Wilcoxon rank sum test, Fisher’s Exact Test, a Pearson’s Chi-squared type statistic and a t-statistic as global (gene-set-wide) tests
    • sigPathway
    • Correlation Adjusted Mean Rank (CAMERA)


3 Study outcomes

3.1 Outcome O1: False positives under null distribution

The frequency of false positives was assessed at a significance level of alpha=0.05. As expected at this level, all approaches except FET-1k showed around 5% false positives or less. FET-1k ("FET global statistic in SAFE") had around 20%.

Outcome O1 is presented as Figure 2 in the original publication for the prostate data template and in the "Additional File 1" for the other templates.

The bottom line of this outcome is that all approaches except FET-1k perform similarly well in terms of false positives.
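How a false-positive frequency at alpha=0.05 is estimated from null simulations can be sketched as follows. This is an illustration only: the two-sided z-test with known variance, the sample sizes and the number of repetitions are assumptions and do not correspond to any of the evaluated gene set methods.

```python
import random
from statistics import NormalDist, mean

def z_test_p(a, b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means with known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(p_values, alpha=0.05):
    """Fraction of null simulations that are called significant."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Simulate data without any group difference (null) and test repeatedly.
rng = random.Random(1)
null_p = []
for _ in range(2000):
    cases = [rng.gauss(0, 1) for _ in range(10)]
    controls = [rng.gauss(0, 1) for _ in range(10)]
    null_p.append(z_test_p(cases, controls))

fpr = false_positive_rate(null_p)
```

For a well-calibrated test, `fpr` should be close to alpha. A value far above alpha, as reported for FET-1k, indicates an anti-conservative test.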

3.2 Outcome O2

  • sigPathway showed superior performance
  • SAFE-Wilcoxon could NOT detect the differentially regulated pathway(s).
  • In general, the performance increases with increasing fraction of regulated genes (parameter pi in the paper), except for "Comp GSEA Q" that shows counterintuitive performance.

Outcome O2 is presented as Figure 3 in the original publication; the numbers are provided in the supplement.

3.3 Outcome O3

  • SAFE again performs weakly for most configurations
  • Only "aveDiff-boot" seems to have good power, which improves with increasing magnitude tau of regulation
  • FET-1k and FET-10k could identify the regulated pathway but show counterintuitive performance (i.e. decreasing performance for increasing magnitudes of regulation)

Outcome O3 is presented as Figure 4 in the original publication.

3.4 Outcome O4

  • COMP-GSEA-FDR and Self-GSEA-FDR showed superior performance
  • Comp-GSEA-Q and SELF-GSEA-Q showed counterintuitive performance, i.e. the performance decreases with increasing effect size tau


4 Study design and evidence level

4.1 General aspects

  • The authors consider different sizes of the gene sets
  • The authors consider different proportions of regulated genes in the gene sets
  • The authors consider different magnitudes of the underlying effect size (i.e. log-fold-changes)
  • The authors consider three null simulations (without regulation) as reference for outcome O1
  • In this publication, the authors published a novel simulation approach termed FANGS
  • The simulation approach is available as an R package (FANGS), which offers the opportunity to reproduce the simulations and repeat the analysis for other gene set methods.
  • The authors provide a comprehensive list of the used configuration parameters
  • The authors evaluated the following alternative configurations
    • For GSEA one alternative
    • For SAFE five alternative setups
    • For sigPathway and CAMERA no other configurations were considered
  • Three experimental data sets were used as foundations for simulating data
    • prostate cancer (264 cases, 160 controls)
    • ischemic stroke (20 cases, 20 controls)
    • normal brain tissue (21 cases, 20 controls)

4.2 Design for Outcome O1

  • The authors consider three null simulations (without regulation) as reference:
    • permutation of class labels
    • independently sampled expression of all features (=genes)
    • centering the simulated data, i.e. set effect size to zero
  • Default configuration parameters and the alternative parameters described above were evaluated
  • Only the prostate cancer data set was considered as template for simulations
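The three null constructions listed above can be sketched as follows. This is one possible reading of the descriptions and not the FANGS implementation: the function names, the dictionary-based data layout, the per-gene shuffling used for "independent sampling", and the per-class centering are assumptions.

```python
import random
from statistics import mean

def permute_labels(labels, rng):
    """Null 1: shuffle the class labels, leaving the expression data untouched."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def sample_genes_independently(expr, rng):
    """Null 2 (assumed reading): reorder each gene's values separately,
    which removes gene-gene correlation and any class association."""
    return {g: rng.sample(values, len(values)) for g, values in expr.items()}

def center_by_class(expr, labels):
    """Null 3 (assumed reading): subtract the class mean per gene,
    so that the difference between the classes is zero."""
    centered = {}
    for g, values in expr.items():
        class_mean = {
            c: mean(v for v, l in zip(values, labels) if l == c)
            for c in set(labels)
        }
        centered[g] = [v - class_mean[l] for v, l in zip(values, labels)]
    return centered

# Hypothetical toy data: 3 genes, 3 cases (1) and 3 controls (0).
rng = random.Random(7)
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g0": [2.1, 1.9, 2.4, 0.2, -0.1, 0.3],
    "g1": [0.5, 0.1, -0.3, 0.4, 0.0, -0.2],
    "g2": [1.2, 1.0, 1.5, 1.1, 0.9, 1.3],
}

null_labels = permute_labels(labels, rng)
null_independent = sample_genes_independently(expr, rng)
null_centered = center_by_class(expr, labels)
```

All three variants remove the association between expression and class, but only the second one also removes the correlation between genes.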

4.3 Design for Outcome O2

  • The outcome was generated by simulating differential expression of one pathway
  • The analysis was repeated for all three data sets as template
  • For each of the three data sets the analysis was repeated by selecting two different pathways as differentially regulated.
  • In total, six analyses were performed (3 data sets x 2 regulated pathways)
  • Default configuration parameters were chosen
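Simulating differential expression of a single pathway can be illustrated with a simple spike-in sketch that uses the two parameters named in the outcomes, the fraction pi of regulated genes and the magnitude tau of regulation. It is not the FANGS procedure: the function `spike_in`, the additive shift on the cases and the toy data are hypothetical.

```python
import random

def spike_in(expr, labels, gene_set, pi, tau, seed=0):
    """Add a shift tau to the cases (label 1) for a fraction pi of the genes in gene_set."""
    rng = random.Random(seed)
    n_regulated = round(pi * len(gene_set))
    regulated = set(rng.sample(sorted(gene_set), n_regulated))
    simulated = {}
    for g, values in expr.items():
        shift = tau if g in regulated else 0.0
        # Controls (label 0) are never shifted.
        simulated[g] = [v + shift * l for v, l in zip(values, labels)]
    return simulated, regulated

# Hypothetical toy data: 8 genes without signal, 3 cases and 3 controls.
labels = [1, 1, 1, 0, 0, 0]
expr = {f"g{i}": [0.0] * 6 for i in range(8)}
pathway = ["g0", "g1", "g2", "g3"]

simulated, regulated = spike_in(expr, labels, pathway, pi=0.5, tau=1.5)
```

Repeating such a simulation over a grid of pi and tau values and recording how often a method flags the pathway gives power curves of the kind reported for outcomes O2 to O4.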

4.4 Design for Outcome O3

  • The weak performance of SAFE for the default configuration in O2 seems to be the motivation for investigation of other configurations for SAFE
  • The outcome O3 was only generated for one data set (prostate cancer) and two regulated pathways

5 Further comments and aspects

  • Gene sets from MSigDB were used
  • The authors are aware of the fact that different null hypotheses are tested by the different approaches
  • sigPathway and CAMERA offer other options that are discussed in the article but not evaluated