Revision as of 09:06, 23 January 2019

1 DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts

Catoni M, Tsang JM, Greco AP, Zabet NR. DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts. Nucleic Acids Res. 2018 Nov 2;46(19):e114.

Permanent link to the paper


1.1 Summary

The paper considers the identification of differentially methylated regions (DMRs) from bisulfite sequencing data (BSSEQ) and introduces a new package for this purpose. The package offers three approaches for combining data from individual cytosines: testing individual cytosines, pooling over tiling intervals (bins), or smoothing the data. Several test statistics can be chosen (Fisher's exact test, score test, beta regression). Differential positions/bins/regions are selected based on three requirements: p-value < threshold, methylation difference > threshold, coverage > threshold.
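The three selection requirements can be illustrated with a minimal Python sketch for a single position or pooled bin, using Fisher's exact test on methylated versus unmethylated read counts. This is not the package's R implementation: the function names and the default thresholds (`p_max`, `min_diff`, `min_reads`) are assumptions chosen for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as likely as the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def is_differential(meth1, total1, meth2, total2,
                    p_max=0.01, min_diff=0.4, min_reads=4):
    """Apply the three requirements: coverage, methylation difference, p-value.

    The threshold defaults are hypothetical, not values from the paper.
    """
    if min(total1, total2) < max(min_reads, 1):
        return False
    diff = abs(meth1 / total1 - meth2 / total2)
    p = fisher_exact_two_sided(meth1, total1 - meth1, meth2, total2 - meth2)
    return p < p_max and diff > min_diff
```

A bin with 18 of 20 methylated reads in one sample and 2 of 20 in the other passes all three filters, whereas a bin with only three reads per sample is rejected by the coverage requirement regardless of its methylation difference.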

Publicly available data from Arabidopsis thaliana, rice and human cell lines were analyzed.

The performance of the presented approach was assessed by comparing with three other methods (methylKit, methylSig and methylPipe).


1.2 Study outcomes

1.2.1 Outcome O1

The performance of DMRcaller-B (B denotes "binning") and DMRcaller-NF (NF denotes "noise filter", i.e. smoothing) is superior to that of methylSig and methylKit. DMRcaller-NF exhibits the best overall performance.

Outcome O1 is presented as Figure 2 in the original publication.

1.2.2 Outcome O2

A "scrambled data set" was generated by random permutation of the genomic positions of the individual CpGs in the IMR90 versus H1 comparison. The difference between the DMR genome coverage calculated for real and for scrambled data was used to assess the accuracy.

Outcome O2 is presented as Figure 3 in the original publication.
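The accuracy measure described for O2 can be sketched as follows. The sketch assumes DMRs on a single chromosome given as half-open `(start, end)` intervals and normalises by a user-supplied genome length; both the representation and the function names are assumptions, not taken from the paper.

```python
def covered_bases(regions):
    """Number of bases covered by half-open (start, end) intervals.

    Overlapping or touching intervals are merged so no base counts twice.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(regions):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_difference(real_dmrs, scrambled_dmrs, genome_length):
    """Fraction of the genome covered by DMRs in real minus scrambled data."""
    return (covered_bases(real_dmrs) - covered_bases(scrambled_dmrs)) / genome_length
```

Under this measure, a method that calls many regions in scrambled data is penalised, because those calls are subtracted from its coverage in the real data.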

1.2.3 Outcome O3

For non-CpG context, the following results were obtained:

  • For CpHpG, DMRcaller-NF and DMRcaller-B are superior to methylSig and methylKit.
  • For CpHpH, DMRcaller-NF performs very weakly. In this context, methylSig or DMRcaller-B is superior (depending on the window size).

Outcome O3 is presented as Figure 3 in the original publication.


1.2.4 Outcome O4

  • As validation, publicly available BSSEQ data from rice comparing endosperm to embryo were used to evaluate the DMRs predicted by DMRcaller-B against gene annotation.
  • DMRcaller-B outperformed methylKit and methylSig in both the total number of DMRs called and the proportion of endosperm-preferred genes overlapping DMRs.

Outcome O4 is presented as Figure 6 and Supplementary Figure S8 in the original publication.


1.2.5 Further outcomes

  • Boxplots of the run-times of DMRcaller-B, DMRcaller-NF, methylKit, methylSig and methylPipe are provided. methylKit and methylSig are fastest.
  • The difference between the best method (DMRcaller-NF) and the other methods is mainly the length of the predicted DMRs, i.e. in most cases the same regions were found but DMRcaller-NF predicts larger intervals as DMRs. DMRs that are predicted only by DMRcaller-NF are short, indicating an enhanced ability of DMRcaller-NF to identify short DMRs.
  • For the comparison H1 vs. IMR90, the overlap between the DMR predictions of the different methods was between 62% and 78% (Supplementary Figure S7A).
  • methylPipe could only partly be applied because it "cannot call DMRs that contain less than five differentially methylated CpGs"; it could therefore not be applied to scrambled data. The overlap of methylPipe with the other methods is rather small and is shown in Supplementary Figure S7.


1.3 Study design and evidence level

1.3.1 General aspects

Smoothing approaches are based on the assumption that neighboring methylation sites have correlated methylation levels. The paper provides observed correlations for CpG, CpHpG and CpHpH contexts. Methylation of CpG is correlated with ... Methylation of CpHpG is weakly correlated (xxx) and methylation at CpHpH sites has a very low correlation (xx), suggesting that smoothing might yield weak performance for the CpHpH context.
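The assumption behind smoothing can be checked on a data set by correlating each site's methylation level with that of the next site along the chromosome. The following Python sketch does this with a plain Pearson correlation. It ignores the genomic distance between neighbours, which is a simplification, and it is not necessarily how the paper computed its correlations.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (non-constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def neighbour_correlation(levels):
    """Correlation between each site and the following site.

    levels: methylation proportions ordered by genomic position.
    """
    return pearson(levels[:-1], levels[1:])
```

A value near 1 supports smoothing over neighbouring sites, while a value near 0 means that averaging neighbours mostly blurs the signal, which matches the concern raised for the CpHpH context.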

The study designs for the specific outcomes (O1-O3) are listed in the following subsections:

1.3.2 Design for Outcome O1

  • Wildtype was compared with methyltransferase knockouts, assuming that the knockouts lead to methylation differences. The authors conclude that all predicted DMRs in the CpG context are therefore "true positives". This seems a very questionable assumption, because unmethylated regions remain unmethylated.
  • The authors assess performance in terms of genome coverage of the predicted DMRs, i.e. in terms of the number and size of the predicted DMRs.
  • Because the authors assume that there are no regions with the same methylation level, the performance measure does not account for false positives: an increasing number of predicted DMRs always increases the measured performance.
  • The dependency on the bin size or on the width of the smoothing window is evaluated by plotting coverage against window/bin size.
  • The predicted DMRs depend on the choice of thresholds and configuration parameters. These dependencies are not evaluated.
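To make the role of the bin size concrete, the following Python sketch pools read counts of individual cytosines into tiling bins. The input layout `(position, methylated_reads, total_reads)` and the function name are assumptions for illustration. Rerunning the pooling with different `bin_size` values is what a coverage versus bin-size plot varies.

```python
from collections import defaultdict


def pool_into_bins(sites, bin_size):
    """Sum methylated and total read counts per tiling bin.

    sites: iterable of (position, methylated_reads, total_reads).
    Returns {bin_start: (methylated_reads, total_reads)} ordered by position.
    """
    bins = defaultdict(lambda: [0, 0])
    for pos, meth, total in sites:
        start = (pos // bin_size) * bin_size  # left edge of the bin containing pos
        bins[start][0] += meth
        bins[start][1] += total
    return {start: tuple(counts) for start, counts in sorted(bins.items())}
```

Larger bins collect more reads per test and thus gain power, but they can also dilute a short DMR with unchanged neighbouring cytosines.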


1.3.3 Design for Outcome O2

  • A data set distinct from the one used for outcomes O1 and O3 was analyzed, namely ...
  • To evaluate the occurrence of false positives, random permutations of the genome positions of the individual CpGs were used. This prevents the occurrence of DMRs (provided the overall amount of differential methylation in the original data set is small enough).
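The scrambling step can be sketched in Python as a permutation of positions among cytosines. The sketch assumes each site is a tuple whose first element is the position and whose remaining elements are the read counts of both samples, so the per-site difference between samples is preserved while its spatial clustering is destroyed. The tuple layout and the fixed seed are illustrative assumptions.

```python
import random


def scramble_positions(sites, seed=0):
    """Randomly reassign genomic positions among cytosines.

    sites: list of tuples (position, *counts). The counts of one cytosine
    stay together; only the position they are attached to changes.
    """
    rng = random.Random(seed)
    positions = [site[0] for site in sites]
    rng.shuffle(positions)
    return sorted((pos,) + tuple(site[1:]) for pos, site in zip(positions, sites))
```

Because individually differential cytosines survive the permutation, scrambled data can still contain chance clusters when differential methylation is frequent, which is the caveat noted in the bullet above.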


1.3.4 Design for Outcome O3

  • The results were obtained for the comparison of Arabidopsis thaliana wildtype and a quadruple mutant termed ddcc (drm1 drm2 cmt3 cmt2), which leads to complete loss of methylation in the non-CpG contexts (CpHpG and CpHpH).

  • The analysis was done for CpHpG (sometimes also termed CHG) and CpHpH (sometimes also called CHH) contexts.


1.3.5 Design for Outcome O4

  • Performance was assessed by the level of overlap of CpHpG DMRs with 165 genes that were upregulated in endosperm and 153 genes that were upregulated in embryo, according to the publication from which the data were taken.
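One simple way to quantify such an overlap is the proportion of genes that intersect at least one DMR. The Python sketch below assumes genes and DMRs on the same chromosome given as half-open `(start, end)` intervals. The exact overlap criterion used in the paper may differ, and the function names are hypothetical.

```python
def overlaps_any(interval, regions):
    """True if the half-open interval intersects at least one region."""
    start, end = interval
    return any(r_start < end and start < r_end for r_start, r_end in regions)


def fraction_overlapping(genes, dmrs):
    """Proportion of genes that overlap at least one DMR."""
    if not genes:
        return 0.0
    return sum(overlaps_any(gene, dmrs) for gene in genes) / len(genes)
```

Comparing this proportion between the endosperm-upregulated and the embryo-upregulated gene sets indicates whether the called DMRs are enriched near the genes expected to be differentially regulated.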


1.3.6 Design for Outcome O5

  • WT and met1-3 mutant A. thaliana were compared, using one biological replicate from (20) and the second biological replicate from (26).
  • Only chromosome 1 was analyzed because of the very long computation times.
  • For this analysis, no "scrambled" data were used to account for false positives.


1.4 Further comments and aspects

methylSig and methylKit were chosen as references because only these tools (and methylPipe) can handle non-CpG sequence contexts.


1.5 References

DMRcaller Bioconductor package

methylKit package on github

methylSig package on github

methylPipe Bioconductor package