Revision as of 11:44, 23 January 2019

1 DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts

Catoni M, Tsang JM, Greco AP, Zabet NR DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts. Nucleic Acids Res. 2018 Nov 2;46(19):e114

Permanent link to the paper


1.1 Summary

The paper considers the identification of differentially methylated regions (DMRs) from bisulfite sequencing data (BSSEQ). A new package (DMRcaller) is introduced. The DMRcaller package allows choosing between three conceptual approaches for merging information from individual cytosines (analysis of individual cytosines, of pooled tiling intervals, or of smoothed data). Three test statistics can be chosen (Fisher's exact test, score test, beta regression). Differential positions/bins/regions are selected based on several requirements, e.g. significance < threshold, methylation difference > threshold, coverage > threshold.
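The selection rule summarized above can be illustrated with a minimal Python sketch for a single pooled bin. This is not DMRcaller's implementation or API: the function names, the way coverage is averaged, and the default thresholds (`p_max`, `min_diff`, `min_cov`) are assumptions chosen only for illustration. Fisher's exact test is computed directly from the hypergeometric distribution.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def call_bin(meth1, total1, meth2, total2, n_cytosines,
             p_max=0.01, min_diff=0.4, min_cov=4):
    """Return True if a pooled bin passes all three (hypothetical) thresholds.

    meth*/total* are methylated and total read counts pooled over the bin
    for the two conditions; n_cytosines is the number of cytosines in the bin.
    """
    if total1 == 0 or total2 == 0 or n_cytosines == 0:
        return False
    p_value = fisher_exact_two_sided(meth1, total1 - meth1, meth2, total2 - meth2)
    diff = abs(meth1 / total1 - meth2 / total2)
    # Assumption: coverage is the lower of the two per-cytosine mean coverages.
    coverage = min(total1, total2) / n_cytosines
    return p_value <= p_max and diff >= min_diff and coverage >= min_cov
```

A bin is reported only if all three conditions hold at once, so lowering any one threshold can only increase the number of reported bins.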

Publicly available data from Arabidopsis thaliana (GSM2384978, GSM2384980, [26]), rice (GSM560563, GSM560562, [27]), and human IMR90 and H1 cell lines [15] were analyzed.

The performance of DMRcaller was assessed by comparison with two to four other methods (methylKit and methylSig, and in part also methylPipe and DSS).

1.2 Study outcomes

1.2.1 Outcome O1

For Arabidopsis and CpGs, DMRcaller-B (B denotes "binning") and DMRcaller-NF (NF denotes "noise filter", i.e. smoothing) were found to be superior to methylSig and methylKit. DMRcaller-NF exhibits the best overall performance.

Outcome O1 is presented as Figure 2 in the original publication.

1.2.2 Outcome O2

For the comparison of CpG methylation between two human cell lines, DMRcaller-NF and DMRcaller-B were found to be superior to methylSig and methylKit.

Outcome O2 is presented as Figure 3 in the original publication.

1.2.3 Outcome O3

For non-CpG contexts, the following results were obtained:

  • For CpHpG, DMRcaller-NF and DMRcaller-B are superior to methylSig and methylKit.
  • For CpHpH, DMRcaller-NF performs very weakly. In this context, methylSig or DMRcaller-B performs best (depending on the window size).

Outcome O3 is presented as Figure 5 in the original publication.

1.2.4 Outcome O4

As validation, BSSEQ data from rice comparing endosperm to embryo were used to evaluate the DMRs predicted by DMRcaller-B against gene expression differences. DMRcaller-B outperformed methylKit and methylSig in terms of the total number of predicted DMRs and the overlap of DMRs with regulated genes.

Outcome O4 is presented as Figure 6 and Supplementary Figure S8 in the original publication.

1.2.5 Outcome O5

  • DMRcaller-NF and DMRcaller-BR are superior.

Outcome O5 is presented as Figure 7 in the original publication.


1.2.6 Further outcomes

  • Boxplots of the run times of DMRcaller-B, DMRcaller-NF, methylKit, methylSig and methylPipe are provided. methylKit and methylSig are the fastest.
  • It was observed that the difference between the best method (DMRcaller-NF) and the other methods lies mainly in the length of the predicted DMRs, i.e. in most cases the same regions were found, but DMRcaller-NF predicts larger intervals as DMRs. DMRs that are predicted only by DMRcaller-NF are short, indicating an enhanced ability of DMRcaller-NF to identify short DMRs.
  • For the comparison H1 vs. IMR90, the overlap between the DMR predictions of the different methods was between 62% and 78% (Supplementary Figure S7A).
  • methylPipe could only be applied in part because it "cannot call DMRs that contain less than five differentially methylated CpGs"; it could therefore not be applied to the scrambled data. The overlap of methylPipe with the other methods is rather small and is shown in Supplementary Figure S7.


1.3 Study design and evidence level

1.3.1 General aspects

The paper presents a new approach (DMRcaller) and at the same time provides several analyses comparing the performance of the new approach with existing algorithms. Such a study setting is very frequent in practice but carries a high risk of biased outcomes. One reason for such a bias might be that application examples are typically selected which nicely demonstrate performance benefits. Moreover, new approaches are often established when existing methods perform poorly in a new application setup. In such settings, it remains rather unclear how the performance assessment translates to other application settings.


The predicted DMRs and the performance of the individual approaches depend on the choice of thresholds and configuration parameters. These dependencies are not evaluated. The following configuration parameters for DMRcaller are mentioned in the paper:

  1. the minimum difference between the methylation proportions
  2. the significance level for the statistical test
  3. the minimal average coverage level
  4. the minimum length of DMRs (shorter DMRs are removed)
  5. the minimum number of cytosines (DMRs with fewer are removed)

It seems that the choices for these thresholds are not stated in the paper.
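How parameters 4 and 5 act as a post-filter on candidate regions can be sketched in a few lines of Python. The record layout (`start`, `end`, `n_cytosines`) and the default values are hypothetical; as noted above, the values actually used in the paper do not seem to be stated.

```python
def filter_dmrs(dmrs, min_length=50, min_cytosines=4):
    """Keep only candidate DMRs that are long enough and contain enough cytosines.

    Each DMR is a dict with half-open coordinates 'start' and 'end' and a
    cytosine count 'n_cytosines' (hypothetical layout for illustration).
    """
    return [
        d for d in dmrs
        if d["end"] - d["start"] >= min_length and d["n_cytosines"] >= min_cytosines
    ]


candidates = [
    {"start": 100, "end": 400, "n_cytosines": 12},  # kept
    {"start": 500, "end": 520, "n_cytosines": 9},   # too short
    {"start": 900, "end": 1200, "n_cytosines": 2},  # too few cytosines
]
kept = filter_dmrs(candidates)
```

Because both filters remove regions, stricter values can only lower the genome coverage of the reported DMRs, which is why unreported thresholds make the comparison between methods hard to reproduce.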

The study designs underlying the specific outcomes (O1-O5) are described in the following subsections:


1.3.2 Design for Outcome O1

  • Wildtype was compared with methyltransferase knockouts, and it was assumed that this leads to methylation differences. The authors conclude that all predicted DMRs in the CpG context are therefore "true positives", which seems a very questionable assumption because unmethylated regions remain unmethylated.
  • The authors assess performance in terms of the genome coverage of the predicted DMRs, i.e. in terms of the number and size of the predicted DMRs.
  • Because the authors assume that there are no regions with the same methylation level, the performance measure does not assess false positives. Therefore, an increasing number of predicted DMRs always increases the performance.

  • The dependency on the bin size or on the width of the smoothing window is evaluated by plotting coverage against window/bin size.
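The coverage-based performance measure described in this list can be made concrete with a small Python sketch. It assumes DMRs are given as half-open `(start, end)` intervals on a single sequence of known length; the function name and data layout are illustrative and are not taken from the paper or the package.

```python
def genome_coverage(dmrs, genome_length):
    """Fraction of the genome covered by DMRs given as half-open (start, end) intervals.

    Overlapping or touching intervals are merged so no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(dmrs):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length
```

For example, `genome_coverage([(0, 100), (50, 150), (300, 400)], 1000)` merges the first two intervals and yields 0.25. The sketch also makes the criticism above visible: adding any further interval can never lower the score, regardless of whether that region is truly differentially methylated.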


1.3.3 Design for Outcome O2

  • A data set distinct from the one used for outcomes O1 and O3 was analyzed, namely ...
  • To evaluate the occurrence of false positives, random permutations of the genomic positions of the individual CpGs were used, which prevents the occurrence of DMRs (provided that the overall occurrence of differential methylation in the original data set is small enough).

A "scrambled data set" was generated by random permutation of the genomic positions of a ... The difference between the DMR genome coverage calculated for real and for scrambled data was used to assess the accuracy.
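The scrambling idea can be sketched in Python as follows. This is one plausible reading of the procedure, not the authors' code: the per-site data layout, the seed handling, and the helper names are assumptions. The per-site read counts are shuffled across the existing genomic positions, which destroys spatial clustering while keeping the overall distribution of methylation levels.

```python
import random


def scramble_positions(sites, seed=0):
    """Randomly reassign per-site read counts to the existing genomic positions.

    sites is a list of (position, counts) pairs, where counts can be any
    per-site record, e.g. (methylated_reads, total_reads) for both conditions.
    """
    positions = [pos for pos, _ in sites]
    counts = [c for _, c in sites]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rng.shuffle(counts)
    return list(zip(positions, counts))


def accuracy_score(coverage_real, coverage_scrambled):
    """Difference in DMR genome coverage between real and scrambled data."""
    return coverage_real - coverage_scrambled


sites = [(10, (9, 10)), (25, (8, 10)), (40, (1, 10)), (70, (0, 10))]
scrambled = scramble_positions(sites)
```

Any DMR called on the scrambled data is treated as a false positive, so a method is rewarded for high coverage on the real data and penalized for coverage on the scrambled data.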


1.3.4 Design for Outcome O3

  • The results were obtained for the comparison of Arabidopsis thaliana wildtype and a quadruple mutant termed ddcc (drm1 drm2 cmt3 cmt2), which leads to complete loss of methylation in CpG, CpHpG and CpHpH contexts.

  • The analysis was done for CpHpG (sometimes also termed CHG) and CpHpH (sometimes also called CHH) contexts.


1.3.5 Design for Outcome O4

  • Performance was assessed by the level of overlap of CpHpG DMRs with 165 genes that were upregulated in endosperm and 153 genes that were upregulated in embryo, according to the publication from which the data were taken.
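The overlap measure used for this outcome can be sketched in Python as the fraction of genes in a set that overlap at least one predicted DMR. The interval layout and function name are assumptions for illustration; the paper may define overlap differently (for example, by including flanking regions).

```python
def fraction_genes_with_dmr(genes, dmrs):
    """Fraction of genes overlapping at least one DMR.

    Genes and DMRs are half-open (start, end) intervals on the same sequence.
    """
    def overlaps(gene, dmr):
        # Two half-open intervals overlap if each starts before the other ends.
        return gene[0] < dmr[1] and dmr[0] < gene[1]

    if not genes:
        return 0.0
    hits = sum(any(overlaps(g, d) for d in dmrs) for g in genes)
    return hits / len(genes)


genes = [(0, 100), (200, 300), (500, 600)]
dmrs = [(90, 110), (590, 700)]
share = fraction_genes_with_dmr(genes, dmrs)  # two of the three genes are hit
```

Note that this measure, like genome coverage, tends to increase with the number and size of predicted DMRs, so it should be read together with the total number of DMRs called.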


1.3.6 Design for Outcome O5

  • WT and met1-3 mutant A. thaliana were compared, using one biological replicate from [20] and the second biological replicate from [26].
  • Only chromosome 1 was analyzed because of very long computation times.
  • For this analysis, no "scrambled" data were used to account for false positives.
  • Only the CpG context was considered.
  • An additional method (DSS) was used for comparison in this analysis. This method, however, exhibited inferior performance.


1.4 Further comments and aspects

methylSig and methylKit were chosen as references because only these tools (and methylPipe) can handle non-CpG sequence contexts.


1.5 References

[10] Feng,H., Conneely,K.N. and Wu,H. (2014) A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res., 42, e69.

[15] Lister,R., Pelizzola,M., Dowen,R.H., Hawkins,R.D., Hon,G., Tonti-Filippini,J., Nery,J.R., Lee,L., Ye,Z., Ngo,Q.-M. et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315–322.

[26] Catoni,M., Griffiths,J., Becker,C., Zabet,N.R., Bayon,C., Dapp,M., Lieberman-Lazarovich,M., Weigel,D. and Paszkowski,J. (2017) DNA sequence properties that determine susceptibility to epiallelic switching. EMBO J., 36, 617–628.

[27] Zemach,A., Kim,M.Y., Silva,P., Rodrigues,J.A., Dotson,B., Brooks,M.D. and Zilberman,D. (2010) Local DNA hypomethylation activates genes in rice endosperm. Proc. Natl. Acad. Sci. U.S.A., 107, 18729–18734.


DMRcaller Bioconductor package

methylKit package on github

methylSig package on github

methylPipe Bioconductor package