MethCP: Differentially Methylated Region Detection with Change Point Models (bioRxiv)
1 MethCP: Differentially Methylated Region Detection with Change Point Models
Boying Gong, Elizabeth Purdom, MethCP: Differentially Methylated Region Detection with Change Point Models, 2018, bioRxiv.
A new approach (MethCP) for identifying differentially methylated regions (DMRs) from whole-genome bisulfite sequencing data is proposed. The approach is designed for experimental designs more complex than two-group comparisons, e.g. time-course experiments. For the two-group setup, the authors claim that MethCP outperforms existing approaches.
1.2 Study outcomes
1.2.1 Outcome O1
For simulated data, the following outcomes were obtained (ROC curves, i.e. TPR vs. FPR for different local precision or local recall):
- Overall, metilene, MethCP-DSS and MethCP-MethylKit are superior to bsmooth, HMM-Fisher, DSS and methylKit.
- DSS performs rather well when the local recall is controlled, but rather poorly when the local precision is increased.
- MethCP controls the desired FPR better than metilene at a significance level of 0.05.
Outcome O1 is presented as Figure 2 in the original publication.
1.2.2 Outcome O2
For simulated data with small effect sizes (2.5%, 5%, 10%, 20%), the following result was obtained:
- For effect sizes <10%, the authors claim that only MethCP can accurately predict DMRs (although only results for MethCP and metilene are plotted).
Outcome O2 is presented as Figure 3 in the original publication.
1.2.3 Outcome O3
When the six control samples were randomly divided into two groups of three replicates and the counts were additionally permuted across samples for each CpG (termed "1. permutation" below), the following performance was observed:
- HMM-Fisher performs best. It yields almost no false-positive predictions.
- MethCP-DSS and MethCP-MethylKit perform well (around 20-40 false-positive DMRs and a proportion of wrongly predicted CpGs below 0.0005).
- Bsmooth, DSS, methylKit and metilene perform worst (more than 150 false-positive DMRs and a proportion of wrongly predicted CpGs of around 0.008-0.0022).
Outcome O3 is presented as Figure 4 panels (c) and (e) in the original publication.
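The two quantities reported for Outcomes O3 and O4 (number of false-positive DMRs and proportion of wrongly predicted CpGs) can be illustrated with a minimal sketch. It assumes that, on a permuted control-vs-control comparison, every detected DMR counts as a false positive, and that DMRs are given as inclusive (start, end) intervals on a single chromosome. The function name and data layout are hypothetical and not taken from the publication.

```python
from typing import List, Tuple


def false_positive_summary(dmrs: List[Tuple[int, int]],
                           cpg_positions: List[int]) -> Tuple[int, float]:
    """Summarize false positives on a null (control vs. control) comparison.

    Assumption: no true DMRs exist, so every detected region is a false
    positive. `dmrs` holds inclusive (start, end) coordinates and
    `cpg_positions` holds all tested CpG positions of one chromosome.
    """
    # Count CpGs that fall inside at least one detected region.
    covered = sum(
        1 for p in cpg_positions
        if any(start <= p <= end for start, end in dmrs)
    )
    return len(dmrs), covered / len(cpg_positions)
```

Calling `false_positive_summary(dmrs, cpg_positions)` returns the pair (number of false DMRs, proportion of CpGs inside false DMRs), which corresponds to the two kinds of values quoted in the bullets above.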
1.2.4 Outcome O4
When the six control samples were randomly divided into two groups of three replicates and the CpG positions were additionally permuted within each sample (termed "2. permutation" below), the following performance was observed:
- MethCP- performs best
- Bsmooth, HMM-Fisher, methylKit, MethCP-DSS and MethCP-MethylKit have very few (almost no) wrong predictions.
- DSS and metilene perform worst (around 60-90 false-positive DMRs and a proportion of wrongly predicted CpGs of around 0.0005).
Outcome O4 is presented as Figure 4 panels (d) and (f) in the original publication.
1.2.5 Further outcomes
1.3 Study design and evidence level
1.3.1 General aspects
- The different methods usually apply a coverage filter, i.e. an observed methylation ratio is discarded if it is based on too few reads. This filter step was removed entirely to obtain comparable outcomes that do not depend on method-specific filter thresholds. The drawback, however, is that the outcomes are less comparable to those obtained in the default setup (with coverage filter).
- Only regions with at least 3 CpGs and at least 0.1 for the "mean methylation level" were considered as DMRs.
- For bsmooth, the smoothing window was shortened from 1000 bp (default) to 500 bp because this yields better results for the simulated data set.
- DSS was applied with the "moving average smoothing" option.
- For methylKit, adjacent differentially methylated cytosines (DMCs) were merged manually into DMRs.
- "All other parameters other than the significance level (test statistics cutoffs) were left at the default values."
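The merging of adjacent DMCs and the region filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the publication: "adjacent" is taken to mean consecutive significant CpGs with no non-significant CpG in between, and the 0.1 threshold is interpreted as the absolute mean methylation difference between the two groups. All names (`Region`, `merge_adjacent_dmcs`, `filter_regions`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Region:
    start: int        # position of the first CpG in the region
    end: int          # position of the last CpG in the region
    n_cpgs: int       # number of CpGs merged into the region
    mean_diff: float  # mean (signed) methylation difference over those CpGs


def _to_region(run: List[int], positions: Sequence[int],
               diffs: Sequence[float]) -> Region:
    mean_diff = sum(diffs[i] for i in run) / len(run)
    return Region(positions[run[0]], positions[run[-1]], len(run), mean_diff)


def merge_adjacent_dmcs(positions: Sequence[int],
                        is_dmc: Sequence[bool],
                        diffs: Sequence[float]) -> List[Region]:
    """Merge runs of consecutive significant CpGs into candidate regions.

    Assumption: positions are sorted, and a single non-significant CpG
    ends the current run. The exact merging rule is not given in the text.
    """
    regions: List[Region] = []
    run: List[int] = []  # indices of the current run of significant CpGs
    for i, significant in enumerate(is_dmc):
        if significant:
            run.append(i)
        elif run:
            regions.append(_to_region(run, positions, diffs))
            run = []
    if run:  # close a run that reaches the end of the chromosome
        regions.append(_to_region(run, positions, diffs))
    return regions


def filter_regions(regions: List[Region], min_cpgs: int = 3,
                   min_mean_diff: float = 0.1) -> List[Region]:
    """Keep regions with enough CpGs and a large enough absolute mean difference."""
    return [r for r in regions
            if r.n_cpgs >= min_cpgs and abs(r.mean_diff) >= min_mean_diff]
```

With this layout, `filter_regions(merge_adjacent_dmcs(positions, is_dmc, diffs))` yields the regions that would be reported as DMRs under the assumed thresholds.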
1.3.2 Design for Outcome O1 and O2
- The outcomes were generated from simulated data for a two-group comparison with 3 vs. 3 replicates.
- Some details are provided about how the data were simulated (in supplement section B, page 14). However, no plots or other procedures for comparing simulated and real-world data are available. It is therefore difficult to assess how well the simulated data agree with real-world measurements, and hence how well the outcomes generalize to application settings.
- It seems that only one random realisation of the data was analyzed.
- To guarantee typical read coverage and methylation ratios, a publicly available human data set (GSE48580) was used.
1.3.3 Design for Outcome O3 and O4
- Publicly available data for Arabidopsis thaliana [Coleman-Derr and Zilberman, 2012] with GEO accession number GSE39045 were analyzed.
- Wild-type data were compared to the H2A.Z mutant.
- The data had six replicates in both groups
- To assess false positives, the six control replicates were randomly assigned to two groups of three replicates AND one of the following two additional permutation approaches was applied:
- Outcome O3: The two counts (methylated and unmethylated) were permuted across samples for each CpG. This breaks local correlations within a sample but preserves correlations that occur across all or several samples. It also removes global differences in the average methylation level between samples.
- Outcome O4: The CpG positions within a sample were permuted, which breaks local correlations along the genome. This does not remove potential global differences between the methylation levels of the individual samples.
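The two permutation schemes and the random group assignment can be sketched as follows. This is a minimal illustration that assumes the data are stored as one list of (methylated, unmethylated) count pairs per sample, and that in the second scheme each sample is shuffled independently; the publication may differ in these details. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int]  # (methylated reads, unmethylated reads) at one CpG


def permute_across_samples(data: List[List[Counts]],
                           rng: random.Random) -> List[List[Counts]]:
    """Scheme for Outcome O3: per CpG, shuffle the count pairs across samples.

    data[s][c] holds the counts of sample s at CpG c. The methylated and
    unmethylated counts stay together as a pair.
    """
    n_samples, n_cpgs = len(data), len(data[0])
    out = [list(sample) for sample in data]  # copy, input stays untouched
    for c in range(n_cpgs):
        column = [data[s][c] for s in range(n_samples)]
        rng.shuffle(column)
        for s in range(n_samples):
            out[s][c] = column[s]
    return out


def permute_within_samples(data: List[List[Counts]],
                           rng: random.Random) -> List[List[Counts]]:
    """Scheme for Outcome O4: shuffle the CpG order within each sample.

    Assumption: every sample gets its own independent shuffle.
    """
    out = []
    for sample in data:
        shuffled = list(sample)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out


def split_into_two_groups(n_samples: int,
                          rng: random.Random) -> Tuple[List[int], List[int]]:
    """Randomly assign sample indices to two groups of equal size."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return sorted(idx[:half]), sorted(idx[half:])
```

The first function keeps the set of count pairs at every CpG intact, so per-CpG properties shared by all samples survive while sample identity is scrambled. The second keeps the set of count pairs within every sample intact, so sample-level averages survive while the genomic order is scrambled. This mirrors the properties listed for the two schemes above.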
1.4 Further comments and aspects
Coleman-Derr, D. and Zilberman, D. 2012. Deposition of histone variant H2A.Z within gene bodies regulates responsive genes. PLoS Genetics 8, e1002988.