🚨🚧 This page was created to improve pb-CpG-tools run times and to work around issues with `concurrent.futures`. These issues have been addressed in the latest versions of pb-CpG-tools; therefore, this page is now archived and users should refer to pb-CpG-tools for methylation analysis. 🚧🚨
Today, we cook some High Fidelity Meth-ylation scores. This is a collection of tools for the analysis of CpG/5mC data from PacBio HiFi reads aligned to a reference genome (e.g., an aligned BAM).
This is a modified version of PacBio's pb-CpG-tools, in which I broke apart their `aligned_bam_to_cpg_scores.py` script to use GNU Parallel rather than Python's `concurrent.futures`. I also took the opportunity to rename the scripts to be Breaking Bad themed, because why not...
- `Walter.py`: Walter is the brains of the operation... gathering data, making decisions.
- `Heisenberg.py`: Heisenberg does the behind-the-scenes cooking of some (Hi)gh (Fi)delity (Meth)ylation scores.
Together these two scripts obtain a list of CpG/5mC sites and modification probabilities from a BAM file of HiFi reads aligned to a reference genome.
To use these tools, the HiFi reads should already contain 5mC base modification tags, generated on-instrument or by using primrose. The aligned BAM should also be sorted and indexed.
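If the aligned BAM is not yet sorted and indexed, this can be done with samtools (a minimal sketch; the filenames are illustrative):

```bash
# Sort the aligned BAM and create the index Walter.py expects
samtools sort -o aligned.sorted.bam aligned.bam
samtools index aligned.sorted.bam
```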
There are several Python packages required to run the `Walter.py` script. These can easily be installed using conda and the `conda_env_cpg.yaml` file provided:
```bash
# create conda environment
conda env create -f conda_env_cpg.yaml

# activate environment
conda activate cpg

# install GNU Parallel
conda install parallel
```
These packages could also be installed manually. However, the specific package versions listed in the yaml file must be used to ensure a stable environment.
The input BAM should be sorted and have an associated index file (`.bai` or `.csi`). The HiFi reads in the aligned input BAM file must contain the SAM tags `MM` and `ML`, which encode base modifications. The tags are described in the SAM tag specification document, and are the default tags generated by primrose for HiFi CpG datasets.
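A quick way to confirm the tags are present is to inspect the first read with samtools (a minimal sketch, assuming samtools is installed; the filename is illustrative):

```bash
# Print any MM/ML tags on the first read; no output means the tags are missing
samtools view aligned.sorted.bam | head -n 1 | tr '\t' '\n' | grep -E '^(MM|ML):'
```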
| Tag | Type | Description |
|-----|------|-------------|
| `MM` | `Z` | Base modifications / methylation |
| `ML` | `B,C` | Base modification probabilities |
Notes for `ML`: The continuous probability range of 0.0 to 1.0 is remapped to the discrete integers 0 to 255 inclusively. The probability range corresponding to an integer N is N/256 to (N + 1)/256.
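As a worked example of this remapping (a quick awk sketch; the value 230 is arbitrary):

```bash
# An ML value of 230 maps back to the interval [230/256, 231/256),
# i.e. a 5mC probability of roughly 0.8984 to 0.9023
awk 'BEGIN { n = 230; printf "ML=%d -> %.4f to %.4f\n", n, n/256, (n+1)/256 }'
```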
These tools are explicitly designed for 5mC tags (e.g., `MM:Z:C+m`) and do not support other modification types or tags in multiple-modification format.
Usage:

```bash
python Walter.py -b input.bam -f ref.fasta -o label [options]
```

Mandatory arguments:

```
-b, --bam FILE                   The aligned BAM file.
-f, --fasta FILE                 The reference fasta file.
-o, --output_label STR           Label for output files, which results in [label].bed/bw.
```

Optional arguments:

```
-p, --pileup_mode count/model    Use a model-based approach to score modifications across
                                 sites (model) or a simple count-based approach (count). [count]
-d, --model_dir STR              Full path to the directory containing the model (*.pb files)
                                 to load. [None]
-m, --modsites denovo/reference  Only output CG sites with a mod probability > 0 (denovo),
                                 or output all CG sites based on the supplied reference
                                 fasta (reference). [denovo]
-c, --min_coverage INT           Minimum coverage required for filtered outputs. [4]
-q, --min_mapq INT               Ignore alignments with MAPQ < N. [0]
-a, --hap_tag STR                The SAM tag containing haplotype information. [HP]
-s, --chunksize INT              Break reference regions into chunks of this size for
                                 parallel processing. [500000]
-t, --threads INT                Number of threads for parallel processing. [1]
-h, --help                       Show this help message and exit.
```
The `-p, --pileup_mode` flag selects the method used to calculate modification probabilities (an example run follows this list).

- `model`: This approach is preferred. It uses distributions of modification scores and a machine-learning model to calculate the modification probabilities across CG sites. Using this option requires providing the path to the directory containing the model files with the `-d, --model_dir` flag. The required model is available in the `pileup_calling_model/` directory.
- `count`: This method is simplistic. For a given site, the numbers of reads with modification scores >0.5 and <0.5 are counted, and the modification probability is given as a percentage.
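For example, a model-based run might look like the following (a sketch; the file names and thread count are illustrative):

```bash
# Model-based scoring across 8 threads; the BAM must be sorted and indexed
python Walter.py \
  -b aligned.sorted.bam \
  -f ref.fasta \
  -o sample1 \
  -p model \
  -d pileup_calling_model/ \
  -t 8
```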
The `-m, --modsites` flag determines which sites will be reported.

- `denovo`: This option will identify and output all CG sites found in the consensus sequence from the reads in the pileup, reporting only sites with a modification probability > 0. This mode does not ensure that the reference also displays a CG site (e.g., there could be sequence mismatches between the reads and reference).
- `reference`: This option will identify and output all CG sites found in the reference sequences. This allows reporting of CG sites with zero modification probability. This mode does not ensure that aligned reads also display a CG site (e.g., there could be sequence mismatches between the reads and reference).
Using the `-a, --hap_tag` flag allows an arbitrary SAM tag to be used to identify haplotypes, rather than the default `HP` tag. The haplotype values must be `0`, `1`, and `2`, where `0` is not assigned/ambiguous.
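For instance, to read haplotype assignments from a custom tag (the tag name `XP` here is purely hypothetical):

```bash
# Use the hypothetical SAM tag XP instead of the default HP for haplotypes
python Walter.py -b aligned.sorted.bam -f ref.fasta -o sample1 -a XP
```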
Bed format (`.bed`) and bigwig format (`.bw`) files are generated for the complete read set and for each separate haplotype (when available). Per-read information is also generated as a tab-delimited file (note that these are the raw calls from primrose and have not been modeled). In addition, coverage-filtered versions of these files are produced based on the minimum coverage value set with `-c, --min_coverage`.
Two log files are also produced (`[label].Walter.log` & `[label].Heisenberg.log`), which contain useful information.
If haplotype information is absent, the following 5 files are expected:

```
[label].Heisenberg.reads.gz
[label].Walter.total.[pileup_mode].bed
[label].Walter.total.[pileup_mode].bw
[label].Walter.total.[pileup_mode].mincov[X].bed
[label].Walter.total.[pileup_mode].mincov[X].bw
```
If haplotype information is present, the following 13 files are expected:

```
[label].Heisenberg.reads.gz
[label].Walter.total.[pileup_mode].bed
[label].Walter.hap1.[pileup_mode].bed
[label].Walter.hap2.[pileup_mode].bed
[label].Walter.total.[pileup_mode].bw
[label].Walter.hap1.[pileup_mode].bw
[label].Walter.hap2.[pileup_mode].bw
[label].Walter.total.[pileup_mode].mincov[X].bed
[label].Walter.hap1.[pileup_mode].mincov[X].bed
[label].Walter.hap2.[pileup_mode].mincov[X].bed
[label].Walter.total.[pileup_mode].mincov[X].bw
[label].Walter.hap1.[pileup_mode].mincov[X].bw
[label].Walter.hap2.[pileup_mode].mincov[X].bw
```
The bed file columns will differ between the `-p model` and `-p count` methods, but both share the first six columns (see the filtering example after these lists):
- reference name
- start coordinate
- end coordinate
- modification probability
- haplotype
- coverage
For `-p count`, four additional columns are present:
- modified site count
- unmodified site count
- avg mod score
- avg unmod score
For `-p model`, three additional columns are present:
- estimated modified site count (extrapolated from model modification probability)
- estimated unmodified site count (extrapolated from model modification probability)
- discretized modification probability (calculated from estimated mod/unmod site counts)
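As a quick sketch of working with this layout, sites can be filtered on the coverage column with awk (the filename and cutoff are illustrative):

```bash
# Keep CG sites with coverage (column 6) of at least 10 from a model-mode bed
awk '$6 >= 10' sample1.Walter.total.model.bed > sample1.highcov.bed
```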
The bigwig files are an indexed binary format containing columns 1-4 listed above, and are preferred for loading CpG/5mC tracks in IGV.
An aligned BAM file containing HiFi reads with 5mC tags (`HG002.GRCh38.haplotagged.bam`) is freely available for download: https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/
The sample is HG002/NA24385 from the Human Pangenome Reference Consortium HG002 Data Freeze v1.0, and is aligned to GRCh38 (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). There are also four unaligned bam files containing the HiFi reads.
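The aligned BAM can be fetched with wget (a sketch; the exact file path under the directory above should be confirmed first, and the file is large):

```bash
# Download the example aligned BAM (file path assumed from the listing above)
wget https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/HG002.GRCh38.haplotagged.bam
```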
Walter & Heisenberg are also much faster than the original pb-CpG-tools `aligned_bam_to_cpg_scores.py` script. Shown below is a table with the original and updated runtimes. The current `-s, --chunksize` default of 500,000 requires 1-3 GB of memory per thread with a 30X coverage aligned BAM; a higher-coverage dataset will use more memory per thread.
Benchmarks were run using the `HG002.GRCh38.haplotagged.bam` dataset described in the section above. Peak memory was estimated as 3 GB per thread.
| Threads (`-t`) | Chunk Size (`-s`) | Original Wall Time (h:m:s) | Walter Wall Time (h:m:s) | Estimated Peak Memory |
|---|---|---|---|---|
| 48 | 500000 (default) | 3:15:56 | 2:23:00 | 144 GB |
| 36 | 500000 (default) | 3:58:06 | 2:40:00 | 108 GB |
| 24 | 500000 (default) | 5:19:03 | 3:24:03 | 72 GB |
| 12 | 500000 (default) | 8:58:01 | 6:34:54 | 36 GB |
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.