Comparative Genomic Read Depth
CGRD is a pipeline to compare sequencing read depths from two samples along a reference genome. Three major steps are involved:
- define effective genomic bins each of which harbors certain non-repetitive sequences
- align reads and count read depths per bin for both samples
- combine neighboring bins with similar fold changes in read depth between the two samples (segmentation)
From the result, genomic segments with similar and differential (higher or lower) read depths are obtained. Therefore, genomic copy number variation (CNV) based on read depths can be extracted from the result and visualized on the genome map.
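To make the segmentation idea concrete, here is a minimal sketch that computes a log2 ratio of read depths per bin and merges adjacent bins whose ratios are close. It is a simplified stand-in, not the method CGRD implements (DNAcopy is listed among the required R packages below). The function name `merge_bins`, the five-column input layout, the pseudocount of 1, and the default threshold of 0.5 are all assumptions made for illustration.

```shell
# merge_bins [threshold]
# Reads tab-separated bins on stdin: chr, start, end, count_sample1, count_sample2.
# Prints merged segments: chr, start, end, number_of_bins, mean_log2_ratio.
# Hypothetical helper for illustration only; not part of CGRD.
merge_bins() {
  awk -v thr="${1:-0.5}" 'BEGIN { FS = OFS = "\t" }
  {
    # log2 ratio of sample 2 over sample 1, with a pseudocount of 1
    lr = log(($5 + 1) / ($4 + 1)) / log(2)
    if (NR > 1 && $1 == chr && (lr - mean) < thr && (mean - lr) < thr) {
      # same chromosome and similar ratio: extend the current segment
      end = $3; n++; mean += (lr - mean) / n
    } else {
      # otherwise close the current segment (if any) and start a new one
      if (NR > 1) printf "%s\t%d\t%d\t%d\t%.2f\n", chr, start, end, n, mean
      chr = $1; start = $2; end = $3; n = 1; mean = lr
    }
  }
  END { if (NR > 0) printf "%s\t%d\t%d\t%d\t%.2f\n", chr, start, end, n, mean }'
}

# Toy example: four 100-bp bins; the last two have doubled depth in sample 2.
printf 'chr1\t0\t100\t99\t99\nchr1\t100\t200\t99\t99\nchr1\t200\t300\t99\t199\nchr1\t300\t400\t99\t199\n' |
  merge_bins 0.5
```

With the toy input, the first two bins collapse into one segment with a ratio near 0 and the last two into a segment with a ratio near 1, which is the kind of differential segment a CNV call would be based on.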
G Lin, C He, J Zheng, DH Koo, H Le, H Zheng, TM Tamang, J Lin, Y Liu, M Zhao, Y Hao, F McFarland, B Wang, Y Qin, H Tang, DR McCarty, H Wei, MJ Cho, S Park, H Kaeppler, S Kaeppler, Y Liu, NM Springer, PS Schnable, G Wang, FF White, S Liu. (2021). Chromosome-level genome assembly of a regenerable maize inbred line A188, Genome Biology, 22:175
v0.3.7: fixed a bug for checking input files
v0.3.6: added a step to check if input files exist and fixed a bug in version checking
v0.3.5: added the --adj0 parameter to allow a further adjustment of the logRD mode to 0
v0.3.4: added a step to check required software packages and fixed the issue associated with --knum
- reference genome (FASTA format)
- FASTQ reads or a sorted BAM file of sample 1
- FASTQ reads or a sorted BAM file of sample 2
Note:
- If BAM files are provided, their BAM index files must be located in the same directory as the BAM files.
- FASTQ data should be whole-genome sequencing data. The higher the sequencing depth, the smaller the bin size can be.
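Since BAM input needs an index next to each BAM file, a quick pre-flight check can save a failed run. The sketch below is not part of CGRD. The helper name `has_bam_index` is hypothetical, and the accepted index names (`sample.bam.bai`, `sample.bai`, `sample.bam.csi`) are common conventions assumed here, not something CGRD is documented to require.

```shell
# has_bam_index FILE.bam
# Succeeds if an index file sits next to the given BAM file.
# Assumed naming conventions: FILE.bam.bai, FILE.bai, or FILE.bam.csi.
has_bam_index() {
  bam=$1
  [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ] || [ -f "$bam.csi" ]
}

# Check every BAM path passed to this script.
for bam in "$@"; do
  if has_bam_index "$bam"; then
    echo "index found: $bam"
  else
    echo "index missing: $bam (try: samtools index $bam)" >&2
  fi
done
```

Saved as a script, it could be invoked with the subject and query BAM paths as arguments before launching the pipeline.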
Running is easy but might take days if the genome is large and the sequencing data are high-depth.
If no BAM alignments are ready, run:
perl <path-to-cgrd>/cgrd --ref <fas> \
--subj ref --sfq1 <subject fq1> --sfq2 <subject fq2> \
--qry qry --qfq1 <query fq1> --qfq2 <query fq2>
If BAM alignments are ready, run:
perl <path-to-cgrd>/cgrd --ref <fas> \
--subj ref --sbam <subject bam> \
--qry qry --qbam <query bam>
The following packages are required:
- jellyfish: to generate k-mers from a FASTA file
- Bowtie: to align and determine k-mer positions on the genome
- BWA: to align reads to the reference genome
- samtools: to convert SAM to BAM
- bedtools: to determine read counts per genomic bin
- pandoc: to create an HTML report
- R: to perform CNV analysis and create a report
- R packages: rmarkdown, knitr, DNAcopy
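To confirm that the listed tools are reachable before a long run, a small check such as the following may help. It is only a sketch: the helper name `missing_tools` is hypothetical, and the executable names passed to it (for example `Rscript` for R) are assumptions that may differ between installations.

```shell
# missing_tools NAME...
# Prints each named command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Executable names below are assumptions; adjust them to your installation.
missing=$(missing_tools jellyfish bowtie bwa samtools bedtools pandoc Rscript)
if [ -n "$missing" ]; then
  echo "not found on PATH:" $missing >&2
fi
```

This only checks command-line tools. The R packages (rmarkdown, knitr, DNAcopy) would need a separate check from within R.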
If all the packages are installed and their commands are on the PATH, you can clone CGRD and use it directly.
git clone https://github.com/liu3zhenlab/CGRD.git
cd CGRD
perl cgrd
git clone https://github.com/liu3zhenlab/CGRD.git
cd CGRD
conda env create -f cgrd.yml
conda activate cgrd
perl cgrd
conda create -n cgrd
conda activate cgrd
conda install -c bioconda jellyfish bowtie bwa bedtools pandoc samtools=1.9 minimap2
conda install -c r r-base r-knitr r-rmarkdown
conda install -c bioconda bioconductor-DNAcopy
# after all packages are installed:
git clone https://github.com/liu3zhenlab/CGRD.git
cd CGRD
perl cgrd
- jellyfish-2.2.10
- bowtie-1.2.3
- bwa-0.7.17
- samtools-1.9
- bedtools-2.29.0
- pandoc-2.2.3.2-0
- r-base-3.6.1
- r-rmarkdown-1.12
- r-knitr-1.22
- dnacopy-1.58.0
Note: the installation may take 1-2 hours.
The following warning message may appear during report generation. It does not affect the result but still needs to be resolved: "'mode(width)' and 'mode(height)' differ between new and previous"
Please report any bugs or suggestions on GitHub or by email to Sanzhen Liu (liu3zhen@ksu.edu).
CGRD is distributed under the MIT license.
The idea was developed by Sanzhen Liu when he was in the Schnable lab at Iowa State University. Guifang Lin tested the scripts. Thanks to Ha Le for suggestions.