-
Notifications
You must be signed in to change notification settings - Fork 19
Tutorial: Identification of circular DNA using Circle Map Realign
This is a tutorial explaining step by step how to go from the raw data (the fastq files) to a interpretable, tab separated BED file indicating the chromosomal coordinates of DNA circles. To make the tutorial, we have simulated Illumina reads coming from a circular DNA from some unknown region of the human genome. The aim of this tutorial is to use Circle-Map in order to recapitulate the circular DNA origin.
- GNU/Linux
- The Burrows Wheeler aligner (BWA)
- SAMtools
- Circle-Map
We can download the left and right Illumina reads using the following commands:
wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/unknown_circle_reads_1.fastq
wget https://raw.githubusercontent.com/iprada/Circle-Map/master/tutorial/unknown_circle_reads_2.fastq
Once we have the raw data, we need to download the human genome reference, to map the reads back to the genome. In this tutorial we will use the hg38 version, which can be downloaded from UCSC using:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Now, decompress it using gzip:
gunzip -d hg38.fa.gz
Finally we need create the index of the reference genome, which are a set of files required by the aligner to map the reads to genome. They can be created using the following command:
bwa index hg38.fa
And with the following command for Circle-Map index:
samtools faidx hg38.fa
Now, we can begin our workflow to detect the circular DNA. We need to align the Illumina reads back to the reference genome. For this case, we will use the algorithm MEM under BWA and we will invoke it as follows:
bwa mem -q hg38.fa unknown_circle_reads_1.fastq unknown_circle_reads_2.fastq > unknown_circle.sam
This generates a SAM file, which contains the information about where and how the reads have aligned to the genome.
Notes:
- We use the -q option to assign independent mapping quality scores to split read alignments in BWA. This improves the estimation of the breakpoint graph weights in Circle-Map probabilistic model.
- In case you have a computer with many cores, you can win speed by increasing the number of threads using BWA -t option.
In this step, we will process the the SAM files in order to fit the formats required by Circle-Map.
We will sort the SAM file by read name and convert it to BAM ( SAM binary format) using:
samtools sort -n -o qname_unknown_circle.bam unknown_circle.sam
This file is used by Circle-Map to extract the circular DNA read candidates and to estimate the paremeters of the prior distribution.
We also need to sort the BAM file by the leftmost mapping coordinate:
samtools sort -o sorted_unknown_circle.bam unknown_circle.sam
This file is used by Circle-Map to calculate the genomic coverage metrics.
Notes:
- In case you have a computer with many cores, you can speed up this commands by increasing the number of threads using SAMtools sort -@ option
Now, we will extract the reads that indicate circular DNA rearrangements ( partially mapped reads and discordant read pairs) using Circle-Map:
Circle-Map ReadExtractor -i qname_unknown_circle.bam -o circular_read_candidates.bam
And we will sort the read candidates by coordinate with SAMtools:
samtools sort -o sort_circular_read_candidates.bam circular_read_candidates.bam
Notes:
- In case you have a computer with many cores, you can speed up this command by increasing the number of threads using SAMtools sort -@ option
To finish this step, we need to index the BAM file:
samtools index sort_circular_read_candidates.bam
samtools index sorted_unknown_circle.bam
This command will generate two BAI files, so that Circle-Map can retrieve efficiently the aligned reads from the sorted candidate circular and raw BAM files
Finally, we are ready to use Circle-Map Realign to detect the circular DNA. Circle-Map will take as input the following files we have generated above:
- sort_circular_read_candidates.bam: The circular DNA read candidates. Used for detecting the circular DNA
- qname_unknown_circle.bam: The read name sorted BAM file. Used for calculating the realignment prior
- sorted_unknown_circle.bam: The coordinate sorted BAM. Used for calculating circular DNA coverage metrics
- hg38.fa: Reference sequence. Used for realignment of the partially mapped reads.
And with following command:
Circle-Map Realign -i sort_circular_read_candidates.bam -qbam qname_unknown_circle.bam -sbam sorted_unknown_circle.bam -fasta hg38.fa -o my_unknown_circle.bed
Notes:
- In case you have a computer with many cores, you can speed up this command by increasing the number of threads using Circle-Map Realign -t option
You can look at the output using the command:
less -S my_unknown_circle.bed
Where we can see the generated BED file with the following circular DNA coordinates in BED format:
chr7 143911105 143917553 12 7 251.0 29.97115384615385 5.408852146902424 1.0 0.9988734509951184 0.0
Which indeed, has the coordinate from where we have simulate the circular DNA:
chr7 143911105 143917553