This repo is intended to be run with Docker on Windows, macOS, or most Linux distros. Please see here for instructions on setting up Docker on your system.
docker pull jhuaplbio/sandbox
Get the test data from here. Place it in the same location as this repository to make it easier to keep track of.
This will simply start up your environment in the sandbox conda env, which contains all of the binaries you need for the downstream processes.
A list of installed binaries for bioinformatics teaching is as follows:
- minimap2 - Alignment
- bowtie2 - Alignment
- trimmomatic - Pre-processing
- NanoPlot - Plotting
- FastQC - Plotting
- bamstats - Plotting
- samtools - Parsing and Variant Calling
- bedtools - Parsing
- medaka - Consensus
- artic - Consensus
If we wanted to start up an interactive shell, we would simply override the entry command (the command run at container start) with bash, like so:
To download and run our primary sandbox container, run one of the following (use the `$pwd` form in Windows PowerShell and the `$PWD` form in a Unix shell such as macOS or Linux):
docker container run -w /data -v $pwd/test-data:/data --rm -it jhuaplbio/sandbox bash
docker container run -w /data -v $PWD/test-data:/data --rm -it jhuaplbio/sandbox bash
To exit a container once you're done, type: exit
You've successfully run your first interactive docker container!
You can, of course, run everything without entering the interactive shell, which simulates the Linux environment you'd be working in. Each command going forward (e.g. minimap2, samtools, etc.) would then look like this:
docker container run -w /data -v $pwd/test-data:/data --rm -it jhuaplbio/sandbox <command>
docker container run -w /data -v $PWD/test-data:/data --rm -it jhuaplbio/sandbox <command>
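For instance, a quick non-interactive sanity check that the image works and its conda environments are present (purely illustrative; use the $pwd form on Windows PowerShell):
docker container run -w /data -v $PWD/test-data:/data --rm -it jhuaplbio/sandbox conda env list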
All commands going forward assume that we're working in the interactive environment just mentioned, so make sure to run that before continuing. Once inside, your terminal should look something like:
(base) root@addcf87af2d3:/#
Let's take a quick look at the environment we are in at startup. It is base, which does not have any of our dependencies. First, we must activate sandbox with:
conda activate sandbox
Your terminal screen should look like:
(sandbox) root@addcf87af2d3:/#
Notice the (sandbox) on the left side.
mkdir -p viruses_trimmed
trimmomatic PE viruses/sars_cov_2/ERR6913101_1.fastq.gz \
viruses/sars_cov_2/ERR6913101_2.fastq.gz \
viruses_trimmed/ERR6913101_1.trim.fastq.gz \
viruses_trimmed/ERR6913101_1.untrim.fastq.gz \
viruses_trimmed/ERR6913101_2.trim.fastq.gz \
viruses_trimmed/ERR6913101_2.untrim.fastq.gz \
MINLEN:25
Since this is paired-end data, your output files should be the 1.trim.fastq.gz and 2.trim.fastq.gz files (plus the .untrim.fastq.gz files holding reads whose mate was dropped) in viruses_trimmed.
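A quick way to confirm the run produced them (paths taken from the trimmomatic command above; file sizes will vary):
ls -lh viruses_trimmed/ERR6913101_1.trim.fastq.gz viruses_trimmed/ERR6913101_2.trim.fastq.gz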
First, let's index our FASTA genome file. Since we are aligning against SARS-CoV-2, we will use the reference genome in the test data folder, reference/nCoV-2019.reference.fasta. We first need to head into the reference folder and run bowtie2-build:
cd reference; bowtie2-build \
-f /data/reference/nCoV-2019.reference.fasta nCoV-2019; \
cd /data
Next, run alignment against the indexed reference FASTA file
mkdir -p alignments;
bowtie2 -x /data/reference/nCoV-2019 \
-1 viruses_trimmed/ERR6913101_1.trim.fastq.gz \
-2 viruses_trimmed/ERR6913101_2.trim.fastq.gz \
-S alignments/ERR6913101_alignments.sam
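If you'd like to peek at the raw alignment before converting it, the first few SAM lines (header plus records) can be printed with head; this is purely an optional check:
head -n 5 alignments/ERR6913101_alignments.sam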
Let's now try minimap2, which is recommended for long reads (Oxford Nanopore runs)
mkdir -p /data/alignments/minimap2 && \
minimap2 \
-x map-ont \
-a /data/reference/nCoV-2019.reference.fasta \
-o /data/alignments/minimap2/NB03.sam demux-fastq_pass/NB03.fastq
Next we'll convert the SAM output to the compressed BAM format with samtools. For our short-read alignments, let's work on that first...
samtools view \
-S -b alignments/ERR6913101_alignments.sam > alignments/ERR6913101_alignments.bam && \
rm alignments/ERR6913101_alignments.sam
Next, let's take a look at the equivalent with minimap2 for the long reads
samtools view \
-S \
-b /data/alignments/minimap2/NB03.sam > /data/alignments/minimap2/NB03.bam
Now sort the alignments by coordinate. With the Bowtie2 short-read output:
samtools sort alignments/ERR6913101_alignments.bam > alignments/ERR6913101_alignments_sorted.bam
With the Minimap2 long-read output:
samtools sort /data/alignments/minimap2/NB03.bam > /data/alignments/minimap2/NB03_sorted.bam
Then index each sorted BAM file:
samtools index alignments/ERR6913101_alignments_sorted.bam
samtools index alignments/minimap2/NB03_sorted.bam
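With the index in place you can pull reads from a specific region. A small optional sketch, assuming the reference contig is named MN908947.3 as in the standard artic nCoV-2019 reference (check the FASTA header if yours differs):
samtools view alignments/ERR6913101_alignments_sorted.bam "MN908947.3:1-1000" | head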
bcftools mpileup \
-f reference/nCoV-2019.reference.fasta \
alignments/ERR6913101_alignments_sorted.bam \
| bcftools call \
-mv \
-Ov \
-o alignments/ERR6913101_alignment.vcf
Again, let's do the same for our long-read example fastq file:
bcftools mpileup \
-f reference/nCoV-2019.reference.fasta \
alignments/minimap2/NB03_sorted.bam \
| bcftools call \
-mv \
-Ov \
-o alignments/minimap2/NB03_alignment.vcf
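To take a quick, optional look at what was called, print the first few variant records while skipping the ## header lines (bcftools view reads plain VCF as well):
bcftools view alignments/ERR6913101_alignment.vcf | grep -v "^##" | head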
First, as always, let's focus on the short reads and generate alignment statistics:
samtools stats alignments/ERR6913101_alignments_sorted.bam > alignments/ERR6913101_alignments_sorted.stats
plot-bamstats \
-p alignments/plots_bamstats alignments/ERR6913101_alignments_sorted.stats
Next, do the same process for the long reads, this time on the minimap2 alignment output file (.bam):
samtools stats /data/alignments/minimap2/NB03.bam > /data/alignments/minimap2/NB03.stats
plot-bamstats \
-p alignments/minimap2/plots_bamstats alignments/minimap2/NB03.stats
NanoPlot \
--summary /data/20200514_2000_X3_FAN44250_e97e74b4/sequencing_summary_FAN44250_77d58da2.txt \
-o nanoplots
mkdir /data/fastqc_plots_fastq;
fastqc \
viruses_trimmed/ERR6913101_1.trim.fastq.gz viruses_trimmed/ERR6913101_2.trim.fastq.gz \
-o fastqc_plots_fastq
mkdir /data/fastqc_plots_bam;
fastqc \
-o fastqc_plots_bam \
-f bam alignments/ERR6913101_alignments_sorted.bam
mkdir /data/classifications;
kraken2 --db /data/databases/flukraken2 /data/metagenome/flu_ont/BC01.fastq --report /data/classifications/BC01.kraken.report
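kraken2 writes its per-read assignments to stdout by default; if you also want to keep them in a file, the --output flag does that. A sketch with the same inputs as above:
kraken2 --db /data/databases/flukraken2 --report /data/classifications/BC01.kraken.report --output /data/classifications/BC01.kraken.out /data/metagenome/flu_ont/BC01.fastq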
Try this out in Pavian. First exit from the Docker container with exit, then run:
docker container run -d --rm -p 8004:80 florianbw/pavian
You can check that the container is up with docker container ls, then open http://localhost:8004 in your browser and upload the Kraken report. You can also clean up stopped containers with docker container prune -f.
mkdir -p /data/databases
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz -O /data/databases/minikraken2.tar.gz;
mkdir -p /data/databases/minikraken2;
tar -xvzf /data/databases/minikraken2.tar.gz --directory /data/databases/minikraken2 --strip-components=1
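If the extraction worked, the database directory should now contain the kraken2 database files (typically hash.k2d, opts.k2d, and taxo.k2d); a quick check:
ls /data/databases/minikraken2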
mkdir -p /data/classifications;
kraken2 --db /data/databases/minikraken2 --classified-out sample_metagenome#.fq metagenome/sample_metagenome.fastq --report classifications/sample_metagenome.kraken.report
ktUpdateTaxonomy.sh /data/databases/minikraken2
ktImportTaxonomy -i /data/classifications/ERR6913101.kraken.report -o /data/classifications/ERR6913101.kraken.html -tax /data/databases/minikraken2/
ktImportTaxonomy -i /data/classifications/sample_metagenome.kraken.report -o /data/classifications/sample_metagenome.kraken.html -tax /data/databases/minikraken2/
Now take a look at the resulting classifications/sample_metagenome.kraken.html file by double-clicking it to open it in your browser.
You need to use a different Docker image than sandbox for this step. If you're already inside a Docker container, type exit first.
You may have already noticed that we have a demultiplexed set of files in the /data/demux-fastq_pass folder. However, this is not what you'll usually get from a traditional MinKNOW run: it will typically spit out many fastq files per sample rather than a single one. To gather them all up, we can run the artic guppyplex subcommand (the gather step), which merges all fastq files into a single one like what we see in /data/demux-fastq_pass.
Let's run the new image directly. Remember that the variable for the current directory is $PWD for Unix (Mac or Linux) and $pwd for Windows PowerShell.
To demultiplex a run, use guppy_barcoder on the fastq_pass folder of interest:
docker container run -w /data -v $pwd/test-data:/data --rm --name articdemux genomicpariscentre/guppy guppy_barcoder --require_barcodes_both_ends -i /data/20200514_2000_X3_FAN44250_e97e74b4/fastq_pass -s /data/20200514_2000_X3_FAN44250_e97e74b4/demux --recursive
docker container run -w /data -v $PWD/test-data:/data --rm --name articdemux genomicpariscentre/guppy guppy_barcoder --require_barcodes_both_ends -i /data/20200514_2000_X3_FAN44250_e97e74b4/fastq_pass -s /data/20200514_2000_X3_FAN44250_e97e74b4/demux --recursive
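After demultiplexing, the output directory should contain one subfolder per detected barcode (plus guppy's summary files); the exact barcode names depend on the run. For example:
ls test-data/20200514_2000_X3_FAN44250_e97e74b4/demux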
You need to use a different Docker image than sandbox for this step. If you're already inside a Docker container, type exit first.
Once you have a demultiplexed directory for your sample(s), you then need to gather the reads into a single fastq file for the next step.
You have two options (A or B) below; you only need to run one of them.
Option A, the easier method, is to use the artic guppyplex command from the staphb/artic-ncov2019 image to gather all fastq files in your demultiplexed directory into one file (make sure the output folder exists first, e.g. mkdir -p test-data/demultiplexed):
docker container run -w /data/20200514_2000_X3_FAN44250_e97e74b4/demux -v $pwd/test-data:/data --rm --name articgather staphb/artic-ncov2019 artic guppyplex --directory /data/20200514_2000_X3_FAN44250_e97e74b4/demux/barcode03 --output /data/demultiplexed/barcode03.fastq
docker container run -w /data/20200514_2000_X3_FAN44250_e97e74b4/demux -v $PWD/test-data:/data --rm --name articgather staphb/artic-ncov2019 artic guppyplex --directory /data/20200514_2000_X3_FAN44250_e97e74b4/demux/barcode03 --output /data/demultiplexed/barcode03.fastq
If you look at the demux-fastq_pass folder, you should see that NB03.fastq is the same length as the demultiplexed/barcode03.fastq file you just made.
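One simple way to compare them is to count lines with wc (each fastq record is four lines, so the totals should match):
wc -l test-data/demux-fastq_pass/NB03.fastq test-data/demultiplexed/barcode03.fastq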
Option B is to concatenate the demultiplexed fastq files yourself from your host shell:
mkdir -p test-data/demultiplexed
for fastq in $( ls test-data/20200514_2000_X3_FAN44250_e97e74b4/demux/barcode03 ); do \
cat test-data/20200514_2000_X3_FAN44250_e97e74b4/demux/barcode03/$fastq ; \
done > test-data/demultiplexed/barcode03.fastq
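Equivalently, a single cat with a shell glob produces the same merged file, assuming everything in the barcode folder is a plain .fastq as above:
cat test-data/20200514_2000_X3_FAN44250_e97e74b4/demux/barcode03/*.fastq > test-data/demultiplexed/barcode03.fastq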
You have now created a single merged fastq from all the reads attributed to that barcode. Congrats!
Let's start easy and run artic. This is not in your sandbox Docker image, so first we need to exit. The primer-schemes folder is seen below as /data/primer-schemes. We first need to run:
docker container run -w /data/consensus/artic -v $pwd/test-data:/data --rm --name articconsensus staphb/artic-ncov2019 artic minion --medaka --medaka-model r941_min_high_g360 --strict --normalise 1000000 --scheme-directory /data/primer-schemes --scheme-version V3 --read-file /data/demultiplexed/barcode03.fastq nCoV-2019/V3 barcode03;
docker container run -w /data/consensus/artic -v $PWD/test-data:/data --rm --name articconsensus staphb/artic-ncov2019 artic minion --medaka --medaka-model r941_min_high_g360 --strict --normalise 1000000 --scheme-directory /data/primer-schemes --scheme-version V3 --read-file /data/demultiplexed/barcode03.fastq nCoV-2019/V3 barcode03;
Finally, we can run multiqc . in the /data/consensus folder to build a helpful report of the variant information:
docker container run -w /data/consensus -v $pwd/test-data:/data --rm --name articreport staphb/artic-ncov2019 bash -c "export LC_ALL=C.UTF-8; export LANG=C.UTF-8; multiqc . "
docker container run -w /data/consensus -v $PWD/test-data:/data --rm --name articreport staphb/artic-ncov2019 bash -c "export LC_ALL=C.UTF-8; export LANG=C.UTF-8; multiqc . "
Your output files will be in test-data/consensus/artic; primarily you want the <barcode_name>.consensus.fasta and, if you generated it, the multiqc_report.html.
Medaka is a good tool for generating consensus sequences and calling variants for most organisms.
We have some demultiplexed SARS-CoV-2 test data available in the test directory. Let's try to make a consensus out of a sample's full fastq file. For this example, assume that no amplicon primer scheme was used for the long-read fastq file we've been working with.
conda activate consensus && \
mkdir -p /data/consensus/medaka && \
cd /data/consensus/medaka
medaka_consensus \
-i /data/demux-fastq_pass/NB03.fastq \
-d /data/reference/nCoV-2019.reference.fasta \
-o /data/consensus/medaka \
-m r941_min_high_g360
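medaka_consensus writes its results into the -o directory; the final assembly is typically named consensus.fasta (listing the directory will confirm what your medaka version produced):
ls /data/consensus/medaka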
But what if we wanted to run images one by one, NOT in an interactive environment?
If we're working with a select few organisms sequenced with an Illumina primer scheme, we should opt for the ivar bioinformatics toolkit. It performs the necessary primer trimming on your aligned reads and comes with several subcommands to help prep your data and call variants and consensus sequences from a samtools pileup.
First, we need the primer schemes for SARS-CoV-2 (our organism of interest for this example); these primers are what the artic minion process uses. Luckily, this is already present in your test-data/primer-schemes folder. It was pulled directly from the artic pipeline toolkit's set of primer schemes for Ebola, SARS-CoV-2, and Nipah, and the primer-schemes folder is automatically available in /data as primer-schemes. See here for a bit more on how we've set this up.
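You can confirm the V3 scheme files referenced by the commands below are in place; the nCoV-2019.primer.bed path comes from the ivar trim command further down (run this from the host, where the folder lives under test-data):
ls test-data/primer-schemes/nCoV-2019/V3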
Since we've already indexed our SARS-CoV-2 reference genome in earlier steps with bowtie2-build, we will skip that step. While we have many commands at our disposal in the jhuaplbio/sandbox environment, let's practice using one image per command and nothing more, recreating what we just did with the Illumina data, this time with primer trimming for a more trustworthy result.
For reference, these are reads obtained from the [Viral Amplicon Sequencing](https://sandbox.bio/tutorials?id=viral-amplicon) tutorial.
Command | Platform |
---|---|
`docker container run -w /data -v $pwd/test-data:/data --rm --name artic staphb/bowtie2 bash -c "bowtie2 -x /data/reference/nCoV-2019 -1 viruses/sars_cov_2/reads_1.fq -2 viruses/sars_cov_2/reads_2.fq \| samtools view -S -b \|` | Windows Powershell |
`docker container run -w /data -v $PWD/test-data:/data --rm --name artic staphb/bowtie2 bash -c "bowtie2 -x /data/reference/nCoV-2019 -1 viruses/sars_cov_2/reads_1.fq -2 viruses/sars_cov_2/reads_2.fq \| samtools view -S -b \|` | Unix |
Command | Platform |
---|---|
`docker container run -w /data -v $pwd/test-data:/data --rm --name artic staphb/ivar bash -c "ivar trim -i /data/alignments/reads_alignments.sorted.bam -b /data/primer-schemes/nCoV-2019/V3/nCoV-2019.primer.bed -p /data/alignments/reads_alignments.unsorted.trimmed.bam"` | Windows Powershell |
`docker container run -w /data -v $PWD/test-data:/data --rm --name artic staphb/ivar bash -c "ivar trim -i /data/alignments/reads_alignments.sorted.bam -b /data/primer-schemes/nCoV-2019/V3/nCoV-2019.primer.bed -p /data/alignments/reads_alignments.unsorted.trimmed.bam"` | Unix |
Command | Platform |
---|---|
`docker container run -w /data -v $pwd/test-data:/data --rm --name artic staphb/ivar bash -c "samtools sort /data/alignments/reads_alignments.unsorted.trimmed.bam > /data/alignments/reads_alignments.sorted.trimmed.bam"` | Windows Powershell |
`docker container run -w /data -v $PWD/test-data:/data --rm --name artic staphb/ivar bash -c "samtools sort /data/alignments/reads_alignments.unsorted.trimmed.bam > /data/alignments/reads_alignments.sorted.trimmed.bam"` | Unix |
Command | Platform |
---|---|
`docker container run -w /data -v $pwd/test-data:/data --rm --name artic staphb/ivar bash -c "mkdir /data/variants/; samtools mpileup -A -aa -d 0 -Q 0 --reference /data/reference/nCoV-2019.reference.fasta /data/alignments/reads_alignments.sorted.trimmed.bam > /data/variants/reads.pileup.txt"` | Windows Powershell |
`docker container run -w /data -v $PWD/test-data:/data --rm --name artic staphb/ivar bash -c "mkdir /data/variants/; samtools mpileup -A -aa -d 0 -Q 0 --reference /data/reference/nCoV-2019.reference.fasta /data/alignments/reads_alignments.sorted.trimmed.bam > /data/variants/reads.pileup.txt"` | Unix |
Command | Platform |
---|---|
`docker container run -w /data -v $pwd/test-data:/data --rm --name artic staphb/ivar bash -c "cat /data/variants/reads.pileup.txt \| ivar consensus -p /data/consensus/reads.consensus.fa -m 10 -t 0.5 -n N"` | Windows Powershell |
`docker container run -w /data -v $PWD/test-data:/data --rm --name artic staphb/ivar bash -c "cat /data/variants/reads.pileup.txt \| ivar consensus -p /data/consensus/reads.consensus.fa -m 10 -t 0.5 -n N"` | Unix |
So now we've used our single sandbox Docker image. But what if you have a tool that isn't available in this pre-made image? For example medaka, which, as we described previously, is available in sandbox but also as a standalone Docker image on Docker Hub here.
Let's do a consensus run like we did in the consensus pipeline, without pre-installing anything, and have our command run automatically once the image is downloaded.
Command | Platform |
---|---|
`docker container run -v $pwd/test-data:/data -w /data staphb/medaka bash -c "medaka_consensus -i /data/demux-fastq_pass/NB03.fastq -d /data/reference/nCoV-2019.reference.fasta -o /data/output/medaka_consensus -m r941_min_high_g303 -f"` | Windows Powershell |
`docker container run -v $PWD/test-data:/data -w /data staphb/medaka bash -c "medaka_consensus -i /data/demux-fastq_pass/NB03.fastq -d /data/reference/nCoV-2019.reference.fasta -o /data/output/medaka_consensus -m r941_min_high_g303 -f"` | Unix |
Notice something interesting: we can run a command directly from our shell and have all dependencies pulled in automatically before the command runs. This is one of the many strengths of Docker, since you get a fully runnable command with all required dependencies right out of the box!
That's it for the copy-and-paste section of the Docker portion of this workshop. Now that you hopefully have a grasp of the inner workings of Docker, we can move on to building and installing your own pipeline(s)!