[TOC]
This tutorial focuses on a subset of the Data Carpentry Genomics workshop curriculum - specifically, this page cover's how to run a BWA workflow on OSG resources. It will use the same general flow as the BWA segment of the Data Carpentry workshop with minor adjustments. The goal of this tutorial is to learn how to convert an existing BWA workflow to run on the OSPool.
Logged into the access point, we will run the tutorial command, that will create a folder for our analysis, as well as some sample files.
tutorial bwa
First, we need to install BWA, also called Burrows-Wheeler Aligner. To do this, we will create and navigate to a new folder in our /home directory called software
. We will then follow the developer's instructions (https://github.com/lh3/bwa) for using git clone
to clone the software and then build the tool using make
.
cd ~/ACE-bwa
cd software
git clone https://github.com/lh3/bwa.git
cd bwa
make
Next, BWA needs to be added to our PATH variables, to test if the installation worked:
export PATH=$PATH:/home/$USER/ACE-bwa/software/bwa/
To check that BWA has been installed correctly, type bwa
. You should receive output similar to the following:
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <hli@ds.dfci.harvard.edu>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
...
Now that we have successfully installed bwa
, we will create a portable compressed tarball of this software so that it is smaller and quicker to transport when we submit our jobs to the OSPool.
cd ~/ACE-bwa/software
tar -czvf bwa.tar.gz bwa
Checking the size of this compressed tarball using ls -lh bwa.tar.gz
reveals the file is approximately 4MB. The tarball should stay in /home.
Now that we have installed BWA, we need to download data to analyze. For this tutorial, we will be downloading data used in the Data Carpentry workshop. This data includes both the genome of Escherichia coli (E. coli) and paired-end RNA sequencing reads obtained from a study carried out by Blount et al. published in PNAS. Additional information about how the data was modified in preparation for this analysis can be found on the Data Carpentry's workshop website.
cd ~/ACE-bwa
./download_data.sh
The files were reduced to ~5,000 lines to reduce analysis time and were renamed to be shorter for teaching purposes.
Investigating the size of the downloaded genome by typing:
ls -lh data/ref_genome/
reveals the file is 1.4 MB. Therefore, this file should remain in /home and does not need to be moved to /public. We should also check the med fastq paired-end read files:
ls -lh data/trimmed_fastq_small
Once everything is downloaded, make sure you're still in the ACE-bwa
directory.
cd ~/ACE-bwa
Now that we have all items in our analysis ready, it is time to submit a single test job to map our RNA reads to the E. coli genome. For a single test job, we will choose a single sample to analyze. In the following example, we will align both the forward and reverse reads of SRR2584863 to the E. coli genome. Using a text editor such as nano
or vim
, we can create an example submit file for this test job called bwa-test.sub
containing the following information:
universe = vanilla
executable = bwa-test.sh
# arguments =
# need to transfer bwa.tar.gz file, the reference
# genome, and the trimmed fastq files
transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR2584863_1.trim.sub.fastq, data/fastq/SRR2584863_2.trim.sub.fastq
log = log/bwa_test_job.log
output = log/bwa_test_job.out
error = log/bwa_test_job.err
request_cpus = 1
request_memory = 2GB
request_disk = 1GB
queue 1
You will notice that the .log, .out, and .err files will be saved to a folder called log
.
We will call the script for this analysis bwa-test.sh
and it should contain the following information:
#!/bin/bash
# Script name: bwa-test.sh
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH
echo "Indexing E. coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Starting bwa alignment for SRR2584863"
bwa mem ecoli_rel606.fasta.gz SRR2584863_1.fastq SRR2584863_2.fastq > SRR2584863.aligned.sam
echo "Done with bwa alignment for SRR2584863!"
echo "Cleaning up files generated from genome indexing"
rm ecoli_rel606.fasta.gz.amb
rm ecoli_rel606.fasta.gz.ann
rm ecoli_rel606.fasta.gz.bwt
rm ecoli_rel606.fasta.gz.pac
rm ecoli_rel606.fasta.gz.sa
We can submit this single test job to HTCondor by typing:
condor_submit bwa-test.sub
To check the status of the job, we can use condor_q
.
Upon the completion of the test job, we should investigate the output to ensure that it is what we expected and also review the .log
file to help optimize future resource requests in preparation for scaling up.
For example, when we investigate the bwa_test_job.log
file created in this analysis, at the bottom of the file we see a resource table:
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 253770 1048576 27945123
Memory (MB) : 144 2048 2500
Here we see that we used less than half of both the disk space and memory we requested. In future jobs, we should request a smaller amount of each resource, such as 0.5 GB of disk space and 0.5 GB of memory. Prior to scaling up our analysis, we should run additional test jobs using these resource requests to ensure that they are sufficient to allow our job to complete successfully.
In preparation for scaling up, please review our guide on how to scale up after a successful test job and how to easily submit multiple jobs with a single submit file.
After reviewing how to submit multiple jobs with a single submit file, it is possible to determine that the most appropriate way to submit multiple jobs for this analysis is to use queue <var> from <list.txt>
.
To use this option, we first need to create a file with just the sample names/IDs that we want to analyze. To do this, we want to cut all information after the "_" symbol to remove the forward/reverse read information and file extensions. For example, we want SRR2584863_1.trim.sub.fastq to become just SRR2584863.
We will save the sample names in a file called samples.txt
:
cd ~/ACE-bwa
cd data/fastq/
ls *.fastq | cut -f 1 -d '_' | uniq > samples.txt
cd ~/ACE-bwa
Now, we can create a new submit file called bwa-alignment.sub
to queue a new job for each sample. To make it simpler to start, you can copy the bwa-test.sub
file (cp bwa-test.sub bwa-alignment.sub
) and modify it.
universe = vanilla
executable = bwa-alignment.sh
arguments = $(sample)
transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/$(sample)_1.fastq, data/fastq/$(sample)_2.fastq
transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"
log = log/bwa_$(sample)_job.log
output = log/bwa_$(sample)_job.out
error = log/bwa_$(sample)_job.err
request_cpus = 1
request_memory = 0.5GB
request_disk = 0.5GB
queue sample from data/fastq/samples.txt
In addition to restructuring our submit file to queue a new job for each sample, it is also advantageous to have our standard output, log, and error files saved to dedicated folders called "log" and "errout" to help keep our output files organized. Therefore, we need to make sure these folders exist in our /home directory prior to submitting our job. We will also create an additional folder to store our aligned sequencing files in a folder called results
:
mkdir results
To store the aligned sequencing files in the results
folder, we can add the transfer_output_remaps
feature to our submit file. This feature allows us to specify a name and a path to save our output files in the format of "file1 = path/to/save/file2", where file1 is the origional name of the document and file2 is the name that we want to save the file using. In the example above, we do not change the name of the resulting output files. This feature also helps us keep an organized working space, rather than having all of our resulting sequencing files be saved to our /home directory.
Once our submit file has been updated, we can update our script to look like and call it something like bwa-alignment.sh
:
#!/bin/bash
# Script name: bwa-alignment.sh
echo "Unpackage software"
tar -xzf bwa.tar.gz
echo "Set PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH
# Renaming first argument
sample=$1
echo "Index E.coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Starting bwa alignment for ${sample}"
bwa mem ecoli_rel606.fasta.gz ${sample}_1.fastq ${sample}_2.fastq > ${sample}.aligned.sam
echo "Done with bwa alignment for ${sample}!"
echo "Cleaning up workspace"
rm ecoli_rel606.fasta.gz.amb
rm ecoli_rel606.fasta.gz.ann
rm ecoli_rel606.fasta.gz.bwt
rm ecoli_rel606.fasta.gz.pac
rm ecoli_rel606.fasta.gz.sa
Once ready, we can submit our job to HTCondor by using condor_submit bwa-alignment.sub
.
When we type condor_q
, we see that three jobs have entered the queue (one for each of our three experimental samples).
When our jobs are completed, we can confirm that our alignment output results files were created by typing:
ls -lh results/*
We can also investigate our log, error, and output files in their respective folders to ensure we obtained the resulting output of these files that we expected.
For more information about running bioinformatics workflows on the OSG, we recommend our BLAST tutorial as well as our Samtools instillation guide.