OSGConnect/ACE-bwa

Materials for 2022 ACE Global Consortium Meeting
[TOC]

Introduction

This tutorial focuses on a subset of the Data Carpentry Genomics workshop curriculum - specifically, this page covers how to run a BWA workflow on OSG resources. It follows the same general flow as the BWA segment of the Data Carpentry workshop, with minor adjustments. The goal of this tutorial is to learn how to convert an existing BWA workflow to run on the OSPool.

Get Tutorial Files

While logged into the access point, we will run the tutorial command, which creates a folder for our analysis as well as some sample files.

     tutorial bwa
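You can list the contents of the new folder to see what the tutorial created; based on the steps below, it should include at least the download_data.sh script used later in this tutorial (the exact contents may vary):

     ls ~/ACE-bwa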

Install and Prepare BWA

First, we need to install BWA, also called the Burrows-Wheeler Aligner. To do this, we will create and navigate to a new folder called software inside our ACE-bwa directory. We will then follow the developer's instructions (https://github.com/lh3/bwa) for using git clone to clone the software and then build the tool using make.

     cd ~/ACE-bwa
     mkdir -p software
     cd software
     git clone https://github.com/lh3/bwa.git
     cd bwa
     make

Next, BWA needs to be added to our PATH variable so that we can test whether the installation worked:

     export PATH=$PATH:/home/$USER/ACE-bwa/software/bwa/
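Note that this export only lasts for the current login session. If you would like bwa on your PATH every time you log in (optional, and assuming a bash shell), append the same line to your ~/.bashrc:

     echo 'export PATH=$PATH:/home/$USER/ACE-bwa/software/bwa/' >> ~/.bashrc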

To check that BWA has been installed correctly, type bwa. You should receive output similar to the following:

     Program: bwa (alignment via Burrows-Wheeler transformation)
     Version: 0.7.17-r1198-dirty
     Contact: Heng Li <hli@ds.dfci.harvard.edu>

     Usage:   bwa <command> [options]

     Command: index         index sequences in the FASTA format
              mem           BWA-MEM algorithm
              fastmap       identify super-maximal exact matches
     ...

Now that we have successfully installed bwa, we will create a portable compressed tarball of the software so that it is smaller and quicker to transfer when we submit our jobs to the OSPool.

     cd ~/ACE-bwa/software
     tar -czvf bwa.tar.gz bwa

Checking the size of this compressed tarball using ls -lh bwa.tar.gz reveals the file is approximately 4 MB. Because it is this small, the tarball should stay in /home.
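To double-check what was packaged, you can also list the tarball's contents without extracting it:

     tar -tzf bwa.tar.gz | head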

Download Data to Analyze

Now that we have installed BWA, we need to download data to analyze. For this tutorial, we will be downloading data used in the Data Carpentry workshop. This data includes both the genome of Escherichia coli (E. coli) and paired-end RNA sequencing reads obtained from a study carried out by Blount et al. published in PNAS. Additional information about how the data was modified in preparation for this analysis can be found on the Data Carpentry's workshop website.

     cd ~/ACE-bwa
     ./download_data.sh

The files were subset to ~5,000 lines each to shorten analysis time and were renamed to be shorter for teaching purposes.
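Since fastq files store each read as four lines, ~5,000 lines corresponds to roughly 1,250 reads per file. You can verify the line counts yourself (this path assumes the renamed files produced by download_data.sh):

     wc -l data/fastq/*.fastq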

Investigating the size of the downloaded genome by typing:

     ls -lh data/ref_genome/

reveals the file is 1.4 MB. Therefore, this file should remain in /home and does not need to be moved to /public. We should also check the trimmed paired-end fastq read files:

     ls -lh data/fastq

Once everything is downloaded, make sure you're still in the ACE-bwa directory.

     cd ~/ACE-bwa

Run a Single Test Job

Now that we have all items in our analysis ready, it is time to submit a single test job to map our RNA reads to the E. coli genome. For a single test job, we will choose a single sample to analyze. In the following example, we will align both the forward and reverse reads of SRR2584863 to the E. coli genome. Using a text editor such as nano or vim, we can create an example submit file for this test job called bwa-test.sub containing the following information:

     universe    = vanilla
     executable  = bwa-test.sh
     # arguments = 

     # need to transfer bwa.tar.gz file, the reference
     # genome, and the trimmed fastq files
     transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR2584863_1.fastq, data/fastq/SRR2584863_2.fastq

     log         = log/bwa_test_job.log
     output      = log/bwa_test_job.out
     error       = log/bwa_test_job.err

     request_cpus    = 1
     request_memory  = 2GB
     request_disk    = 1GB
     
     queue 1

You will notice that the .log, .out, and .err files will be saved to a folder called log.
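HTCondor will not create this folder for us, so make sure it exists before submitting (mkdir -p does nothing if the folder is already there):

     mkdir -p log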

We will call the script for this analysis bwa-test.sh and it should contain the following information:

     #!/bin/bash
     # Script name: bwa-test.sh

     echo "Unpacking software"
     tar -xzf bwa.tar.gz

     echo "Setting PATH for bwa" 
     export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH

     echo "Indexing E. coli genome"
     bwa index ecoli_rel606.fasta.gz

     echo "Starting bwa alignment for SRR2584863"
     bwa mem ecoli_rel606.fasta.gz SRR2584863_1.fastq SRR2584863_2.fastq > SRR2584863.aligned.sam

     echo "Done with bwa alignment for SRR2584863!"

     echo "Cleaning up files generated from genome indexing"
     rm ecoli_rel606.fasta.gz.amb
     rm ecoli_rel606.fasta.gz.ann
     rm ecoli_rel606.fasta.gz.bwt
     rm ecoli_rel606.fasta.gz.pac
     rm ecoli_rel606.fasta.gz.sa

We can submit this single test job to HTCondor by typing:

     condor_submit bwa-test.sub

To check the status of the job, we can use condor_q.

Upon the completion of the test job, we should investigate the output to ensure that it is what we expected and also review the .log file to help optimize future resource requests in preparation for scaling up.

For example, when we investigate the bwa_test_job.log file created in this analysis, at the bottom of the file we see a resource table:

     Partitionable Resources :    Usage  Request Allocated
         Cpus                 :                 1         1
         Disk (KB)            :   253770  1048576  27945123
         Memory (MB)          :      144     2048      2500

Here we see that we used only a fraction of the resources we requested: about 250 MB of the 1 GB of disk space, and 144 MB of the 2 GB of memory. In future jobs, we should request a smaller amount of each resource, such as 0.5 GB of disk space and 0.5 GB of memory. Prior to scaling up our analysis, we should run additional test jobs using these resource requests to ensure that they are sufficient to allow our jobs to complete successfully.
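With those adjustments, the resource request lines of a follow-up test submit file would become:

     request_cpus    = 1
     request_memory  = 0.5GB
     request_disk    = 0.5GB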

Scaling Up to Analyze Multiple Samples

In preparation for scaling up, please review our guide on how to scale up after a successful test job and how to easily submit multiple jobs with a single submit file.

After reviewing how to submit multiple jobs with a single submit file, we can determine that the most appropriate way to submit multiple jobs for this analysis is to use queue <var> from <list.txt>.

To use this option, we first need to create a file containing just the sample names/IDs that we want to analyze. To do this, we cut everything after the "_" symbol to remove the forward/reverse read information and the file extension. For example, we want SRR2584863_1.fastq to become just SRR2584863.

We will save the sample names in a file called samples.txt:

     cd ~/ACE-bwa
     cd data/fastq/
     ls *.fastq | cut -f 1 -d '_' | uniq > samples.txt
     cd ~/ACE-bwa
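After running these commands, samples.txt should contain one sample ID per line. With the Data Carpentry dataset used in this tutorial, the list should look similar to the following (the exact IDs depend on the files you downloaded):

     SRR2584863
     SRR2584866
     SRR2589044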

Now, we can create a new submit file called bwa-alignment.sub to queue a new job for each sample. To make it simpler to start, you can copy the bwa-test.sub file (cp bwa-test.sub bwa-alignment.sub) and modify it.

     universe    = vanilla
     executable  = bwa-alignment.sh
     arguments   = $(sample)

     transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/$(sample)_1.fastq, data/fastq/$(sample)_2.fastq
     transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"

     log         = log/bwa_$(sample)_job.log
     output      = log/bwa_$(sample)_job.out
     error       = log/bwa_$(sample)_job.err

     request_cpus    = 1 
     request_memory  = 0.5GB
     request_disk    = 0.5GB

     queue sample from data/fastq/samples.txt

In addition to restructuring our submit file to queue a new job for each sample, it is also advantageous to have our standard output, error, and log files saved to the dedicated log folder to help keep our files organized, so make sure that folder still exists in our /home directory prior to submitting. We will also create an additional folder, called results, to store our aligned sequencing files:

     mkdir -p results

To store the aligned sequencing files in the results folder, we can add the transfer_output_remaps feature to our submit file. This feature lets us specify a name and a path for saving our output files in the format "file1=path/to/save/file2", where file1 is the original name of the file created by the job and file2 is the name we want the saved copy to use. In the example above, we do not change the name of the resulting output files. This feature also helps us keep an organized workspace, rather than having all of our resulting sequencing files saved to the top level of our /home directory.
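If we did want to rename the files as they are saved, the same feature handles that as well. For instance, this hypothetical remap would add a suffix to each file while still saving it into the results folder:

     transfer_output_remaps = "$(sample).aligned.sam=results/$(sample)_ecoli.aligned.sam"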

Once our submit file has been updated, we can create an updated copy of our script, called bwa-alignment.sh, that looks like the following:

     #!/bin/bash
     # Script name: bwa-alignment.sh

     echo "Unpacking software"
     tar -xzf bwa.tar.gz

     echo "Setting PATH for bwa"
     export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH

     # Rename the first argument (the sample ID passed in by HTCondor)
     sample=$1

     echo "Indexing E. coli genome"
     bwa index ecoli_rel606.fasta.gz

     echo "Starting bwa alignment for ${sample}"
     bwa mem ecoli_rel606.fasta.gz ${sample}_1.fastq ${sample}_2.fastq > ${sample}.aligned.sam

     echo "Done with bwa alignment for ${sample}!"

     echo "Cleaning up workspace"
     rm ecoli_rel606.fasta.gz.amb
     rm ecoli_rel606.fasta.gz.ann
     rm ecoli_rel606.fasta.gz.bwt
     rm ecoli_rel606.fasta.gz.pac
     rm ecoli_rel606.fasta.gz.sa

Once ready, we can submit our job to HTCondor by using condor_submit bwa-alignment.sub.

When we type condor_q, we see that three jobs have entered the queue (one for each of our three experimental samples).

When our jobs are completed, we can confirm that our alignment output results files were created by typing:

     ls -lh results/*
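As a quick sanity check, we can also peek at the top of one of the aligned SAM files; output from bwa mem begins with @SQ and @PG header lines before the per-read alignment records (the sample ID here assumes the dataset used above):

     head -n 5 results/SRR2584863.aligned.sam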

We can also investigate our log, error, and output files in their respective folders to confirm that each job ran as expected.

For more information about running bioinformatics workflows on the OSG, we recommend our BLAST tutorial as well as our Samtools installation guide.
