Erick Lu
February 10, 2020
- Introduction
- Finding raw sequencing data in GEO
- Downloading FASTQ files using the SRA toolkit
- Automating downloads using Python
- Running fastq-dump without prefetch is slow
- Conclusion
Most scientific journals require scientists to make their sequencing data publicly available. This way, other researchers in the world can download the raw data and re-analyze it for their own purposes. Raw RNA-sequencing data is usually deposited in the Gene Expression Omnibus (GEO), at https://www.ncbi.nlm.nih.gov/geo/.
How do we access the data? Raw sequencing data comes in huge files that are often multiple gigabytes in size per sample. If you are a researcher with little bioinformatics experience, the finding and downloading the data can be somewhat complicated. This guide explains how to:
- Navigate through GEO to find raw sequencing data.
- Download and convert SRA files to FASTQ files using the NCBI's SRA toolkit.
- Use a Python script to batch download files with the SRA
prefetch
andfastq-dump
tools.
Let's say you are reading a paper in a journal and see an interesting RNA-seq experiment. You decide that you want to sift through the data for your own genes of interest. The first step is finding the GEO accession number corresponding to the dataset. To find it, you should navigate to the methods section and search (Ctrl-F) for "GSE". An identifier such as GSEXXXXX, where X represents an integer, usually shows up in a statement such as: "the data have been deposited in GEO under accession number GSEXXXXX". If that doesn't work, try to search for "GEO". Once we have the accession number, we can now search GEO to find the dataset.
As an example, let's work with an RNA-seq study that I analyzed for Li et al. Nature 2019, which can be found in GEO under accession number GSE71165 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71165). The purpose of this analysis was to explore the genes that splenic dendritic cells upregulated upon stimulation. Following the link, we can see all the details associated with the study. If we scroll to the bottom of the page, we should see a list of samples as well as a link to the SRA Run Selector, which I've pointed out in the following image:
If we click on the SRA Run Selector link, we will see a table of identifers for each of the samples. What we need are the SRA Run IDs, which start with the prefix "SRR". I've pointed out where to find them in the image below:
The SRA runs (e.g. SRR2121685) correspond to the actual sequencing files that we want to download in order to access the raw data. Sometimes, there will be multiple SRA runs for each sample, and they will be displayed as multiple rows in the table above, sharing a common GSM and SRX number. This means that the lab had deposited multiple FASTQ files for one sample and did not bother to concatenate them together prior to deposition.
You can get more details about how each sample was prepared clicking on the GSM identifier in the Samples section from the first image (e.g. GSM1828772). This will take you to the sample description page. Scrolling down to the Relations section will show you the sample's SRA identifier, beginning with "SRX":
Clicking on the SRA number (for example, SRX1114423 for HET_CD4_1) will bring you to yet another page that provides information about the SRA runs for that sample (e.g. SRR2121685 for HET_CD4_1), shown below:
Although it's more work, I prefer clicking through these pages for each individual sample because they provide important information such as how the libraries were prepared. I have summarized the different identifiers for GSE71165 in the following table:
Sample Name | GSM Identifier | SRA Identifier (SRX) | SRA Runs (SRR, download these) |
---|---|---|---|
HET_CD4_1 | GSM1828772 | SRX1114423 | SRR2121685 |
HET_CD4_2 | GSM1828773 | SRX1114424 | SRR2121686 |
IMM_CD4_1 | GSM1828774 | SRX1114425 | SRR2121687 |
IMM_CD4_2 | GSM1828775 | SRX1114426 | SRR2121688 |
To eventually get the raw data in FASTQ format, we first need to first download the SRA Run files associated with each sample (for example, SRR2121685 for HET_CD4_1). Downloading SRR2121685 will result in a file called SRR2121685.sra. But what is a .sra file and what does it do? A SRA file can be used by the NCBI's SRA toolkit as a set of "instructions" to construct the the FASTQ file. The next section explains the SRA toolkit and shows you how to download and convert SRA files into FASTQ files.
In order to download the SRA files onto your machine, we use the NCBI's SRA toolkit, which lets us use the command line to download a specified SRA run. If you are using a Linux platform, you can type: apt install sra-toolkit
in your command line to install the toolkit. You can read more about SRA toolkit here: https://www.ncbi.nlm.nih.gov/books/NBK242621/ and at their github repo: https://github.com/ncbi/sra-tools.
The toolkit works by first using the prefetch
command to download the SRA file associated with the specified SRA run ID.
For example, to download the SRA file for HET_CD4_1 (SRA Run identifier: SRR2121685), the command would be:
prefetch SRR2121685
You should observe the following output from running the command:
2020-02-06T21:54:29 prefetch.2.8.2: 1) Downloading 'SRR2121685'...
2020-02-06T21:54:29 prefetch.2.8.2: Downloading via https...
2020-02-06T21:57:32 prefetch.2.8.2: 1) 'SRR2121685' was downloaded successfully
The file SRR2121685.sra
should be downloaded into your home directory at ~/ncbi/public/sra/.
ls ~/ncbi/public/sra
SRR2121685.sra
After you have downloaded the SRA file, you can use the command fastq-dump
to extract the contents of it into a .fastq.gz
file. The Edwards lab at SDSU provides a nice tutorial for how to use fastq-dump here: https://edwards.sdsu.edu/research/fastq-dump/, with each of the settings explained. A sample command to extract SRR2121685.sra would be:
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ~/ncbi/public/sra/SRR2121685.sra
If successful, you should see the following output show up in your terminal:
Read 27928438 spots for /home/ericklu/ncbi/public/sra/SRR2121685.sra
Written 27928438 spots for /home/ericklu/ncbi/public/sra/SRR2121685.sra
We can check the folder fastq/
to make sure our files were downloaded correctly:
ls fastq
SRR2121685_pass_1.fastq.gz SRR2121685_pass_2.fastq.gz
We observe that two fastq files have been extracted from SRR2121685.sra. This is because the original data was produced from paired-end sequencing, which usually has both a Read1 file and Read2 file. fastq-dump
has extracted the SRA file into two files, with suffix "_1" for paired-end read 1 and "_2" for paired-end read 2. I typically use the settings provided above for fastq-dump
as my default settings.
Since there are lots of SRA files associated with our samples, it would take a long time to manually run prefetch
and fastq-dump
for all the files. To automate this process, I wrote a small script in python to first download each SRA file using prefetch
and then run fastq-dump
. The code is shown below and also provided in this repo as fastq_download.py
:
import subprocess
# samples correspond to Het_1, Het_2, Imm_1, Imm_2
sra_numbers = [
"SRR2121685", "SRR2121686", "SRR2121687", "SRR2121688"
]
# this will download the .sra files to ~/ncbi/public/sra/ (will create directory if not present)
for sra_id in sra_numbers:
print ("Currently downloading: " + sra_id)
prefetch = "prefetch " + sra_id
print ("The command used was: " + prefetch)
subprocess.call(prefetch, shell=True)
# this will extract the .sra files from above into a folder named 'fastq'
for sra_id in sra_numbers:
print ("Generating fastq for: " + sra_id)
fastq_dump = "fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ~/ncbi/public/sra/" + sra_id + ".sra"
print ("The command used was: " + fastq_dump)
subprocess.call(fastq_dump, shell=True)
We can run the python script by simply navigating to the folder on your machine where you want to store the FASTQ files (via the command line), then running python fastq_download.py
. After running the python script, our FASTQ files should all be sitting in a directory called 'fastq'.
ls fastq
SRR2121685_pass_1.fastq.gz SRR2121686_pass_1.fastq.gz SRR2121687_pass_1.fastq.gz SRR2121688_pass_1.fastq.gz
SRR2121685_pass_2.fastq.gz SRR2121686_pass_2.fastq.gz SRR2121687_pass_2.fastq.gz SRR2121688_pass_2.fastq.gz
If you would like to re-purpose the Python script for you own project, simply replace the SRA run IDs with those that match the files you want to download, by modifying the variable sra_numbers
.
If you don't pre-download the SRA files and just run the fastq-dump
command, the FASTQ file will still be generated. The SRA toolkit manual at https://www.ncbi.nlm.nih.gov/books/NBK242621/ says that this is a valid alternative. I would advise against it, since I have found this method to be much slower than first running prefetch
and then fastq-dump
on the pre-downloaded SRA files. Here's some proof, in which I time the processes below:
- Time it takes to download SRR2121685 using
prefetch
:
time prefetch SRR2121685
2020-02-07T02:39:54 prefetch.2.8.2: 1) Downloading 'SRR2121685'...
2020-02-07T02:39:54 prefetch.2.8.2: Downloading via https...
2020-02-07T02:43:07 prefetch.2.8.2: 1) 'SRR2121685' was downloaded successfully
real 3m15.028s
user 0m51.353s
sys 0m12.287s
- Time it takes to
fastq-dump
from the pre-downloaded SRR2121685 file:
time fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ~/ncbi/public/sra/SRR2121685.sra
Read 27928438 spots for /home/ericklu/ncbi/public/sra/SRR2121685.sra
Written 27928438 spots for /home/ericklu/ncbi/public/sra/SRR2121685.sra
real 22m21.284s
user 22m11.772s
sys 0m4.078s
- Time it takes to
fastq-dump
without pre-downloading SRR2121685:
time fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip SRR2121685
Read 27928438 spots for SRR2121685
Written 27928438 spots for SRR2121685
real 77m34.548s
user 25m37.226s
sys 0m12.551s
Downloading the SRA file took 3 min 15 seconds, and running fastq-dump
on the pre-downloaded SRA file of size 3210181560 bytes took 22 min 21 seconds, making the prefetch
+ fastq-dump
option take a total of 25 min 36 seconds. In comparison, running fastq-dump
without pre-downloading the files for the same SRA ID took a total time of 77 minutes 34 seconds!
We are done downloading the FASTQ files! Now, we can start mapping the reads to a reference genome and perform downstream bulk RNA-sequencing analysis. If you want to see more, I performed differential gene expression analysis on this data to find genes that were upregulated in the IMM_CD4_1 and IMM_CD4_2 samples compared to the HET_CD4_1 and HET_CD4_2 samples, which you can find at: https://github.com/erilu/dendritic-cell-bulk-rnaseq. I hope that this short tutorial has helped you learn how to use the SRA tools to download raw sequencing data. Thanks for reading!