This section covers the process and commands needed to download the data to scratch on Stampede with 00_gsaf_download. In theory, this only has to be done once, but it could be done again if data are lost or compromised.
The first step in the bioinformatics pipeline is to download the data from the sequencing facility, in this case GSAF. The GSAF provides the script for downloading the data from their Amazon web server.
- Setup project and data directories on Stampede.
- Copy script to download data: Save a script to download the data (called "00_gsaf_download.sh") in each subdirectory
- Download the data using TACC: For this three-part step I created a commands file (called "00_gsaf_download.cmds") with the command to execute the download script, created a launcher script to execute the commands file, and launched the job on TACC. (I know, it sounds like a lot of steps, but this way I use TACC's compute power, not my own.)
- Repeat steps 2 and 3 for all RNAseq jobs
Log in to TACC using ssh with your password and authentication credentials. Replace "<username>" with your TACC user name.
ssh <username>@stampede.tacc.utexas.edu
On scratch, create the project directory using the job number (e.g. JA16444) and a subdirectory called 00_rawdata. The argument -p will create the parent directory and the subdirectory if they do not already exist.
mkdir -p $SCRATCH/JA16444/00_rawdata
cd $SCRATCH/JA16444/00_rawdata
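Because the same layout is needed for every RNAseq job, the directory setup can also be done in one pass. The following is a minimal sketch, not part of the original workflow: the job numbers other than JA16444 are hypothetical placeholders, and the fallback for `SCRATCH` is only there so the sketch can be tried on a machine other than TACC.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical job numbers (only JA16444 comes from the text above);
# replace them with your own GSAF job numbers.
job_numbers=(JA16444 JA00001 JA00002)

# On TACC, $SCRATCH is already set. The temporary-directory fallback is an
# assumption that lets this sketch run elsewhere.
SCRATCH="${SCRATCH:-$(mktemp -d)}"

for job in "${job_numbers[@]}"; do
  # -p creates the project directory and 00_rawdata in one call and
  # does nothing if they already exist
  mkdir -p "$SCRATCH/$job/00_rawdata"
done

# Show the raw-data directories that now exist
ls -d "$SCRATCH"/JA*/00_rawdata
```

You would still cd into each 00_rawdata directory in turn to run the download steps.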
Copy the text for the GSAF download script found here: https://wikis.utexas.edu/display/GSAF/How+to+download+your+data. Use nano to save it as 00_gsaf_download.sh. Then, make the file executable.
chmod a+x 00_gsaf_download.sh
The download runs on TACC through a launcher script. To set this up, first use nano to create a new commands file, 00_gsaf_download.cmds, with the following line of text: `00_gsaf_download.sh ""`
Now, I use launcher_creator.py to create a launcher script that tells TACC how to launch this job. The arguments are defined clearly on this website: https://wikis.utexas.edu/display/bioiteam/launcher_creator.py. Then I use sbatch 00_gsaf_download.slurm to launch the job.
launcher_creator.py -t 12:00:00 -n 00_gsaf_download -j 00_gsaf_download.cmds -l 00_gsaf_download.slurm -q normal
sbatch 00_gsaf_download.slurm
To protect our data from accidental deletion or overwriting, let's make all the files read only!
chmod a-w *fastq.gz
ls -l
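Reading the permissions off `ls -l` works for a handful of files; for larger jobs, a quick check can list any file that still has a write bit set. This is a sketch under assumptions: the function name `list_writable_fastq` is made up for illustration, and `-perm /222` is GNU find syntax, so it assumes GNU find is available (as on most Linux systems).

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every *fastq.gz file directly inside the given
# directory (default: current directory) that is writable by user, group,
# or other. -perm /222 is GNU find syntax meaning "any write bit set".
list_writable_fastq() {
  find "${1:-.}" -maxdepth 1 -name '*fastq.gz' -perm /222
}

writable=$(list_writable_fastq .)

if [ -z "$writable" ]; then
  echo "No writable fastq.gz files found."
else
  echo "Still writable:"
  echo "$writable"
fi
```

If any files are listed, rerun the chmod command above on them.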
Create a new directory called wget_log and move all the output files associated with downloading the data into it.
mkdir wget_log
mv *.wget.log wget_log
ls
The great thing about using TACC for this is that you can go about doing other things while the files are downloading. Reset the environment variables above and repeat the process for all other RNAseq jobs.
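Samples often get named different things along the way. This is an attempt to make a Rosetta stone for future use: a CSV file (00_rossettastone.csv) with one row per R1 file, built from the underscore-separated parts of the file name.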
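# Write one CSV row per R1 file, built from the parts of its name.
for R1 in *R1_001.fastq.gz
do
  # Name of the matching R2 file (computed for reference; not written to the CSV)
  R2=$(basename "$R1" R1_001.fastq.gz)R2_001.fastq.gz
  # -F '_' splits the name on underscores; $0 is the full R1 file name
  echo "$R1" | awk -F '_' '{print "15-" $1 "," "15" $1 "," $2 "," $1 "_" $2 "_" $3 "," $0}' >> 00_rossettastone.csv
done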
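To see what a row looks like, it can help to run the two commands from the loop on a single name. The file name below is hypothetical and chosen only to show how the fields are split; real GSAF file names may have a different number of underscore-separated parts, in which case the columns will shift.

```shell
#!/usr/bin/env bash

# Hypothetical file name, used only for illustration.
R1="142C_CA1_S1_L002_R1_001.fastq.gz"

# Matching R2 name: strip the R1 suffix, then append the R2 suffix.
R2=$(basename "$R1" R1_001.fastq.gz)R2_001.fastq.gz
echo "$R2"
# prints 142C_CA1_S1_L002_R2_001.fastq.gz

# The awk call applied to one name. With -F '_', $1 is "142C", $2 is "CA1",
# $3 is "S1", and $0 is the whole input line, i.e. the file name.
row=$(echo "$R1" | awk -F '_' '{print "15-" $1 "," "15" $1 "," $2 "," $1 "_" $2 "_" $3 "," $0}')
echo "$row"
# prints 15-142C,15142C,CA1,142C_CA1_S1,142C_CA1_S1_L002_R1_001.fastq.gz
```

Note that the loop appends with `>>`, so running it twice in the same directory duplicates every row in 00_rossettastone.csv.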