15. Useful scripts

rprops edited this page Dec 17, 2018 · 15 revisions

1. Cleanup directories

This script goes through each subdirectory of the current working directory and removes specific files. Before removing anything from a subdirectory, it checks that a specific file is present there (here, a file matching *.R1.fastq). Changing the -e flag to -d makes it check for a specific subdirectory instead, before removing the files that match the designated pattern.

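#!/bin/bash
set -e
for i in */; do
    echo "$i"
    cd "$i"
    # Expand the glob first, so the -e test also works when several files match
    set -- *.R1.fastq
    # Check if a *.R1.fastq file exists
    if [ -e "$1" ]; then
        # Remove fastqs except the ones with the derep_scythe_sickle_ pattern
        # (-r is a GNU xargs option: do not run rm when nothing is left to delete)
        ls *.fastq | grep -v "derep_scythe_sickle_" | xargs -r rm
    fi
    cd - > /dev/null
done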
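
Because the script deletes files, it can help to preview the candidates first. The following is a minimal sketch, not part of the original script; the function name list_removable_fastq is hypothetical. For one directory, it prints the fastq files that the same rule would remove, without deleting anything.

```shell
# Hypothetical helper: list the fastq files the cleanup rule would remove
# from one directory, without deleting anything.
list_removable_fastq() {
    local dir=${1%/} f
    # Same condition as the script: only act if at least one *.R1.fastq exists
    compgen -G "${dir}/*.R1.fastq" > /dev/null || return 0
    for f in "${dir}"/*.fastq; do
        case "$(basename "$f")" in
            *derep_scythe_sickle_*) ;;       # kept by the cleanup rule
            *) printf '%s\n' "$f" ;;         # would be removed
        esac
    done
}
```

To preview all subdirectories, one could run: for d in */; do list_removable_fastq "$d"; done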

Download sequences from NCBI SRA

The SRA Toolkit must be on your PATH. Run the script as bash download_NCBI.sh sra.list, where sra.list contains one SRR identifier per line:

SRR5808316
SRR5808317
SRR5808318
SRR5808319
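...

#!/bin/bash

set -e

# Loop over the SRR identifiers listed in the file given as first argument
for file in $(cat "$1")
do
    # The first six characters of the identifier (e.g. SRR580) name the parent directory
    stub1=${file:0:6}
    wget "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${stub1}/${file}/${file}.sra"
    fastq-dump --split-3 "${file}.sra"
done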
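
To make the directory layout used in the wget call easier to follow, here is a small sketch that only builds the URL for a given identifier. The function name sra_url is hypothetical, and whether NCBI still serves this FTP layout is not verified here.

```shell
# Hypothetical helper: build the URL the download script requests
# for one SRR identifier (no network access involved).
sra_url() {
    local acc=$1
    local stub=${acc:0:6}    # first six characters, e.g. SRR580
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/%s/%s/%s.sra\n' \
        "$stub" "$acc" "$acc"
}
```

For example, sra_url SRR5808316 prints a path that goes through the SRR580 directory.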

One-liners

Compress whole directory

tar -zcvf archive.tar.gz directory/ 

Count number of sequences in fastq file

awk '{s++}END{print s/4}' input.fastq
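
The count above divides the number of lines by four, so it assumes every record spans exactly four lines (no wrapped sequence or quality lines). A minimal sketch of the same idea as a function, with the hypothetical name count_fastq_reads and awk's built-in line counter NR in place of the s variable:

```shell
# Hypothetical helper: number of records in a four-line-per-record fastq file
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}
```

A result that is not a whole number suggests a truncated file or wrapped records.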

Remove sequences shorter than threshold (bbmap toolset)

reformat.sh in=input.fa out=output.fa minlength=1000

Read in lines from file and do something with them:

for sample in `cat SAMPLE_IDs`; do something; done
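
The for loop above splits the file on any whitespace, so a line containing spaces becomes several items. If whole lines are needed, a while read loop is a common alternative. A minimal sketch, with a hypothetical function name and an echo standing in for the real work:

```shell
# Hypothetical helper: process a list file line by line
process_samples() {
    local list=$1 sample
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue      # skip empty lines
        echo "processing ${sample}"       # replace with the actual command
    done < "$list"
}
```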

Select lines from input.txt using multiple patterns read from patterns.tsv:

grep -Fwf patterns.tsv input.txt > input_selected_lines.tsv
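
The three flags do the following: -F treats each pattern as a literal string rather than a regular expression, -w requires the match to be a whole word, and -f reads the patterns from a file. A small sketch with a hypothetical wrapper name, which can be tried on toy data:

```shell
# Hypothetical wrapper around the one-liner above
select_lines() {
    # $1: file with one literal pattern per line, $2: file to search
    grep -F -w -f "$1" "$2"
}
```

With the patterns geneA and gene.B, a line starting with geneAB is not selected (not a whole word), and neither is genexB (the dot is literal).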

Rename fasta headers. This replaces each header with contig_i, where i runs from 1 to the number of contigs.

awk '/^>/{print ">contig_" ++i; next}{print}' < input.fasta > input-fixed.fasta

The following script builds a list that maps the new contig names back to the original ones (e.g. when you have to import the binning associated with the individual fasta files into Anvi'o). new_original_contig_list.tsv contains the new and original contig headers as well as the fasta file name (i.e. bin name). new_contig_list.tsv contains only the new contig header and the associated file/bin name. Adjust input-fixed.fasta to the correct file if necessary. anvi-import-collection can be applied directly to the new_contig_list.tsv file.

for files in *.fa; do
	echo "$files"
	dos2unix "$files"
	# Original header (without ">") and source file name, tab-separated
	grep "^>" "$files" | sed "s/>//g" | awk -v file2="${files}" -F "\t" '{print $1 "\t" file2}' >> original_contig_list.tsv
done
# New headers taken from the renamed fasta
grep "^>" input-fixed.fasta | sed "s/>//g" > tmp
paste tmp original_contig_list.tsv | column -s $'\t' -t > new_original_contig_list.tsv
awk '{print $1 "\t" $3}' new_original_contig_list.tsv | sed "s/\.fa//g" > new_contig_list.tsv
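
As a possible alternative for a single fasta file, the renaming and the mapping can be produced in one awk pass, which avoids having to line up two separately generated lists. This is only a sketch: the function name rename_with_map is hypothetical, and unlike the script above it records the new and original headers only, not the file/bin name.

```shell
# Hypothetical helper: rename headers to contig_i and write a mapping file
rename_with_map() {
    local input=$1 output=$2 map=$3
    awk -v map="$map" '
        /^>/ {
            i++
            # new name <TAB> original header (without the leading ">")
            print "contig_" i "\t" substr($0, 2) > map
            print ">contig_" i
            next
        }
        { print }
    ' "$input" > "$output"
}
```

Usage would look like: rename_with_map input.fasta input-fixed.fasta mapping.tsv (mapping.tsv being a hypothetical output name).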

Read patterns from file1.tsv and return the lines of file2.tsv that match one of these patterns at the start of the line, i.e. in the first column.

grep -f <(sed 's/.*/\^&\\>/' file1.tsv) file2.tsv
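
The sed call above turns every pattern into a regular expression anchored at the start of the line and ending at a word boundary. When the goal is an exact match on the first column, a common awk idiom is a possible alternative. A sketch with a hypothetical function name; note that it compares the first whitespace-delimited field for equality instead of using regular expressions, so the results can differ for patterns containing special characters.

```shell
# Hypothetical helper: print lines of file2 whose first field is listed in file1
match_first_column() {
    # While reading the first file (NR == FNR), store its first field as a key;
    # for the second file, print lines whose first field is a stored key.
    awk 'NR == FNR { keys[$1]; next } $1 in keys' "$1" "$2"
}
```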

3. Flux

Wait for job 34567890 to finish without errors before starting task.pbs.

qsub -W depend=afterok:34567890 task.pbs
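
Instead of typing the job number by hand, the ID can be captured when the first job is submitted. A minimal sketch, assuming that qsub prints the ID of the submitted job on standard output (as PBS/Torque normally does); the function name submit_chain and the script name first.pbs are hypothetical.

```shell
# Hypothetical helper: submit a second job that waits for the first one
submit_chain() {
    local first=$1 second=$2 jobid
    # Assumption: qsub prints the new job ID on stdout
    jobid=$(qsub "$first") || return 1
    # Start the second job only if the first one finishes without errors
    qsub -W "depend=afterok:${jobid}" "$second"
}
```

Usage would look like: submit_chain first.pbs task.pbs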

4. Split fasta into multiple fasta files

This shell script splits the input fasta into one fasta file per sequence, using each fasta header as the output file name.

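#!/bin/bash
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line
do
    if [[ ${line:0:1} == '>' ]]
    then
        # Header line: start a new output file named after the header
        outfile="${line#>}.fa"
        printf '%s\n' "$line" > "$outfile"
    else
        # Sequence line: append to the current output file
        printf '%s\n' "$line" >> "$outfile"
    fi
done < "$1"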
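
Headers that contain spaces or slashes make awkward or invalid file names. One possible variation is sketched below in awk: it names each file after the first word of the header and replaces characters outside letters, digits, dot, underscore and hyphen with an underscore. The function name split_fasta_by_id and the optional output directory argument are hypothetical, and sequences that share the same first word would overwrite each other.

```shell
# Hypothetical helper: split a fasta into one file per sequence,
# named after the sanitized first word of each header
split_fasta_by_id() {
    local input=$1 outdir=${2:-.}
    awk -v outdir="$outdir" '
        /^>/ {
            if (out != "") close(out)             # avoid too many open files
            id = substr($1, 2)                    # first word, without ">"
            gsub(/[^A-Za-z0-9._-]/, "_", id)      # sanitize for use as file name
            out = outdir "/" id ".fa"
        }
        out != "" { print > out }                 # the full header line is kept
    ' "$input"
}
```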
