This is a collection of the computational methods used for the publication <Link will be included when available>. Here we provide all custom scripts and commands to replicate our results. Furthermore you can download the pre-computed training sets for the RDPclassifier <Link> and UTAX <Link>. If you find this information useful please consider citing our article <Link>.
If you are just interested in the pre-computed training sets go to releases and download the tar-ball. There you can find the training sets in the subfolder training.
This is a list of software tools and versions used in the original analysis. We provide precomputed results for most steps so don’t worry if you can’t get a certain tool in a certain version. If you get substantially different results (e.g. using other versions/tools) feel free to contact me. You can also do so by opening an Issue here on GitHub.
Most of the analyses were performed on a Laptop with
- Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz Processor
- 16GB RAM
- Ubuntu 14.04 operating system
Only the RDP training was performed on a machine with more RAM available.
- ITS2 database (Version 4.0.0 and inofficial release (included))
- UTAX (Version usearch v8.0.1477_i86linux32)
- RDPclassifier (Version master from GitHub (3ba79f222f7a445620549633e603436c2f0eb7ba, 2015-02-20))
- NCBI::Taxonomy - perl module (Version v0.70.5, doi:10.5281/zenodo.17375)
- R (Version 3.1.2 (2014-10-31) – “Pumpkin Helmet”, Platform: x86_64-pc-linux-gnu (64-bit))
- perl (Version v5.20.2 built for x86_64-linux-gnu-thread-multi)
- QIIME (Version 1.8.0+dfsg-4)
The ITS2 sequences can be retrieved from the ITS2 database. The previous reference db consisted of all “Direct folds & Homology modeled” sequences from “Viridiplantae” (db version v4.0.0, 2011). This file is included for comparison. The new reference database consists of all sequences from “Viridiplantae” of an inofficial ITS2 database release (2014). This file is also included in this repository.
Assign NCBI taxonomy on kingdom, phylum, class, order, family, genus and species level by mapping gi to taxid. Create an analysis folder and execute the following commands there (you need the NCBI::Taxonomy module for this):
grep "^>" ../data/viridiplantae_all_2014.fasta |
perl -pe 's/^>(\d+).*/$1/' >viridiplantae_all_2014.gis
perl ../code/gi2taxonomy.pl\
--gis viridiplantae_all_2014.gis\
--out viridiplantae_all_2014.tax\
--species viridiplantae_all_2014.species.taxids\
--genus viridiplantae_all_2014.genus.taxids
This generates the following files:
- viridiplantae_all_2014.gis
- viridiplantae_all_2014.tax
- viridiplantae_all_2014.species.taxids
- viridiplantae_all_2014.genus.taxids
All of those are also included in the precomputed folder.
The following commands executed in the analysis folder generate the required fasta and tax files for RDP and UTAX:
perl ../code/tax2rdp_utax.pl viridiplantae_all_2014.tax\
../data/viridiplantae_all_2014.fasta viridiplantae_all_2014
This generates the following files:
- viridiplantae_all_2014.gi_tax.map
- viridiplantae_all_2014.rdp.fa
- viridiplantae_all_2014.rdp.tax
- viridiplantae_all_2014.utax.fa
- viridiplantae_all_2014.utax.tax
The first three are also included in the precomputed folder. And the last two are included in the training/utax folder. The utax files are ready to be used for classification. However to speed up the initial step a udb file can be created as follows:
usearch8 -makeudb_usearch viridiplantae_all_2014.utax.fa\
-output viridiplantae_all_2014.utax.udb
This creates the file viridiplantae_all_2014.utax.udb which is not included as it is not required and its size is 225MB. To train the RDPclassifier execute the following commands (warning for the train command 16GB RAM did not suffice, but 32 did):
mkdir rdp_trained
java -jar classifier.jar rm-dupseq --infile viridiplantae_all_2014.rdp.fa\
--outfile viridiplantae_all_2014.rdp.rm-dupseq.fa\
--duplicates --min_seq_length 150
java -jar classifier.jar rm-partialseq viridiplantae_all_2014.rdp.fa\
viridiplantae_all_2014.rdp.rm-dupseq.fa\
viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa\
--alignment-mode overlap --min_gaps 50 --knn 20
java -Xmx32g -jar classifier.jar train --out_dir rdp_trained\
--seq viridiplantae_all.rdp.rm-dupseq.rm-partialseq.fa\
--tax_file viridiplantae_all.rdp.tax
cp data/its2.properties rdp_trained/its2.properties
This generates the following files:
- viridiplantae_all_2014.rdp.rm-dupseq.fa
- viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa
All of those are also included in the precomputed folder. And the folder rdp_trained including five files:
- rdp_trained/bergeyTrainingTree.xml
- rdp_trained/genus_wordConditionalProbList.txt
- rdp_trained/its2.properties
- rdp_trained/wordConditionalProbIndexArr.txt
- rdp_trained/logWordPrior.txt
Those are the files required for RDP classification and are included as a tar.gz file in training/rdp
Now you have everything you need to classify sequences with either RDP classifier or UTAX.
Create a folder called raw and download data from EBI SRA repository project accession number PRJEB8640. Extract into separate .fastq files (two for each sample). I assume your directory contains all the samples in the following form: <SampleName>_S<SampleNr>_L001_R<1|2>_001.fastq e.g. PoJ1_S1_L001_R1_001.fastq Where R1 is the file containing forward reads and R2 the file containing reverse reads for each sample.
In the raw folder create a subfolder joined and execute the following commands
qiime
for i in "../*_R1_001.fastq"
do
BASE=$(basename $i _R1_001.fastq)
join_paired_ends.py -f $i -r ../${BASE}_R2_001.fastq -o $BASE
done
This creates a folder for each sample in the form <SampleName>_S<SampleNr>_L001 containing three files:
- fastqjoin.join.fastq
- fastqjoin.un1.fastq
- fastqjoin.un2.fastq
In the raw folder create a subfolder filtered and execute the following commands
for i in ../joined/*
do
BASE=$(basename $i)
usearch8 -fastq_filter $i/fastqjoin.join.fastq\
-fastq_truncqual 19 -fastq_minlen 150 -fastqout $BASE.q20.fq
done
Now you have one .fq file for each sample in the following form <SampleName>_S<SampleNr>_L001.q20.fq with joined and quality filtered reads.
In the raw folder create a subfolder utax and execute the following commands: You can use viridiplantae_all_2014.utax.udb instead of viridiplantae_all_2014.utax.fa if you generated the udb file in the previous steps.
for i in $(find ../filtered -name "*.fq")
do
BASE=$(basename $i .fq)
usearch8 -utax $i -db ../../training/utax/viridiplantae_all_2014.utax.udb\
-utax_rawscore -tt ../../training/utax/viridiplantae_all.utax.tax\
-utaxout $BASE.utax
done
This way you end up with a .utax file for each sample containing the utax classification. Create a subfolder called counts and there execute this:
for i in ../*.utax
do
BASE=$(basename $i .utax)
perl ../../../code/count_taxa_utax.pl --in $i --cutoff 20 >$BASE.count
done
Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:
perl ../../../code/aggregate_counts.pl *.count >utax_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' utax_aggregated_counts.tsv >utax_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' utax_aggregated_counts.tsv >utax_tax_table
The files
- utax_aggregated_counts.tsv
- utax_otu_table
- utax_tax_table
are included in the precomputed folder
In the raw folder create a subfolder rdp and execute the following commands:
for i in $(find ../filtered -name "*.fq")
do
BASE=$(basename i1 .fq)
java -jar classifier.jar classify\
--train_propfile ../../training/rdp/rdp_trained/its2.properties\
--outputFile $BASE.rdp $i
done
This way you end up with a .rdp file for each sample containing the RDP classification. Create a subfolder called counts and there execute this:
for i in ../*.rdp
do
BASE=$(basename $i .rdp)
perl ../../../code/count_taxa_rdp.pl --in $i --cutoff 0.85 >$BASE.count
done
Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:
perl ../../../code/aggregate_counts.pl *.count >rdp_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' rdp_aggregated_counts.tsv >rdp_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' rdp_aggregated_counts.tsv >rdp_tax_table
The files
- rdp_aggregated_counts.tsv
- rdp_otu_table
- rdp_tax_table
are included in the precomputed folder