Name	Name	Last commit message	Last commit date
Latest commit History 20 Commits
code	code
data	data
precomputed	precomputed
training	training
.gitignore	.gitignore
LICENSE	LICENSE
README.org	README.org

Meta-Barcoding Dual-Indexing

About

This is a collection of the computational methods used for the publication <Link will be included when available>. Here we provide all custom scripts and commands to replicate our results. Furthermore you can download the pre-computed training sets for the RDPclassifier <Link> and UTAX <Link>. If you find this information useful please consider citing our article <Link>.

Data

If you are just interested in the pre-computed training sets go to releases and download the tar-ball. There you can find the training sets in the subfolder training.

Dependencies

This is a list of software tools and versions used in the original analysis. We provide precomputed results for most steps so don’t worry if you can’t get a certain tool in a certain version. If you get substantially different results (e.g. using other versions/tools) feel free to contact me. You can also do so by opening an Issue here on GitHub.

Hardware

Most of the analyses were performed on a Laptop with

Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz Processor
16GB RAM
Ubuntu 14.04 operating system

Only the RDP training was performed on a machine with more RAM available.

Third party tools

ITS2 database (Version 4.0.0 and inofficial release (included))
UTAX (Version usearch v8.0.1477_i86linux32)
RDPclassifier (Version master from GitHub (3ba79f222f7a445620549633e603436c2f0eb7ba, 2015-02-20))
NCBI::Taxonomy - perl module (Version v0.70.5, doi:10.5281/zenodo.17375)
R (Version 3.1.2 (2014-10-31) – “Pumpkin Helmet”, Platform: x86_64-pc-linux-gnu (64-bit))
perl (Version v5.20.2 built for x86_64-linux-gnu-thread-multi)
QIIME (Version 1.8.0+dfsg-4)

Workflow

Preparation of Reference Database

Data

The ITS2 sequences can be retrieved from the ITS2 database. The previous reference db consisted of all “Direct folds & Homology modeled” sequences from “Viridiplantae” (db version v4.0.0, 2011). This file is included for comparison. The new reference database consists of all sequences from “Viridiplantae” of an inofficial ITS2 database release (2014). This file is also included in this repository.

Taxonomic assignment

Assign NCBI taxonomy on kingdom, phylum, class, order, family, genus and species level by mapping gi to taxid. Create an analysis folder and execute the following commands there (you need the NCBI::Taxonomy module for this):

grep "^>" ../data/viridiplantae_all_2014.fasta |
 perl -pe 's/^>(\d+).*/$1/' >viridiplantae_all_2014.gis
perl ../code/gi2taxonomy.pl\
 --gis viridiplantae_all_2014.gis\
 --out viridiplantae_all_2014.tax\
 --species viridiplantae_all_2014.species.taxids\
 --genus viridiplantae_all_2014.genus.taxids

This generates the following files:

viridiplantae_all_2014.gis
viridiplantae_all_2014.tax
viridiplantae_all_2014.species.taxids
viridiplantae_all_2014.genus.taxids

All of those are also included in the precomputed folder.

UTAX and RDP training

The following commands executed in the analysis folder generate the required fasta and tax files for RDP and UTAX:

perl ../code/tax2rdp_utax.pl viridiplantae_all_2014.tax\
 ../data/viridiplantae_all_2014.fasta viridiplantae_all_2014

This generates the following files:

viridiplantae_all_2014.gi_tax.map
viridiplantae_all_2014.rdp.fa
viridiplantae_all_2014.rdp.tax
viridiplantae_all_2014.utax.fa
viridiplantae_all_2014.utax.tax

The first three are also included in the precomputed folder. And the last two are included in the training/utax folder. The utax files are ready to be used for classification. However to speed up the initial step a udb file can be created as follows:

usearch8 -makeudb_usearch viridiplantae_all_2014.utax.fa\
 -output viridiplantae_all_2014.utax.udb

This creates the file viridiplantae_all_2014.utax.udb which is not included as it is not required and its size is 225MB. To train the RDPclassifier execute the following commands (warning for the train command 16GB RAM did not suffice, but 32 did):

mkdir rdp_trained

java -jar classifier.jar rm-dupseq --infile viridiplantae_all_2014.rdp.fa\
 --outfile viridiplantae_all_2014.rdp.rm-dupseq.fa\
 --duplicates --min_seq_length 150

java -jar classifier.jar rm-partialseq viridiplantae_all_2014.rdp.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa\
 --alignment-mode overlap --min_gaps 50 --knn 20

java -Xmx32g -jar classifier.jar train --out_dir rdp_trained\
 --seq viridiplantae_all.rdp.rm-dupseq.rm-partialseq.fa\
 --tax_file viridiplantae_all.rdp.tax

cp data/its2.properties rdp_trained/its2.properties

This generates the following files:

viridiplantae_all_2014.rdp.rm-dupseq.fa
viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa

All of those are also included in the precomputed folder. And the folder rdp_trained including five files:

rdp_trained/bergeyTrainingTree.xml
rdp_trained/genus_wordConditionalProbList.txt
rdp_trained/its2.properties
rdp_trained/wordConditionalProbIndexArr.txt
rdp_trained/logWordPrior.txt

Those are the files required for RDP classification and are included as a tar.gz file in training/rdp

Now you have everything you need to classify sequences with either RDP classifier or UTAX.

Comparison of new database to old

Analysis of Pollen Samples

Data

Create a folder called raw and download data from EBI SRA repository project accession number PRJEB8640. Extract into separate .fastq files (two for each sample). I assume your directory contains all the samples in the following form: <SampleName>_S<SampleNr>_L001_R<1|2>_001.fastq e.g. PoJ1_S1_L001_R1_001.fastq Where R1 is the file containing forward reads and R2 the file containing reverse reads for each sample.

Preprocessing

joining

In the raw folder create a subfolder joined and execute the following commands

qiime
for i in "../*_R1_001.fastq"
do
    BASE=$(basename $i _R1_001.fastq)
    join_paired_ends.py -f $i -r ../${BASE}_R2_001.fastq -o $BASE
done

This creates a folder for each sample in the form <SampleName>_S<SampleNr>_L001 containing three files:

fastqjoin.join.fastq
fastqjoin.un1.fastq
fastqjoin.un2.fastq

Q20 filtering

In the raw folder create a subfolder filtered and execute the following commands

for i in ../joined/*
do
    BASE=$(basename $i)
    usearch8 -fastq_filter $i/fastqjoin.join.fastq\
     -fastq_truncqual 19 -fastq_minlen 150 -fastqout $BASE.q20.fq
done

Now you have one .fq file for each sample in the following form <SampleName>_S<SampleNr>_L001.q20.fq with joined and quality filtered reads.

Classification

UTAX

In the raw folder create a subfolder utax and execute the following commands: You can use viridiplantae_all_2014.utax.udb instead of viridiplantae_all_2014.utax.fa if you generated the udb file in the previous steps.

for i in $(find ../filtered -name "*.fq")
do   
    BASE=$(basename $i .fq)
    usearch8 -utax $i -db ../../training/utax/viridiplantae_all_2014.utax.udb\
     -utax_rawscore -tt ../../training/utax/viridiplantae_all.utax.tax\
     -utaxout $BASE.utax
done

This way you end up with a .utax file for each sample containing the utax classification. Create a subfolder called counts and there execute this:

for i in ../*.utax
do
    BASE=$(basename $i .utax)
    perl ../../../code/count_taxa_utax.pl --in $i --cutoff 20 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >utax_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' utax_aggregated_counts.tsv >utax_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' utax_aggregated_counts.tsv >utax_tax_table

The files

utax_aggregated_counts.tsv
utax_otu_table
utax_tax_table

are included in the precomputed folder

RDP classifier

In the raw folder create a subfolder rdp and execute the following commands:

for i in $(find ../filtered -name "*.fq")
do
    BASE=$(basename i1 .fq)
    java -jar classifier.jar classify\
     --train_propfile ../../training/rdp/rdp_trained/its2.properties\
     --outputFile $BASE.rdp $i
done

This way you end up with a .rdp file for each sample containing the RDP classification. Create a subfolder called counts and there execute this:

for i in ../*.rdp
do
    BASE=$(basename $i .rdp)
    perl ../../../code/count_taxa_rdp.pl --in $i --cutoff 0.85 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >rdp_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' rdp_aggregated_counts.tsv >rdp_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' rdp_aggregated_counts.tsv >rdp_tax_table

The files

rdp_aggregated_counts.tsv
rdp_otu_table
rdp_tax_table

are included in the precomputed folder

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Meta-Barcoding Dual-Indexing

About

Data

Dependencies

Hardware

Third party tools

Workflow

Preparation of Reference Database

Data

Taxonomic assignment

UTAX and RDP training

Comparison of new database to old

Analysis of Pollen Samples

Data

Preprocessing

joining

Q20 filtering

Classification

UTAX

RDP classifier

Species accumulation curves

Comparison of utax and RDP

About

Releases 3

Packages

Contributors 4

Languages

License

molbiodiv/meta-barcoding-dual-indexing

Folders and files

Latest commit

History

Repository files navigation

Meta-Barcoding Dual-Indexing

About

Data

Dependencies

Hardware

Third party tools

Workflow

Preparation of Reference Database

Data

Taxonomic assignment

UTAX and RDP training

Comparison of new database to old

Analysis of Pollen Samples

Data

Preprocessing

joining

Q20 filtering

Classification

UTAX

RDP classifier

Species accumulation curves

Comparison of utax and RDP

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 4

Languages

Packages