Skip to content

Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach - Computational methods

License

Notifications You must be signed in to change notification settings

molbiodiv/meta-barcoding-dual-indexing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Meta-Barcoding Dual-Indexing

About

This is a collection of the computational methods used for the publication <Link will be included when available>. Here we provide all custom scripts and commands to replicate our results. Furthermore you can download the pre-computed training sets for the RDPclassifier <Link> and UTAX <Link>. If you find this information useful please consider citing our article <Link>.

Data

If you are just interested in the pre-computed training sets go to releases and download the tar-ball. There you can find the training sets in the subfolder training.

Dependencies

This is a list of software tools and versions used in the original analysis. We provide precomputed results for most steps so don’t worry if you can’t get a certain tool in a certain version. If you get substantially different results (e.g. using other versions/tools) feel free to contact me. You can also do so by opening an Issue here on GitHub.

Hardware

Most of the analyses were performed on a Laptop with

  • Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz Processor
  • 16GB RAM
  • Ubuntu 14.04 operating system

Only the RDP training was performed on a machine with more RAM available.

Third party tools

  • ITS2 database (Version 4.0.0 and inofficial release (included))
  • UTAX (Version usearch v8.0.1477_i86linux32)
  • RDPclassifier (Version master from GitHub (3ba79f222f7a445620549633e603436c2f0eb7ba, 2015-02-20))
  • NCBI::Taxonomy - perl module (Version v0.70.5, doi:10.5281/zenodo.17375)
  • R (Version 3.1.2 (2014-10-31) – “Pumpkin Helmet”, Platform: x86_64-pc-linux-gnu (64-bit))
  • perl (Version v5.20.2 built for x86_64-linux-gnu-thread-multi)
  • QIIME (Version 1.8.0+dfsg-4)

Workflow

Preparation of Reference Database

Data

The ITS2 sequences can be retrieved from the ITS2 database. The previous reference db consisted of all “Direct folds & Homology modeled” sequences from “Viridiplantae” (db version v4.0.0, 2011). This file is included for comparison. The new reference database consists of all sequences from “Viridiplantae” of an inofficial ITS2 database release (2014). This file is also included in this repository.

Taxonomic assignment

Assign NCBI taxonomy on kingdom, phylum, class, order, family, genus and species level by mapping gi to taxid. Create an analysis folder and execute the following commands there (you need the NCBI::Taxonomy module for this):

grep "^>" ../data/viridiplantae_all_2014.fasta |
 perl -pe 's/^>(\d+).*/$1/' >viridiplantae_all_2014.gis
perl ../code/gi2taxonomy.pl\
 --gis viridiplantae_all_2014.gis\
 --out viridiplantae_all_2014.tax\
 --species viridiplantae_all_2014.species.taxids\
 --genus viridiplantae_all_2014.genus.taxids 

This generates the following files:

  • viridiplantae_all_2014.gis
  • viridiplantae_all_2014.tax
  • viridiplantae_all_2014.species.taxids
  • viridiplantae_all_2014.genus.taxids

All of those are also included in the precomputed folder.

UTAX and RDP training

The following commands executed in the analysis folder generate the required fasta and tax files for RDP and UTAX:

perl ../code/tax2rdp_utax.pl viridiplantae_all_2014.tax\
 ../data/viridiplantae_all_2014.fasta viridiplantae_all_2014

This generates the following files:

  • viridiplantae_all_2014.gi_tax.map
  • viridiplantae_all_2014.rdp.fa
  • viridiplantae_all_2014.rdp.tax
  • viridiplantae_all_2014.utax.fa
  • viridiplantae_all_2014.utax.tax

The first three are also included in the precomputed folder. And the last two are included in the training/utax folder. The utax files are ready to be used for classification. However to speed up the initial step a udb file can be created as follows:

usearch8 -makeudb_usearch viridiplantae_all_2014.utax.fa\
 -output viridiplantae_all_2014.utax.udb

This creates the file viridiplantae_all_2014.utax.udb which is not included as it is not required and its size is 225MB. To train the RDPclassifier execute the following commands (warning for the train command 16GB RAM did not suffice, but 32 did):

mkdir rdp_trained

java -jar classifier.jar rm-dupseq --infile viridiplantae_all_2014.rdp.fa\
 --outfile viridiplantae_all_2014.rdp.rm-dupseq.fa\
 --duplicates --min_seq_length 150

java -jar classifier.jar rm-partialseq viridiplantae_all_2014.rdp.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa\
 --alignment-mode overlap --min_gaps 50 --knn 20

java -Xmx32g -jar classifier.jar train --out_dir rdp_trained\
 --seq viridiplantae_all.rdp.rm-dupseq.rm-partialseq.fa\
 --tax_file viridiplantae_all.rdp.tax

cp data/its2.properties rdp_trained/its2.properties

This generates the following files:

  • viridiplantae_all_2014.rdp.rm-dupseq.fa
  • viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa

All of those are also included in the precomputed folder. And the folder rdp_trained including five files:

  • rdp_trained/bergeyTrainingTree.xml
  • rdp_trained/genus_wordConditionalProbList.txt
  • rdp_trained/its2.properties
  • rdp_trained/wordConditionalProbIndexArr.txt
  • rdp_trained/logWordPrior.txt

Those are the files required for RDP classification and are included as a tar.gz file in training/rdp

Now you have everything you need to classify sequences with either RDP classifier or UTAX.

Comparison of new database to old

Analysis of Pollen Samples

Data

Create a folder called raw and download data from EBI SRA repository project accession number PRJEB8640. Extract into separate .fastq files (two for each sample). I assume your directory contains all the samples in the following form: <SampleName>_S<SampleNr>_L001_R<1|2>_001.fastq e.g. PoJ1_S1_L001_R1_001.fastq Where R1 is the file containing forward reads and R2 the file containing reverse reads for each sample.

Preprocessing

joining

In the raw folder create a subfolder joined and execute the following commands

qiime
for i in "../*_R1_001.fastq"
do
    BASE=$(basename $i _R1_001.fastq)
    join_paired_ends.py -f $i -r ../${BASE}_R2_001.fastq -o $BASE
done

This creates a folder for each sample in the form <SampleName>_S<SampleNr>_L001 containing three files:

  • fastqjoin.join.fastq
  • fastqjoin.un1.fastq
  • fastqjoin.un2.fastq

Q20 filtering

In the raw folder create a subfolder filtered and execute the following commands

for i in ../joined/*
do
    BASE=$(basename $i)
    usearch8 -fastq_filter $i/fastqjoin.join.fastq\
     -fastq_truncqual 19 -fastq_minlen 150 -fastqout $BASE.q20.fq
done

Now you have one .fq file for each sample in the following form <SampleName>_S<SampleNr>_L001.q20.fq with joined and quality filtered reads.

Classification

UTAX

In the raw folder create a subfolder utax and execute the following commands: You can use viridiplantae_all_2014.utax.udb instead of viridiplantae_all_2014.utax.fa if you generated the udb file in the previous steps.

for i in $(find ../filtered -name "*.fq")
do   
    BASE=$(basename $i .fq)
    usearch8 -utax $i -db ../../training/utax/viridiplantae_all_2014.utax.udb\
     -utax_rawscore -tt ../../training/utax/viridiplantae_all.utax.tax\
     -utaxout $BASE.utax
done 

This way you end up with a .utax file for each sample containing the utax classification. Create a subfolder called counts and there execute this:

for i in ../*.utax
do
    BASE=$(basename $i .utax)
    perl ../../../code/count_taxa_utax.pl --in $i --cutoff 20 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >utax_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' utax_aggregated_counts.tsv >utax_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' utax_aggregated_counts.tsv >utax_tax_table

The files

  • utax_aggregated_counts.tsv
  • utax_otu_table
  • utax_tax_table

are included in the precomputed folder

RDP classifier

In the raw folder create a subfolder rdp and execute the following commands:

for i in $(find ../filtered -name "*.fq")
do
    BASE=$(basename i1 .fq)
    java -jar classifier.jar classify\
     --train_propfile ../../training/rdp/rdp_trained/its2.properties\
     --outputFile $BASE.rdp $i
done

This way you end up with a .rdp file for each sample containing the RDP classification. Create a subfolder called counts and there execute this:

for i in ../*.rdp
do
    BASE=$(basename $i .rdp)
    perl ../../../code/count_taxa_rdp.pl --in $i --cutoff 0.85 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >rdp_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' rdp_aggregated_counts.tsv >rdp_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' rdp_aggregated_counts.tsv >rdp_tax_table

The files

  • rdp_aggregated_counts.tsv
  • rdp_otu_table
  • rdp_tax_table

are included in the precomputed folder

Species accumulation curves

Comparison of utax and RDP

About

Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach - Computational methods

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •