This repository includes the implementation and experimental data of 'OptTaxPro: Optimized Taxonomic Profiling of microbiome using full-length 16S rRNA sequences'. Please cite our paper if you use our pipeline. Feel free to report any issues so that we can maintain the pipeline.
If you have used OptTaxPro in your research, please cite the following publication:
In Review
We strongly recommend using a Python virtual environment with Anaconda/Miniconda. The details of the environment we used are as follows:
- python 3.9.18
- cutadapt 5.0
- ete3 3.1.3
- numpy 1.26.4
- pandas 2.2.3
- vsearch 2.30.0
- scikit-learn 1.2.2 (only for building custom HSGs)
- scikit-learn-extra 0.3.0 (only for building custom HSGs)
conda update conda  # optional
cd OptTaxPro/
conda env create -f environment.yml -n OptTaxPro
conda activate OptTaxPro
Please make sure that all the required programs are installed successfully.
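One way to confirm the installation is to check from Python that the command-line tools are on `PATH` and that the listed packages can be located. This is a minimal sketch and not part of OptTaxPro; the helper names `missing_tools` and `missing_modules` are hypothetical.

```python
import importlib.util
import shutil


def missing_tools(tools):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules):
    """Return the Python modules that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Tool and module names follow the environment list above.
    print("missing tools:", missing_tools(["cutadapt", "vsearch"]))
    print("missing modules:", missing_modules(["ete3", "numpy", "pandas"]))
```

Both lists should be empty inside the activated OptTaxPro environment.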
The source files and helper scripts are in this repository. The database files are provided in the data directory. Decompress them before use:
cd data/
gunzip *.gz
For convenience, grant execute permission to all scripts:
# At the top level of the repository
chmod +x OptTaxPro/OptTaxPro
find . -name '*.py' -type f | xargs chmod +x
OptTaxPro consists of FOUR main functions: preprocess, cluster, classify, and profile.
The main wrapper script, OptTaxPro, is in the OptTaxPro directory.
In addition, there is an end-to-end function: alltheway.
OptTaxPro {preprocess, cluster, classify, profile, alltheway} [options]
You can find details of the required and optional parameters for each function with the -h option.
OptTaxPro {preprocess, cluster, classify, profile, alltheway} -h
Input sequences are filtered sequentially by adapter removal, quality control, and singleton removal.
export DATA=/path/to/your/data
export PRE=/path/to/filtered/reads
export NUM_CPU="the number of CPUs to use"
# basic command
./OptTaxPro preprocess -i $DATA -o $PRE
# using $NUM_CPU threads
./OptTaxPro preprocess -i $DATA -o $PRE -p $NUM_CPU
# saving log file
./OptTaxPro preprocess -i $DATA -o $PRE -p $NUM_CPU --log preprocessing.log
# adjusting sequence length cutoffs
./OptTaxPro preprocess -i $DATA -o $PRE --min-len 1000 --max-len 1600
# adjusting sequence quality cutoff
./OptTaxPro preprocess -i $DATA -o $PRE --qc-cutoff 0.996
# allowing only exactly matched primers
./OptTaxPro preprocess -i $DATA -o $PRE -e 0.0
# adjusting singleton filtering cutoff
./OptTaxPro preprocess -i $DATA -o $PRE --single-cutoff 0.99
# allowing more threads for singleton filtering
./OptTaxPro preprocess -i $DATA -o $PRE --single-threads 10
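To make the length and quality cutoffs concrete, the sketch below applies a length window and a mean-accuracy threshold to a single read. It is an illustration only: this README does not state how OptTaxPro scores read quality, so the Phred+33 mean-accuracy formula and the helper names `mean_accuracy` and `keep_read` are assumptions. The default values are the example values from the commands above, not necessarily the tool's defaults.

```python
def mean_accuracy(quality):
    """Mean per-base accuracy from a Phred+33 quality string (assumed metric)."""
    if not quality:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - 33) / 10) for ch in quality]
    return 1.0 - sum(error_probs) / len(error_probs)


def keep_read(sequence, quality, min_len=1000, max_len=1600, qc_cutoff=0.996):
    """Keep a read whose length is inside the window and whose accuracy passes."""
    if not min_len <= len(sequence) <= max_len:
        return False
    return mean_accuracy(quality) >= qc_cutoff
```

Under these assumptions, a 1,200-base read with uniform Q40 qualities passes, while a 500-base read fails the length window regardless of its quality.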
OTUs are built using VSEARCH.
export OTU=/path/to/OTUs
# basic command
./OptTaxPro cluster -i $PRE -o $OTU --suffix .clean.fastq
# saving clustering log
./OptTaxPro cluster -i $PRE -o $OTU --suffix .clean.fastq --log cluster.log
# adjusting clustering threshold
./OptTaxPro cluster -i $PRE -o $OTU --suffix .clean.fastq --otu-cutoff 0.97
# allowing more threads for clustering
./OptTaxPro cluster -i $PRE -o $OTU --suffix .clean.fastq --otu-threads 10
Taxonomy is assigned by homology search.
export CLASSIFY=/path/to/classification/results
# basic command
./OptTaxPro classify -i $OTU -o $CLASSIFY
# using $NUM_CPU threads
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU
# saving classification logs
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --log classify.log
# using custom DB
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --db {PATH_TO_CUSTOM_DB}
# adjusting intermediate search cutoffs
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --search-cutoffs 0.97 0.94 0.85 0.75 0.7
# adjusting assignment ranks
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --ranks species genus family
# adjusting assignment cutoffs
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --ranks species genus family --assign-cutoffs 99 94 86
# appending scientific_name column into output table
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --add-name
# expanding results for all queries
./OptTaxPro classify -i $OTU -o $CLASSIFY -p $NUM_CPU --u_dir $OTU --uc-suffix .uc
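The `--u_dir` and `--uc-suffix` options point at VSEARCH cluster files in UC format, where `H` records link each clustered read (column 9) to its centroid (column 10) and `S` records mark the centroids themselves. The sketch below shows how per-OTU results could be expanded to every read with that mapping. How OptTaxPro does this internally is not described here, and both function names are hypothetical.

```python
def read_to_centroid(uc_lines):
    """Map every read label to the label of its cluster centroid."""
    mapping = {}
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":      # centroid: represents itself
            mapping[query] = query
        elif record_type == "H":    # hit: read assigned to a centroid
            mapping[query] = target
    return mapping


def expand_assignments(otu_assignments, mapping):
    """Copy each centroid's assignment to all reads in its cluster."""
    return {read: otu_assignments[centroid]
            for read, centroid in mapping.items()
            if centroid in otu_assignments}
```

`C` (cluster summary) records are ignored because they repeat information already carried by the `S` and `H` records.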
Columns of the taxonomy assignment file:
Column | Description |
---|---|
seqid | Query sequence ID |
rank | Classified rank |
taxid | NCBI taxonomy ID of the best hit(s) |
identity | Identity of the best hit(s) |
assigned | taxid if identity > cutoff for the rank, otherwise -1 |
scientific_name | Scientific name of the assigned taxid (unclassified for -1) |
NOTE: The 'scientific_name' column is only added when the '--add-name' option is given.
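The rule behind the `assigned` column can be written in a few lines. This is a sketch of the rule as the table words it, with a strict `>` comparison; the actual implementation may handle boundary values differently. The function name `assign_taxid` is hypothetical, and the cutoffs are the example values from the `--ranks species genus family --assign-cutoffs 99 94 86` command.

```python
def assign_taxid(taxid, identity, rank, cutoffs):
    """Return taxid when identity exceeds the rank's cutoff, otherwise -1."""
    return taxid if identity > cutoffs[rank] else -1


# Example cutoffs (percent identity) from the --assign-cutoffs example above.
CUTOFFS = {"species": 99, "genus": 94, "family": 86}
```

For example, `assign_taxid(562, 99.4, "species", CUTOFFS)` returns `562`, while `assign_taxid(562, 97.0, "species", CUTOFFS)` returns `-1`.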
The taxonomic profile is estimated from the taxonomy assignment results.
export PROFILE=/path/to/profiling/outputs
# basic command
./OptTaxPro profile -i $CLASSIFY -o $PROFILE
# using $NUM_CPU threads
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU
# saving profiling logs
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --log profile.log
# convert taxid into scientific name
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --base-col scientific_name
# report additional ranks
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --base-col scientific_name --profile-ranks species HSG genus family
# apply multiple filtering cutoffs
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --filtering-cutoffs 0 0.5 1.5
# adjusting filtering method
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --filtering-pivot mean
# modify prefix of output tables
./OptTaxPro profile -i $CLASSIFY -o $PROFILE -p $NUM_CPU --output-prefix mysample
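As a rough picture of what a profile table holds, the sketch below turns per-read assignments into relative abundances (percent) and drops taxa below a cutoff. It assumes simple read counting and a percent cutoff. OptTaxPro's actual estimation and the exact semantics of `--filtering-cutoffs` and `--filtering-pivot` are not described in this README, so the function `relative_abundance` and its behaviour are hypothetical.

```python
from collections import Counter


def relative_abundance(assignments, min_percent=0.0):
    """Percent abundance per taxon, keeping taxa at or above min_percent."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    profile = {taxon: 100.0 * n / total for taxon, n in counts.items()}
    return {taxon: pct for taxon, pct in profile.items() if pct >= min_percent}
```

Percentages are computed before filtering, so the values that remain after a cutoff need not sum to 100.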
Users can run the entire workflow with a single command.
export DATA=/path/to/your/data
export OUTPUT=/path/to/profiling/outputs
export NUM_CPU="the number of CPUs to use"
# basic command
./OptTaxPro alltheway -i $DATA -o $OUTPUT --base-col scientific_name --output-prefix mysample
# using $NUM_CPU threads
./OptTaxPro alltheway -i $DATA -o $OUTPUT --base-col scientific_name --output-prefix mysample -p $NUM_CPU
# saving all logs
./OptTaxPro alltheway -i $DATA -o $OUTPUT --base-col scientific_name --output-prefix mysample -p $NUM_CPU --log alltheway.log
Users can adjust and save all optional parameters in a configuration file. An example configuration file is included in the data directory: data/config.cfg
# preprocess
./OptTaxPro preprocess -c config.cfg
# cluster
./OptTaxPro cluster -c config.cfg
# classify
./OptTaxPro classify -c config.cfg
# profile
./OptTaxPro profile -c config.cfg
# alltheway
./OptTaxPro alltheway -c config.cfg
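When the steps are run separately with one configuration file, a small driver script keeps them in order and stops at the first failure. The sketch below only assembles the commands shown above; `build_commands` and `run_steps` are hypothetical helpers, and the default is a dry run that prints the commands without executing them.

```python
import subprocess

STEPS = ["preprocess", "cluster", "classify", "profile"]


def build_commands(config_path, executable="./OptTaxPro", steps=STEPS):
    """Return one argument list per step, each using the same config file."""
    return [[executable, step, "-c", config_path] for step in steps]


def run_steps(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

Calling `run_steps(build_commands("config.cfg"))` prints the four commands in order; pass `dry_run=False` to execute them.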
We provide a script for building custom Homologous Species Groups (HSGs). The script is placed in the scripts directory.
The script runs the HSG-building algorithm, which takes 16S sequences as input and produces an HSG table as output.
NOTE1. All sequences must include an NCBI accession number at the beginning of their header
(the script uses it to detect each sequence's genus and split the sequences by genus automatically)
USAGE
build_HSG.py \
-i examples/build_HSGs/three_genera.fna \
-a OptTaxPro/data/acc2taxid.txt \
-o three_genera.HSG.csv
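Because the script relies on an accession number at the start of each header, it can help to screen a FASTA file before building HSGs. The check below is a sketch: the regular expression is a loose approximation of NCBI accession formats (for example `NR_123456.1` or `AB123456.1`), not an official pattern, and `headers_without_accession` is a hypothetical helper that is not part of build_HSG.py.

```python
import re

# Loose pattern for accessions such as "NR_123456.1" or "AB123456.1" (assumed).
ACCESSION = re.compile(r"^[A-Z]+_?[A-Z]*\d+(\.\d+)?$")


def headers_without_accession(fasta_lines):
    """Return FASTA headers whose first token does not look like an accession."""
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens or not ACCESSION.match(tokens[0]):
            bad.append(line.rstrip("\n"))
    return bad
```

An empty result means every header starts with something that looks like an accession; it does not confirm that the accession is present in acc2taxid.txt.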
This is an example OptTaxPro workflow for classifying the simulated datasets. Example data are provided in the examples/OptTaxPro directory. These are the same simulated sequences that were used in the original article.
NOTE These simulated reads are already preprocessed, so DO NOT run the preprocessing step on them (all reads would be discarded because they lack primer sequences).
# making temporary directory
mkdir test_run
# Building OTUs first
./OptTaxPro/OptTaxPro cluster \
-i examples/OptTaxPro \
-o test_run/OTUs \
--suffix .pbsim.fasta \
--log test_run.log
# Performing Taxonomic assignments
./OptTaxPro/OptTaxPro classify \
-i test_run/OTUs \
-o test_run/classify \
--t_dir test_run \
--log test_run.log \
--search-cutoffs 0.97 \
--ranks species HSG genus \
--assign-cutoffs 97 97 94 \
--remove-self \
--u_dir test_run/OTUs \
--add-name
# Making profile table
./OptTaxPro/OptTaxPro profile \
-i test_run/classify \
-o test_run/profile \
--log test_run.log \
--profile-ranks species HSG genus \
--base-col scientific_name \
--output-prefix simulated
NOTE Before running OptTaxPro, download the samples from NCBI SRA under accession number PRJNA933120 and save them in the examples/PRJNA933120 directory.
# making temporary directory
mkdir test_realworld
# Run end-to-end process using config file
./OptTaxPro/OptTaxPro alltheway \
-c OptTaxPro/data/config.cfg
# At the top level of the repository
./scripts/build_HSG.py \
-s examples/build_HSGs/three_genera.fna \
-a OptTaxPro/data/acc2taxid.txt \
-o HSG_three_genera.csv