This project provides a complete, end-to-end pipeline for the taxonomic classification of environmental DNA (eDNA) sequences. It is specifically designed to address the challenges of analyzing eDNA from environments with poor database representation, such as the deep sea.
The pipeline combines three components in a hybrid approach:
- A Supervised Deep Learning Model (CNN with ResNet and Attention) is trained to accurately classify known eukaryotic taxa based on their V4 barcode region.
- An Unsupervised Deep Learning Model (a sequence autoencoder) learns to create numerical "fingerprints" of sequences, which are then used with HDBSCAN to discover and cluster novel, unknown taxa.
- An Integrated Hybrid Engine routes each sequence: the supervised model acts as a "gatekeeper", and sequences it cannot classify with high confidence go to the unsupervised model for discovery. The result covers both known and unknown biodiversity.
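The routing idea can be sketched in a few lines. This is a minimal illustration, not the code in src/pipeline.py: the function name `route_by_confidence`, the `>=` comparison, and the example taxa are assumptions, and the default of 0.7 simply mirrors the example value of `--confidence_threshold` shown later in this README.

```python
from typing import Dict, List, Tuple


def route_by_confidence(
    predictions: Dict[str, Tuple[str, float]],
    confidence_threshold: float = 0.7,
) -> Tuple[Dict[str, Tuple[str, float]], List[str]]:
    """Split sequences into confident classifications and discovery candidates.

    `predictions` maps a sequence ID to (predicted label, top probability),
    as a supervised classifier might report them (hypothetical structure).
    """
    classified = {}
    needs_discovery = []
    for seq_id, (label, confidence) in predictions.items():
        if confidence >= confidence_threshold:
            # The gatekeeper accepts the supervised prediction.
            classified[seq_id] = (label, confidence)
        else:
            # Low confidence: hand the sequence to the unsupervised path.
            needs_discovery.append(seq_id)
    return classified, needs_discovery


# Made-up predictions for illustration only.
example = {
    "seq_1": ("Taxon_A", 0.95),
    "seq_2": ("Taxon_B", 0.40),
    "seq_3": ("Taxon_A", 0.70),
}
classified, needs_discovery = route_by_confidence(example)
print(sorted(classified))  # ['seq_1', 'seq_3']
print(needs_discovery)     # ['seq_2']
```

Only the sequences in `needs_discovery` would then be encoded by the autoencoder and clustered with HDBSCAN.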
The project is organized into a clean, modular structure for clarity and reproducibility.
eDNA-classifier/
├── data/
│   ├── raw/
│   │   └── SILVA_DATABASE.fasta      # Place raw database file here
│   └── processed/
│       └── training_barcodes.fasta   # Generated by the 'prepare-data' step
├── models/
│   ├── supervised/                   # Stores the trained supervised model
│   └── unsupervised/                 # Stores the trained unsupervised model
├── results/
│   └── sample_results.csv            # Example output file
├── src/
│   ├── data_preparation.py           # Module for filtering and V4 extraction
│   ├── supervised_model.py           # Module defining the supervised classifier
│   ├── unsupervised_model.py         # Module defining the unsupervised autoencoder
│   └── pipeline.py                   # Module for the final hybrid engine
├── main.py                           # The main controller script to run all commands
├── requirements.txt                  # Project dependencies
└── README.md                         # This file
Follow these steps to set up the project environment.
- Clone the repository:
  git clone <your-repository-url>
  cd eDNA-classifier
- Create and activate a Python virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate
- Install the required libraries:
  pip install -r requirements.txt
- Add your data: place your raw, full reference database (e.g., the FASTA file downloaded from SILVA) inside the data/raw/ directory.
The entire pipeline is controlled by main.py using simple commands. The steps must be run in the following order.
This is a one-time step that reads your raw database, filters for Eukaryote sequences, and extracts the V4 barcode region. This may take a while.
python main.py prepare-data \
--input_fasta data/raw/SILVA_DATABASE.fasta \
--output_fasta data/processed/training_barcodes.fasta
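To give a feel for the filtering part of this step, here is a minimal sketch using only the standard library. It is not the code in src/data_preparation.py. It assumes SILVA-style headers of the form `<accession> <Domain>;<rank>;...`, and the helper names `read_fasta` and `is_eukaryote` as well as the sample records are made up. V4 extraction is left out because this README does not describe how it is done.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_eukaryote(header: str) -> bool:
    """Assumes the taxonomy path follows the first space and starts with the domain."""
    parts = header.split(maxsplit=1)
    return len(parts) == 2 and parts[1].startswith("Eukaryota;")


# Made-up records in the assumed header layout.
sample = """>AB000001.1.1800 Eukaryota;Archaeplastida;Chloroplastida
ACGU
ACGU
>AB000002.1.1500 Bacteria;Proteobacteria
GGCC
""".splitlines()

kept = [(h, s) for h, s in read_fasta(sample) if is_eukaryote(h)]
print(len(kept), kept[0][1])  # 1 ACGUACGU
```

For a real database file, the same functions could be fed an open file handle instead of the in-memory sample.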
This trains the expert classifier on the clean V4 barcode dataset. Training will stop automatically when the model's performance plateaus.
python main.py train-supervised --fasta_file data/processed/training_barcodes.fasta
This will save the trained model to models/supervised/.
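Stopping "when the model's performance plateaus" is commonly implemented as patience-based early stopping on a validation metric. The sketch below shows that general technique under stated assumptions. The class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss values are illustrative and may not match what src/supervised_model.py actually does.

```python
class EarlyStopping:
    """Signal a stop once the loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses that plateau after the third epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.65, 0.67, 0.64]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 6
```

In this run the best loss (0.65) is reached at epoch 3, the next three epochs bring no improvement, and training stops at epoch 6 before the final value is ever seen.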
This trains the autoencoder which learns to create numerical "fingerprints" of the V4 sequences.
python main.py train-unsupervised --fasta_file data/processed/training_barcodes.fasta
This will save the trained encoder to models/unsupervised/.
Once both models are trained, you can use the pipeline to analyze your own eDNA samples.
python main.py run-pipeline \
--fasta_file path/to/your/sample.fasta \
--output_file results/my_sample_results.csv
You can tune the pipeline's behavior with additional arguments:
python main.py run-pipeline \
--fasta_file path/to/your/sample.fasta \
--output_file results/my_sample_results.csv \
--confidence_threshold 0.7 \
--min_cluster_size 5
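The two arguments act at different stages: `--confidence_threshold` decides which sequences leave the supervised path, and `--min_cluster_size` is presumably passed to HDBSCAN as the smallest group that counts as a cluster. HDBSCAN labels points it cannot assign to any cluster as -1. The sketch below shows how such labels could be turned into the statuses reported by the pipeline. The function name `label_discovery_results` and the empty prediction for unclustered sequences are assumptions, not the project's actual behavior.

```python
from typing import Dict, List, Tuple


def label_discovery_results(
    seq_ids: List[str], cluster_labels: List[int]
) -> Dict[str, Tuple[str, str]]:
    """Map HDBSCAN-style cluster labels to (Status, Prediction) pairs."""
    results = {}
    for seq_id, label in zip(seq_ids, cluster_labels):
        if label == -1:
            # -1 is HDBSCAN's noise label: no cluster of sufficient size was found.
            results[seq_id] = ("Unclustered", "")
        else:
            results[seq_id] = ("Novel_Cluster", f"Novel_Cluster_{label}")
    return results


# Made-up labels, as a clustering run over four low-confidence sequences might return them.
statuses = label_discovery_results(
    ["seq_2", "seq_5", "seq_7", "seq_9"], [0, 0, -1, 1]
)
print(statuses["seq_2"])  # ('Novel_Cluster', 'Novel_Cluster_0')
print(statuses["seq_7"])  # ('Unclustered', '')
```

Raising `--min_cluster_size` would generally push more sequences into the noise group, while lowering it allows smaller groups to be reported as novel clusters.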
The run-pipeline command produces two reports:
- A CSV file in the results/ folder, which contains the main classification data with the following columns:
  - Sequence_ID: The original identifier from your sample FASTA file.
  - Status: The final classification status.
    - Classified: The supervised model made a high-confidence prediction.
    - Novel_Cluster: The sequence had low confidence and was grouped with other similar unknown sequences.
    - Unclustered: The sequence had low confidence and was not similar enough to any other sequence to form a cluster (noise).
  - Prediction: The predicted species name or the assigned novel cluster ID (e.g., Novel_Cluster_0).
  - Confidence: The probability score (0.0 to 1.0) from the supervised model.
- A report in your console summarizing the results and listing any sequences that were removed by the initial Quality Control filter and the reason for their removal.
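For downstream analysis, the CSV can be read with the standard library alone. The following sketch tallies sequences per status. It assumes the file is comma-separated with a header row that uses the column names listed above. The sample rows and the helper name `summarize_status` are made up.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed layout of the results file.
sample_csv = """Sequence_ID,Status,Prediction,Confidence
seq_1,Classified,Taxon_A,0.95
seq_2,Novel_Cluster,Novel_Cluster_0,0.41
seq_3,Novel_Cluster,Novel_Cluster_0,0.38
seq_4,Unclustered,,0.22
"""


def summarize_status(handle) -> Counter:
    """Count how many sequences ended up in each Status category."""
    reader = csv.DictReader(handle)
    return Counter(row["Status"] for row in reader)


counts = summarize_status(io.StringIO(sample_csv))
print(counts["Classified"], counts["Novel_Cluster"], counts["Unclustered"])  # 1 2 1
```

To run it on real output, pass a file handle opened with `open("results/my_sample_results.csv", newline="")` instead of the in-memory sample.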