AI-Driven Hybrid eDNA Classifier for Biodiversity Assessment

This project provides a complete, end-to-end pipeline for the taxonomic classification of environmental DNA (eDNA) sequences. It is specifically designed to address the challenges of analyzing eDNA from environments with poor database representation, such as the deep sea.

The pipeline combines three components:

- A Supervised Deep Learning Model (CNN with ResNet and Attention) is trained to classify known eukaryotic taxa based on their V4 barcode region.
- An Unsupervised Deep Learning Model (a sequence autoencoder) learns to create numerical "fingerprints" of sequences, which are then clustered with HDBSCAN to discover and group novel, unknown taxa.
- An Integrated Hybrid Engine routes each sequence: the supervised model acts as a "gatekeeper", and sequences it cannot classify with high confidence are passed to the unsupervised model for discovery. The result is a single analysis covering both known and unknown biodiversity.
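
The gatekeeper idea can be illustrated with a minimal sketch. It is not the code in src/pipeline.py: the `route` function, the `Routed` tuple, and the intermediate `Pending_Discovery` status are hypothetical names, and the 0.7 threshold is only an illustrative value.

```python
from typing import Dict, NamedTuple, Optional


class Routed(NamedTuple):
    sequence_id: str
    status: str
    prediction: Optional[str]
    confidence: float


def route(sequence_id: str, probabilities: Dict[str, float],
          threshold: float = 0.7) -> Routed:
    """Decide whether a sequence is accepted by the supervised model.

    `probabilities` maps each known taxon to the supervised model's
    probability for this sequence (hypothetical input format).
    """
    # Take the most probable taxon and its probability.
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return Routed(sequence_id, "Classified", label, confidence)
    # Low confidence: hand the sequence over to the discovery (clustering) path.
    return Routed(sequence_id, "Pending_Discovery", None, confidence)
```

Sequences left in the pending state would then be embedded by the autoencoder and clustered, ending up as either a novel cluster or noise.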

📂 Project Structure

The project is organized into a clean, modular structure for clarity and reproducibility.

eDNA-classifier/
├── data/
│   ├── raw/
│   │   └── SILVA_DATABASE.fasta      # Place raw database file here
│   └── processed/
│       └── training_barcodes.fasta   # Generated by the 'prepare-data' step
├── models/
│   ├── supervised/                   # Stores the trained supervised model
│   └── unsupervised/                 # Stores the trained unsupervised model
├── results/
│   └── sample_results.csv            # Example output file
├── src/
│   ├── data_preparation.py           # Module for filtering and V4 extraction
│   ├── supervised_model.py           # Module defining the supervised classifier
│   ├── unsupervised_model.py         # Module defining the unsupervised autoencoder
│   └── pipeline.py                   # Module for the final hybrid engine
├── main.py                           # The main controller script to run all commands
├── requirements.txt                  # Project dependencies
└── README.md                         # This file

🚀 Setup and Installation

Follow these steps to set up the project environment.

Clone the repository:

git clone <your-repository-url>
cd eDNA-classifier

Create and activate a Python virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install the required libraries:
```
pip install -r requirements.txt
```
Add Your Data: Place your raw, full reference database (e.g., the FASTA file downloaded from SILVA) inside the data/raw/ directory.

Workflow and Usage

The entire pipeline is controlled by main.py using simple commands. The steps must be run in the following order.

Step 1: Prepare the Training Data

This one-time step reads your raw database, keeps only the eukaryotic sequences, and extracts the V4 barcode region from each of them. It may take a while.

python main.py prepare-data \
    --input_fasta data/raw/SILVA_DATABASE.fasta \
    --output_fasta data/processed/training_barcodes.fasta
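
As a rough sketch of the filtering half of this step, the snippet below parses FASTA records and keeps those whose taxonomy starts with Eukaryota. It assumes SILVA-style headers of the form `>ACCESSION Domain;Rank;...`. The helper names `read_fasta` and `keep_eukaryotes` are hypothetical and are not taken from src/data_preparation.py, and V4 extraction is not shown.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_eukaryotes(records: Iterable[Record]) -> Iterator[Record]:
    """Keep records whose taxonomy string begins with 'Eukaryota'.

    Assumption: the taxonomy follows the first space in the header and
    is semicolon-separated, with the domain as its first field.
    """
    for header, seq in records:
        parts = header.split(" ", 1)
        taxonomy = parts[1] if len(parts) > 1 else ""
        if taxonomy.split(";")[0].strip() == "Eukaryota":
            yield header, seq
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, so a large database is streamed rather than loaded into memory.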

Step 2: Train the Supervised Model

This trains the expert classifier on the clean V4 barcode dataset. Training will stop automatically when the model's performance plateaus.

python main.py train-supervised --fasta_file data/processed/training_barcodes.fasta

This will save the trained model to models/supervised/.
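
The automatic stop described above is commonly implemented as patience-based early stopping on a validation metric. The sketch below shows that general technique and is not the project's training loop. The `EarlyStopping` class, the `epochs_run` helper, and the patience value are assumptions.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience: int = 3) -> int:
    """Return the 1-based epoch at which training would stop."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```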

Step 3: Train the Unsupervised Model

This trains the autoencoder which learns to create numerical "fingerprints" of the V4 sequences.

python main.py train-unsupervised --fasta_file data/processed/training_barcodes.fasta

This will save the trained encoder to models/unsupervised/.
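
Before a neural network can process a DNA sequence, the sequence has to be turned into numbers. One common choice is one-hot encoding with padding to a fixed length, sketched below. This README does not state which encoding the project uses, so the `one_hot` function and its handling of ambiguous bases are assumptions for illustration only.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(seq: str, length: int) -> List[List[float]]:
    """Encode a sequence as a `length` x 4 matrix of 0.0/1.0 values.

    RNA 'U' is treated as 'T'. Bases outside A/C/G/T (for example 'N')
    and padding positions become all-zero rows.
    """
    seq = seq.upper().replace("U", "T")
    rows = [[1.0 if base == b else 0.0 for b in ALPHABET]
            for base in seq[:length]]  # truncate sequences that are too long
    # Pad short sequences with zero rows up to the fixed length.
    rows.extend([0.0] * len(ALPHABET) for _ in range(length - len(rows)))
    return rows
```

An autoencoder trained on such matrices compresses each one into a short vector, which plays the role of the "fingerprint" used for clustering.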

Step 4: Run the Pipeline on a Sample

Once both models are trained, you can use the pipeline to analyze your own eDNA samples.

python main.py run-pipeline \
    --fasta_file path/to/your/sample.fasta \
    --output_file results/my_sample_results.csv

You can tune the pipeline's behavior with additional arguments:

python main.py run-pipeline \
    --fasta_file path/to/your/sample.fasta \
    --output_file results/my_sample_results.csv \
    --confidence_threshold 0.7 \
    --min_cluster_size 5
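
The `--min_cluster_size` argument corresponds to the HDBSCAN parameter of the same name: groups smaller than this are not reported as clusters, and HDBSCAN implementations label such points as noise with the label -1. The sketch below shows how cluster labels could be mapped to the statuses used in the output. The `label_to_result` helper is hypothetical, and leaving the prediction empty for noise is an assumption.

```python
from typing import List, Optional, Tuple


def label_to_result(cluster_label: int) -> Tuple[str, Optional[str]]:
    """Map one HDBSCAN label to a (Status, Prediction) pair."""
    if cluster_label == -1:
        # -1 is HDBSCAN's noise label: the point joined no cluster.
        return ("Unclustered", None)
    return ("Novel_Cluster", f"Novel_Cluster_{cluster_label}")


def labels_to_results(labels: List[int]) -> List[Tuple[str, Optional[str]]]:
    """Apply the mapping to every low-confidence sequence's label."""
    return [label_to_result(label) for label in labels]
```

Raising `--min_cluster_size` generally moves more sequences into the Unclustered group, while lowering it allows smaller groups to be reported as novel clusters.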

📈 Understanding the Output

The run-pipeline command produces two reports:

- A CSV file in the results/ folder, which contains the main classification data with the following columns:
  - Sequence_ID: The original identifier from your sample FASTA file.
  - Status: The final classification status.
    - Classified: The supervised model made a high-confidence prediction.
    - Novel_Cluster: The sequence had low confidence and was grouped with other similar unknown sequences.
    - Unclustered: The sequence had low confidence and was not similar enough to any other sequence to form a cluster (noise).
  - Prediction: The predicted species name or the assigned novel cluster ID (e.g., Novel_Cluster_0).
  - Confidence: The probability score (0.0 to 1.0) from the supervised model.
- A report in your console summarizing the results and listing any sequences that were removed by the initial Quality Control filter, along with the reason for their removal.
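
For a quick look at a results file, the Status column can be tallied with the standard library. This is a minimal sketch that relies only on the column names listed above. The `summarize` function is a hypothetical helper, and the rows in the usage example are made-up placeholders, not real output.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def summarize(csv_file: TextIO) -> Counter:
    """Count how many sequences fall under each Status value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Status"] for row in reader)


# Usage with an in-memory example (placeholder rows for illustration).
example = io.StringIO(
    "Sequence_ID,Status,Prediction,Confidence\n"
    "seq1,Classified,TaxonA,0.93\n"
    "seq2,Novel_Cluster,Novel_Cluster_0,0.41\n"
    "seq3,Novel_Cluster,Novel_Cluster_0,0.38\n"
    "seq4,Unclustered,,0.22\n"
)
counts = summarize(example)
```

For a real file, pass an open handle instead, for example `open("results/my_sample_results.csv", newline="")`.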

Codymwon/AbyssML