Codymwon/AbyssML
AI-Driven Hybrid eDNA Classifier for Biodiversity Assessment

This project provides an end-to-end pipeline for the taxonomic classification of environmental DNA (eDNA) sequences. It is designed for eDNA from environments with poor reference-database representation, such as the deep sea.

The pipeline uses a sophisticated hybrid AI approach:

  1. A Supervised Deep Learning Model (CNN with ResNet and Attention) is trained to accurately classify known eukaryotic taxa based on their V4 barcode region.
  2. An Unsupervised Deep Learning Model (a sequence autoencoder) learns to create numerical "fingerprints" of sequences, which are then used with HDBSCAN to discover and cluster novel, unknown taxa.
  3. An Integrated Hybrid Engine intelligently routes sequences, using the supervised model as a "gatekeeper" and the unsupervised model for discovery, providing a comprehensive analysis of both known and unknown biodiversity.
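
The gatekeeper routing described above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual code: the function and argument names are assumptions, the 0.7 default matches the --confidence_threshold example later in this README, and HDBSCAN's convention of labeling noise points -1 is used for unclustered sequences.

```python
# Illustrative sketch of the hybrid "gatekeeper" routing. All names here
# are assumptions for explanation, not the project's actual API.
def route_sequence(confidence, predicted_taxon, cluster_id, threshold=0.7):
    """Return (Status, Prediction) for one sequence.

    confidence      -- supervised model's top-class probability (0.0 to 1.0)
    predicted_taxon -- supervised model's predicted label
    cluster_id      -- HDBSCAN label for the sequence's fingerprint
                       (-1 means noise, i.e. no cluster)
    """
    if confidence >= threshold:
        # High confidence: trust the supervised classifier.
        return "Classified", predicted_taxon
    if cluster_id >= 0:
        # Low confidence, but grouped with similar unknown sequences.
        return "Novel_Cluster", f"Novel_Cluster_{cluster_id}"
    # Low confidence and HDBSCAN called it noise.
    return "Unclustered", "N/A"
```
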

📂 Project Structure

The project is organized into a clean, modular structure for clarity and reproducibility.

eDNA-classifier/
├── data/
│   ├── raw/
│   │   └── SILVA_DATABASE.fasta      # Place raw database file here
│   └── processed/
│       └── training_barcodes.fasta   # Generated by the 'prepare-data' step
├── models/
│   ├── supervised/                   # Stores the trained supervised model
│   └── unsupervised/                 # Stores the trained unsupervised model
├── results/
│   └── sample_results.csv            # Example output file
├── src/
│   ├── data_preparation.py           # Module for filtering and V4 extraction
│   ├── supervised_model.py           # Module defining the supervised classifier
│   ├── unsupervised_model.py         # Module defining the unsupervised autoencoder
│   └── pipeline.py                   # Module for the final hybrid engine
├── main.py                           # The main controller script to run all commands
├── requirements.txt                  # Project dependencies
└── README.md                         # This file

🚀 Setup and Installation

Follow these steps to set up the project environment.

  1. Clone the repository:

    git clone <your-repository-url>
    cd eDNA-classifier
  2. Create and activate a Python virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install the required libraries:

    pip install -r requirements.txt
  4. Add Your Data: Place your raw, full reference database (e.g., the FASTA file downloaded from SILVA) inside the data/raw/ directory.


⚙️ Workflow and Usage

The entire pipeline is controlled by main.py using simple commands. The steps must be run in the following order.

Step 1: Prepare the Training Data

This is a one-time step that reads your raw database, filters for Eukaryote sequences, and extracts the V4 barcode region. This may take a while.

python main.py prepare-data \
    --input_fasta data/raw/SILVA_DATABASE.fasta \
    --output_fasta data/processed/training_barcodes.fasta
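
As a rough picture of what this step does, a minimal eukaryote filter might look like the sketch below. It assumes SILVA-style FASTA headers that carry the taxonomy string (e.g. a header containing "Eukaryota;..."); the real data_preparation.py also extracts the V4 region, which is omitted here.

```python
# Minimal sketch of the eukaryote filter (illustrative only; the actual
# prepare-data step also performs V4 barcode extraction).
def filter_eukaryotes(lines):
    """Yield (header, sequence) pairs whose header mentions Eukaryota."""
    header, seq = None, []
    for line in list(lines) + [">"]:      # trailing ">" flushes last record
        if line.startswith(">"):
            if header and "Eukaryota" in header:
                yield header, "".join(seq)
            header, seq = line.rstrip(), []
        else:
            seq.append(line.strip())
```
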

Step 2: Train the Supervised Model

This trains the expert classifier on the clean V4 barcode dataset. Training will stop automatically when the model's performance plateaus.

python main.py train-supervised --fasta_file data/processed/training_barcodes.fasta

This will save the trained model to models/supervised/.
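
The "stop when performance plateaus" behavior is standard early stopping. The sketch below shows the idea in plain Python; the repository presumably uses a framework callback, and the function name and the patience=3 default are assumptions.

```python
# Illustrative early-stopping logic (assumed behavior, not the repo's code).
def epochs_until_stop(val_losses, patience=3):
    """Return the 1-based epoch at which early stopping would halt training."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch          # plateau reached: stop here
    return len(val_losses)            # never plateaued: ran all epochs
```
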

Step 3: Train the Unsupervised Model

This trains the autoencoder, which learns to create numerical "fingerprints" of the V4 sequences.

python main.py train-unsupervised --fasta_file data/processed/training_barcodes.fasta

This will save the trained encoder to models/unsupervised/.
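
The learned fingerprints are embeddings produced by the trained encoder, so they can't be reproduced here, but the toy k-mer counter below illustrates the underlying idea: mapping a variable-length sequence to a fixed-length numeric vector. This is an analogy only; the pipeline's fingerprints come from the autoencoder, not from k-mer counts.

```python
from itertools import product

def kmer_vector(seq, k=2):
    """Toy fixed-length 'fingerprint': counts of each k-mer over A/C/G/T."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:            # skip k-mers with ambiguous bases
            counts[kmer] += 1
    return [counts[km] for km in kmers]
```
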

Step 4: Run the Pipeline on a Sample

Once both models are trained, you can use the pipeline to analyze your own eDNA samples.

python main.py run-pipeline \
    --fasta_file path/to/your/sample.fasta \
    --output_file results/my_sample_results.csv

You can tune the pipeline's behavior with additional arguments: --confidence_threshold sets the minimum supervised-model probability required to accept a prediction (sequences below it are routed to the discovery branch), and --min_cluster_size sets the smallest group of sequences HDBSCAN will report as a novel cluster:

python main.py run-pipeline \
    --fasta_file path/to/your/sample.fasta \
    --output_file results/my_sample_results.csv \
    --confidence_threshold 0.7 \
    --min_cluster_size 5

📈 Understanding the Output

The run-pipeline command produces two reports:

  1. A CSV file in the results/ folder, which contains the main classification data with the following columns:

    • Sequence_ID: The original identifier from your sample FASTA file.
    • Status: The final classification status.
      • Classified: The supervised model made a high-confidence prediction.
      • Novel_Cluster: The sequence had low confidence and was grouped with other similar unknown sequences.
      • Unclustered: The sequence had low confidence and was not similar enough to any other sequence to form a cluster (noise).
    • Prediction: The predicted species name or the assigned novel cluster ID (e.g., Novel_Cluster_0).
    • Confidence: The probability score (0.0 to 1.0) from the supervised model.
  2. A report in your console summarizing the results and listing any sequences that were removed by the initial Quality Control filter and the reason for their removal.
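
For downstream work, the CSV report can be summarized with the Python standard library alone. The helper below is a hypothetical example: the column names follow the table above, and the file path is whatever --output_file you chose.

```python
import csv
from collections import Counter

def summarize_results(path):
    """Count how many sequences fell into each Status category."""
    with open(path, newline="") as fh:
        return Counter(row["Status"] for row in csv.DictReader(fh))
```

For example, summarize_results("results/my_sample_results.csv") returns a Counter mapping each Status value to its sequence count.
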
