An NLP-based selector designed to search for specific sequences, markers, or genes among DNA/RNA long reads.

ZAEDPolSl/noMapper

                  888b     d888                                             
                  8888b   d8888                                             
                  88888b.d88888                                             
88888b.   .d88b.  888Y88888P888  8888b.  88888b.  88888b.   .d88b.  888d888 
888 "88b d88""88b 888 Y888P 888     "88b 888 "88b 888 "88b d8P  Y8b 888P"   
888  888 888  888 888  Y8P  888 .d888888 888  888 888  888 88888888 888     
888  888 Y88..88P 888   "   888 888  888 888 d88P 888 d88P Y8b.     888     
888  888  "Y88P"  888       888 "Y888888 88888P"  88888P"   "Y8888  888     
                                         888      888                       
                                         888      888                       
                                         888      888                       

Overview

NoMapper is an NLP-based selector that searches for specific sequences, markers, or genes among DNA/RNA long reads. Unlike traditional methods, NoMapper does not rely on alignment algorithms such as Needleman-Wunsch or Smith-Waterman. Instead, the whole system is built on the "noMapping mapping" approach, which makes it possible to build an efficient sequence selector.
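The `--kmer_size` option used in the training step below suggests that reads are split into short overlapping k-mers, which can then be treated like words in a text. The snippet is a minimal sketch of that idea only. It assumes overlapping k-mers with a step of one base, and the `kmers` function is a hypothetical helper, not part of NoMapper.

```shell
#!/bin/sh
# Hypothetical helper: print every overlapping k-mer of a sequence, one per line.
# Assumption: a sliding window with step 1; NoMapper's real preprocessing may differ.
kmers() (
  seq=$1
  k=${2:-4}
  len=${#seq}
  i=1
  while [ $((i + k - 1)) -le "$len" ]; do
    # cut -c takes 1-based, inclusive character positions
    printf '%s\n' "$seq" | cut -c "$i-$((i + k - 1))"
    i=$((i + 1))
  done
)

# Prints ACGT, CGTA and GTAC on separate lines
kmers ACGTAC 4
```

A sequence shorter than `k` produces no output, since no full window fits.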

Getting started

To use NoMapper, follow the instructions in the "Use noMapper" section. The tool requires a model and an encoder trained for the specific sequence, marker, or gene you want to find. For instance, to search for the FDXR gene, you can download the example NoMapper files model.h5 and cv.pkl (link).

To search for a sequence other than FDXR, go to the "Prepare noMapper" section, which explains how to train NoMapper for your own target. This lets you adapt NoMapper to a wide range of genetic markers.

Use noMapper

  1. Go to the relevant directory
    cd docker/nomapper/
  2. Place the following files in the vol/ directory
    • model.h5 - the trained model
    • cv.pkl - the encoder
  3. Edit the configuration file vol/config.ini (only needed if a custom configuration was used when training the model in "Prepare noMapper")
  4. Pull the latest stable image
    docker pull drdext3r/nomapper
    or build it locally
    docker build . -t drdext3r/nomapper:latest
  5. Run the docker container
    docker run -it -v $(pwd)/vol:/vol -p 8000:8000 --name nomapper drdext3r/nomapper:latest
  6. Send a prediction request from a new terminal window
    curl -X POST "http://127.0.0.1:8000/predict/" -H "accept: application/json" -H "Content-Type: application/json" -d '{"seq": "<long-read>"}'
    Possible outputs:
    {"result":"found"}
    {"result":"not found"}
  7. Exit the docker container
    exit
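When classifying many reads, the request from step 6 can be wrapped in a few shell functions. This is a sketch under assumptions: the function names and the `NOMAPPER_URL` variable are hypothetical, the reads are assumed to contain only plain letters (no JSON escaping is done), and the response is matched against the exact output strings shown above.

```shell
#!/bin/sh
# Endpoint from step 6; override NOMAPPER_URL if the port mapping differs.
NOMAPPER_URL=${NOMAPPER_URL:-http://127.0.0.1:8000/predict/}

# Build the JSON request body for one read (hypothetical helper).
build_payload() {
  printf '{"seq": "%s"}' "$1"
}

# POST one read to the running container and print the raw JSON response.
predict_read() {
  curl -s -X POST "$NOMAPPER_URL" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}

# Succeed only for {"result":"found"}; {"result":"not found"} does not match.
is_found() {
  case $1 in
    *'"result":"found"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (requires the container from step 5 to be running):
#   if is_found "$(predict_read "$read")"; then echo "hit"; fi
```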

Prepare noMapper

  1. Go to the relevant directory
    cd docker/nomapper-maker/
  2. Place the following files in the vol/ directory:
    • input.fastq.gz - sequences for training the model
    • ref.fna - a reference, i.e. the sequence you want to search for (an example)
  3. Pull the latest stable image
    docker pull drdext3r/nomapper-maker
    or build it locally
    docker build . -t drdext3r/nomapper-maker:latest
  4. Run the docker container
    docker run -it -v $(pwd)/vol:/vol --name nomapper-maker drdext3r/nomapper-maker:latest
  5. Train the model
    with the default configuration
    do-all
    or with a custom configuration
    preprocess
    python3 train.py --help     # show configuration
    python3 train.py --kmer_size=4
    chmod 777 /vol/model.h5
  6. Remove unnecessary files (optional)
    clean-all
  7. Exit the docker container
    exit
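Because the container reads its training data from the mounted vol/ directory, it can help to confirm that both input files from step 2 are present before starting it. The `check_inputs` function below is a hypothetical pre-flight check, not part of NoMapper. It only tests that the files exist and are non-empty.

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify the training inputs from step 2.
# Usage: check_inputs [dir]   (dir defaults to ./vol)
check_inputs() (
  dir=${1:-vol}
  status=0
  for f in input.fastq.gz ref.fna; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      status=1
    fi
  done
  # The function body runs in a subshell, so exit only sets its return status.
  exit "$status"
)

# Example, run from docker/nomapper-maker/:
#   check_inputs vol && echo "inputs look ready"
```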

Outcome

  • model.h5 - the trained model
  • cv.pkl - the encoder
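These two files are the ones that "Use noMapper" expects in its own vol/ directory. A small sketch of that hand-over, run from the repository root, might look as follows. The `install_model` function is hypothetical, and the default paths are assumed from the `cd` commands in the two sections above.

```shell
#!/bin/sh
# Hypothetical helper: copy the trained artifacts to the serving directory.
# Usage: install_model [src_dir] [dst_dir]
install_model() (
  src=${1:-docker/nomapper-maker/vol}
  dst=${2:-docker/nomapper/vol}
  # Refuse to copy anything unless both artifacts exist and are non-empty.
  for f in model.h5 cv.pkl; do
    if [ ! -s "$src/$f" ]; then
      echo "missing or empty: $src/$f" >&2
      exit 1
    fi
  done
  mkdir -p "$dst" && cp "$src/model.h5" "$src/cv.pkl" "$dst/"
)

# Example, from the repository root:
#   install_model
```

If training used a custom configuration, vol/config.ini still needs to be set separately, as described in step 3 of "Use noMapper".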
