gl0228/probabilistic-blast-matcher


Probabilistic Sequence Matcher with BLAST + Heuristic Filtering

This project integrates BLAST sequence alignment with a custom probability-based scoring system to identify the most likely matches between a query DNA sequence and a database sequence, using positional nucleotide probabilities.

What It Does

  1. Runs BLAST (blastn) to find approximate matches between a query and a database sequence.
  2. Extends each BLAST match to align fully with the query sequence.
  3. Calculates log-probabilities of each extended match using a per-base probability model.
  4. Ranks matches based on their total log-probability.
  5. Optionally applies a heuristic filter to reject matches with too many low-probability bases in a row.
  6. Outputs a CSV file with all ranked matches and highlights the most probable match.
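Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration and not the code in `blast.py`: the function names, the gap-free extension, and the scoring rule (a matching base contributes `log(p)`, a mismatching base contributes `log((1 - p) / 3)`) are all assumptions made for the example.

```python
import math

def extend_hit(qstart, sstart, query_len, db_len):
    """Stretch a local hit so it spans the whole query.

    qstart and sstart are 1-based BLAST coordinates on the plus strand.
    Returns a 0-based, end-exclusive (start, end) window in the database,
    or None if the window would run past either end of the database.
    Assumption: the extension is gap-free (indels are ignored).
    """
    start = (sstart - 1) - (qstart - 1)
    end = start + query_len
    if start < 0 or end > db_len:
        return None
    return start, end

def log_probability(query, database, probs, start):
    """Sum per-base log-probabilities for the query placed at `start`.

    Assumed model: probs[i] is the probability that database base i is
    correct; a mismatch shares the remaining mass among three other bases.
    """
    total = 0.0
    for offset, base in enumerate(query):
        # Clamp to avoid log(0) when a probability is exactly 0 or 1.
        p = min(max(probs[start + offset], 1e-12), 1 - 1e-12)
        if database[start + offset] == base:
            total += math.log(p)
        else:
            total += math.log((1 - p) / 3)
    return total

def rank_hits(query, database, probs, hits):
    """Return (log_probability, window) pairs, most probable first.

    `hits` is an iterable of (qstart, sstart) pairs taken from BLAST.
    """
    scored = set()
    for qstart, sstart in hits:
        window = extend_hit(qstart, sstart, len(query), len(database))
        if window is not None:
            score = log_probability(query, database, probs, window[0])
            scored.add((score, window))
    return sorted(scored, reverse=True)
```

Working in log space keeps the product of many small probabilities from underflowing, and the ranking is unchanged because the logarithm is monotonic.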

Input Files

  • database.fasta: The reference DNA sequence
  • query_sequence.fasta: The query DNA sequence
  • database_probs.txt: A list of probabilities (one per base in the database)
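Since the probability file must line up with the database base by base, it can help to validate the inputs before running anything. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that each FASTA file holds a single record and that `database_probs.txt` contains whitespace-separated numbers.

```python
from pathlib import Path

def parse_fasta(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    bases = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if bases:
                break  # a second record starts here; ignore the rest
            continue
        bases.append(line.upper())
    return "".join(bases)

def parse_probs(text):
    """Parse whitespace-separated probabilities and check their range."""
    probs = [float(token) for token in text.split()]
    for index, p in enumerate(probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability {p} at position {index} is outside [0, 1]")
    return probs

def check_inputs(database, probs):
    """Raise ValueError unless there is exactly one probability per base."""
    if len(database) != len(probs):
        raise ValueError(
            f"{len(database)} bases but {len(probs)} probabilities"
        )

def load_inputs(db_path="database.fasta", probs_path="database_probs.txt"):
    """Read both files from disk and return (database, probs)."""
    database = parse_fasta(Path(db_path).read_text())
    probs = parse_probs(Path(probs_path).read_text())
    check_inputs(database, probs)
    return database, probs
```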

Output Files

  • blast_results.tsv: Raw BLAST tabular results
  • blast_results.csv: Formatted BLAST results with headers
  • final_result.csv: Ranked matches with log-probabilities and positional info
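The relationship between the first two files can be sketched as below. This assumes that the raw results use BLAST's default tabular format (`-outfmt 6`), whose twelve default columns are listed in `BLAST_COLUMNS`. The helper names are hypothetical, and `blast.py` may invoke BLAST and name its columns differently.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def blastn_command(query_path, db_name, out_path):
    """Build a blastn argument list that writes tabular results to out_path."""
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-outfmt", "6",
        "-out", out_path,
    ]

def tsv_to_csv(tsv_text):
    """Convert headerless BLAST TSV text into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(BLAST_COLUMNS)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if row:  # skip blank lines
            writer.writerow(row)
    return out.getvalue()
```

The argument list from `blastn_command("query_sequence.fasta", "my_database", "blast_results.tsv")` could be passed to `subprocess.run(..., check=True)`, after which the contents of `blast_results.tsv` can be fed to `tsv_to_csv`.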

Example Output

The max match is  (132, 163)
The found string is  ATCGGATCCTGACGTACGTAGTCGATCGTAGCTAGTCAGT




How to run the algorithm?
1) Install BLAST+

    `makeblastdb` and `blastn` are part of the NCBI BLAST+ suite. On macOS, the most common ways to install it are:

    1. **Using Homebrew (Recommended)**
    2. **Using Conda (Bioconda)**
    3. **Manual Installation from NCBI**

    Only the Homebrew method is walked through step by step below.

    ---

    ## **Method 1: Installing BLAST+ Using Homebrew (Recommended)**

    **Homebrew** is a popular package manager for macOS that simplifies the installation of software. If you don’t have Homebrew installed, you can install it first.

    ### **Step 1: Install Homebrew (If Not Already Installed)**

    1. **Open Terminal**:
    - You can find Terminal via Spotlight Search (`Cmd + Space`) by typing "Terminal".

    2. **Install Homebrew**:
    - Paste the following command into Terminal and press `Enter`:
        ```bash
        /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
        ```
    - **Follow the On-Screen Instructions**:
        - You may be prompted to enter your password.
        - Press `Return` to continue when prompted.

    3. **Verify Homebrew Installation**:
    ```bash
    brew --version
    ```
    - You should see output like `Homebrew 4.x.x` indicating the version installed.

    ### **Step 2: Install BLAST+ via Homebrew**

    1. **Update Homebrew**:
    - It’s good practice to ensure Homebrew is up-to-date.
        ```bash
        brew update
        ```

    2. **Install BLAST+**:
    ```bash
    brew install blast
    ```
    - This command installs the BLAST+ suite, including `makeblastdb`, `blastn`, `blastp`, and other BLAST tools.

    3. **Verify Installation**:
    - Check if `makeblastdb` is accessible:
        ```bash
        makeblastdb -version
        ```
    - **Expected Output**:
        - You should see version information, e.g., `makeblastdb: 2.14.0+`

    4. **Ensure BLAST+ is in Your PATH**:
    - Homebrew typically handles this automatically. To confirm:
        ```bash
        which makeblastdb
        ```
    - **Expected Output**:
        - A path like `/usr/local/bin/makeblastdb` or `/opt/homebrew/bin/makeblastdb` depending on your Mac architecture.
  
2) Note: The database files in the database folder were created using `makeblastdb -in database.fasta -dbtype nucl -out my_database`. These files are needed for the BLAST library and there's no need to modify them.
3) Generate a random sequence from the database using `random_generator.py`. The generated sequence is automatically written into `query_sequence.fasta`. You can then manually introduce substitutions or indels into the sequence, or pick your own sequence and write it into `query_sequence.fasta`.
4) In `blast.py`, you can turn the `HEURISTIC` on or off. When it is on, the heuristic discards a match if `HEURISTIC_NUM` consecutive nucleotides each have a probability of `HEURISTIC_PROB` or below.
5) Run `blast.py`. The program should identify where the query sequence is located in the database and print the (start_index, end_index) of the match with the highest probability.
6) `final_result.csv` contains the matches found by the algorithm, ranked from highest to lowest probability. The first match is the most probable match.
7) `blast_results.csv` contains the plain BLAST matches as reported by the BLAST library, before any probability-based ranking.
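The heuristic described in step 4 can be sketched as a run-length check over the per-base probabilities inside a match window. The constant values below are placeholders rather than the defaults in `blast.py`, the function names are hypothetical, and match windows are assumed to be 0-based and end-exclusive.

```python
HEURISTIC = True       # switch the filter on or off
HEURISTIC_NUM = 3      # placeholder: length of the low-probability run
HEURISTIC_PROB = 0.2   # placeholder: a base at or below this counts as low

def passes_heuristic(window_probs, run_length=HEURISTIC_NUM, threshold=HEURISTIC_PROB):
    """Return False if `run_length` consecutive probabilities are <= threshold."""
    run = 0
    for p in window_probs:
        # Extend the current run of low-probability bases, or reset it.
        run = run + 1 if p <= threshold else 0
        if run >= run_length:
            return False
    return True

def filter_matches(matches, probs):
    """Keep only the (start, end) windows that survive the heuristic."""
    if not HEURISTIC:
        return list(matches)
    return [(start, end) for start, end in matches
            if passes_heuristic(probs[start:end])]
```

A single pass with a running counter is enough here, so the filter costs time linear in the window length.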

About

Rank and filter BLAST DNA sequence matches using probabilities and a heuristic scoring model.
