This project integrates BLAST sequence alignment with a custom probability-based scoring system to identify the most likely matches between a query DNA sequence and a database sequence, using positional nucleotide probabilities.
- Runs BLAST (blastn) to find approximate matches between a query and a database sequence.
- Extends each BLAST match to align fully with the query sequence.
- Calculates log-probabilities of each extended match using a per-base probability model.
- Ranks matches based on their total log-probability.
- Optionally applies a heuristic filter to reject matches with too many low-probability bases in a row.
- Outputs a CSV file with all ranked matches and highlights the most probable match.
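
The extension, scoring, and ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `blast.py`. The function names (`extend_hit`, `log_probability`, `rank_matches`) are hypothetical. The extension assumes an ungapped alignment on the plus strand with 1-based inclusive BLAST coordinates. The score is assumed to be the sum of the natural logs of the per-base probabilities over the extended window.

```python
import math

def extend_hit(qstart, qend, sstart, send, query_len, db_len):
    """Stretch a BLAST hit so it spans the whole query (ungapped assumption).

    Coordinates are 1-based and inclusive, as in BLAST tabular output.
    Returns (start, end) on the database, clamped to its bounds.
    """
    start = sstart - (qstart - 1)        # query bases that precede the hit
    end = send + (query_len - qend)      # query bases that follow the hit
    return max(1, start), min(db_len, end)

def log_probability(probs, start, end):
    """Sum of log-probabilities for database positions start..end (1-based)."""
    window = probs[start - 1:end]
    if any(p <= 0.0 for p in window):
        return float("-inf")             # log(0) is undefined; treat as impossible
    return sum(math.log(p) for p in window)

def rank_matches(hits, probs, query_len):
    """Return [(start, end, log_prob), ...] sorted from most to least probable."""
    scored = []
    for qstart, qend, sstart, send in hits:
        start, end = extend_hit(qstart, qend, sstart, send, query_len, len(probs))
        scored.append((start, end, log_probability(probs, start, end)))
    return sorted(scored, key=lambda m: m[2], reverse=True)
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow on long matches, and the ordering of the matches is the same either way.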
- `database.fasta`: The reference DNA sequence
- `query_sequence.fasta`: The query DNA sequence
- `database_probs.txt`: A list of probabilities (one per base in the database)
- `blast_results.tsv`: Raw BLAST tabular results
- `blast_results.csv`: Formatted BLAST results with headers
- `final_result.csv`: Ranked matches with log-probabilities and positional info
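
As an illustration of how `blast_results.tsv` might become `blast_results.csv`, the sketch below adds a header row to BLAST tabular output. It assumes the TSV was produced with `-outfmt 6` and its default twelve columns. If `blast.py` requests a different column set, the header list would differ. The function name `tsv_to_csv` is hypothetical.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def tsv_to_csv(tsv_path, csv_path, columns=BLAST_COLUMNS):
    """Copy tab-separated BLAST hits into a CSV file with a header row.

    Returns the number of hit rows written (the header is not counted).
    """
    rows = 0
    with open(tsv_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        writer.writerow(columns)
        for row in reader:
            if not row or row[0].startswith("#"):   # skip blank and comment lines
                continue
            writer.writerow(row)
            rows += 1
    return rows
```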
The max match is (132, 163)
The found string is ATCGGATCCTGACGTACGTAGTCGATCGTAGCTAGTCAGT
How to run the algorithm?
1) Install the BLAST+ suite
To install `makeblastdb` (part of the BLAST+ suite) on macOS, you can use any of the following methods:
1. **Using Homebrew (Recommended)**
2. **Using Conda (Bioconda)**
3. **Manual Installation from NCBI**
Only Method 1 (Homebrew) is covered step by step below.
---
## **Method 1: Installing BLAST+ Using Homebrew (Recommended)**
**Homebrew** is a popular package manager for macOS that simplifies the installation of software. If you don’t have Homebrew installed, you can install it first.
### **Step 1: Install Homebrew (If Not Already Installed)**
1. **Open Terminal**:
- You can find Terminal via Spotlight Search (`Cmd + Space`) by typing "Terminal".
2. **Install Homebrew**:
- Paste the following command into Terminal and press `Enter`:
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- **Follow the On-Screen Instructions**:
- You may be prompted to enter your password.
- Press `Return` to continue when prompted.
3. **Verify Homebrew Installation**:
```bash
brew --version
```
- You should see output like `Homebrew 4.x.x` indicating the version installed.
### **Step 2: Install BLAST+ via Homebrew**
1. **Update Homebrew**:
- It’s good practice to ensure Homebrew is up-to-date.
```bash
brew update
```
2. **Install BLAST+**:
```bash
brew install blast
```
- This command installs the BLAST+ suite, including `makeblastdb`, `blastn`, `blastp`, and other BLAST tools.
3. **Verify Installation**:
- Check if `makeblastdb` is accessible:
```bash
makeblastdb -version
```
- **Expected Output**:
- You should see version information, e.g., `makeblastdb: 2.14.0+`
4. **Ensure BLAST+ is in Your PATH**:
- Homebrew typically handles this automatically. To confirm:
```bash
which makeblastdb
```
- **Expected Output**:
- A path like `/usr/local/bin/makeblastdb` (Intel Macs) or `/opt/homebrew/bin/makeblastdb` (Apple silicon Macs).
2) Note: The database files in the database folder were created using `makeblastdb -in database.fasta -dbtype nucl -out my_database`. BLAST needs these files to search the database, and you do not need to modify them.
3) Generate a random sequence from the database using `random_generator.py`. The generated sequence is automatically written to `query_sequence.fasta`. You can manually introduce substitutions or indels into the sequence, or pick your own sequence and write it to `query_sequence.fasta`.
4) In `blast.py`, you can turn the `HEURISTIC` on or off. When it is on, the heuristic discards a match if `HEURISTIC_NUM` consecutive nucleotides each have a probability of `HEURISTIC_PROB` or below.
5) Run `blast.py`. The program should report where the query sequence is located in the database by printing the (start_index, end_index) of the match with the highest probability.
6) `final_result.csv` contains the matches found by the algorithm, ranked from highest to lowest probability. The first match is the most probable match.
7) `blast_results.csv` contains the normal BLAST algorithm matches generated by the BLAST library.
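
The heuristic from step 4 can be sketched as a simple run-length check. This is an assumption about its behavior based on the description above, not the code in `blast.py`. The constant values shown and the function name `passes_heuristic` are illustrative only.

```python
HEURISTIC = True        # toggle the filter on or off
HEURISTIC_NUM = 3       # illustrative value: length of the low-probability run
HEURISTIC_PROB = 0.2    # illustrative value: threshold for a low-probability base

def passes_heuristic(window_probs, num=HEURISTIC_NUM, prob=HEURISTIC_PROB):
    """Return False if `num` consecutive bases have probability <= `prob`."""
    run = 0
    for p in window_probs:
        run = run + 1 if p <= prob else 0   # extend or reset the current run
        if run >= num:
            return False
    return True
```

With these illustrative settings, a window such as `[0.9, 0.1, 0.2, 0.05, 0.9]` would be rejected because three bases in a row sit at or below 0.2, while `[0.9, 0.1, 0.1, 0.9, 0.1]` would be kept.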