merPCR - Modern Electronic PCR Implementation

Python 3.8+ | License: GPL v3

100% Compatible Python reimplementation of me-PCR

📖 Documentation | 🚀 Quick Start | ✅ Verification | 🔄 Migration Guide


Overview

merPCR locates Sequence-Tagged Sites (STS) within genomic sequences using computational PCR. It's a drop-in replacement for the original me-PCR (Multithreaded Electronic PCR), producing identical results while offering modern Python architecture, better error messages, and comprehensive documentation.

Key Highlights:

  • ✅ 100% Compatible - Verified byte-for-byte identical output to me-PCR v1.0.6
  • 🚀 Drop-In Replacement - Same command-line interface, no changes needed
  • 🐍 Python API - Use programmatically in your Python workflows
  • 📚 Well Documented - Extensive guides, examples, and API reference
  • 🧪 Thoroughly Tested - 277 tests, 94% coverage of engine.py, validated on real genomic data

Compatibility & Validation

merPCR has been extensively validated against me-PCR:

  • Real Genomic Data: Tested on 42 Francisellaceae genomes (90MB)
  • Compatibility Tests: 15/15 passed (100%)
  • Output Identity: MD5 checksums match exactly
  • Critical Fixes: Three algorithmic differences identified and fixed
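The output-identity check can be reproduced with standard-library Python. This is a minimal sketch, not the project's own verification script; `md5_of_file` and `outputs_identical` are hypothetical helper names, and the paths you pass in are whatever output files you produced with each tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(path_a, path_b):
    """True when both files have the same MD5 checksum."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Calling `outputs_identical("mepcr_out.txt", "merpcr_out.txt")` on two result files (placeholder names) returns `True` only when their contents match byte for byte, up to MD5 collisions.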

Three critical compatibility fixes were implemented:

  1. Hash computation - Backward vs forward search
  2. PCR range margins - Proper margin adjustment for size ranges
  3. Forward strand matching - Reverse complement handling for primer2
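As background for the third item: when both primers are written 5'→3', the second primer anneals to the opposite strand, so a forward-strand search looks for its reverse complement downstream of primer1. The sketch below illustrates that idea with exact matching only. It is not merPCR's engine code, and `reverse_complement` and `find_forward_amplicon` are hypothetical names.

```python
# Translation table for the four unambiguous bases (IUPAC codes are ignored here).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_forward_amplicon(sequence, primer1, primer2):
    """Locate primer1, then the reverse complement of primer2 downstream of it.

    Returns (start, end) as 0-based, end-exclusive offsets, or None if either
    primer site is missing. Exact matches only; no margin or mismatch logic.
    """
    start = sequence.find(primer1)
    if start < 0:
        return None
    site2 = reverse_complement(primer2)
    pos2 = sequence.find(site2, start + len(primer1))
    if pos2 < 0:
        return None
    return start, pos2 + len(site2)
```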

πŸ“„ Full verification details: docs/VERIFICATION.md


Quick Start

Installation

# From PyPI (when available)
pip install merpcr

# From source
git clone https://github.com/FOI-Bioinformatics/merpcr.git
cd merpcr
pip install -e .

Basic Usage

# Use merPCR exactly like me-PCR
merpcr primers.sts genome.fa

# With parameters (both formats supported)
merpcr primers.sts genome.fa -M 50 -N 1 -O results.txt
merpcr primers.sts genome.fa M=50 N=1 O=results.txt  # Legacy format works too!

# Multiple parameters, with threads and debug logging
merpcr primers.sts genome.fa -M 50 -N 1 -T 4 --debug
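Both invocation styles carry the same information. As an illustration only (this is not merPCR's actual argument parser, and `normalize_legacy_args` is a hypothetical helper), legacy `KEY=value` tokens could be rewritten into flag form before ordinary option parsing:

```python
import re

# A single letter, an equals sign, then the value: e.g. "M=50" or "O=results.txt".
_LEGACY_TOKEN = re.compile(r"^([A-Za-z])=(.*)$")


def normalize_legacy_args(argv):
    """Rewrite me-PCR style KEY=value tokens as '-KEY value' pairs.

    Tokens that do not look like KEY=value are passed through unchanged.
    """
    normalized = []
    for token in argv:
        match = _LEGACY_TOKEN.match(token)
        if match:
            normalized.extend(["-" + match.group(1).upper(), match.group(2)])
        else:
            normalized.append(token)
    return normalized
```

With this sketch, `["primers.sts", "genome.fa", "M=50", "N=1"]` becomes `["primers.sts", "genome.fa", "-M", "50", "-N", "1"]`, which a standard option parser can then handle.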

Python API

from merpcr import MerPCR

# Initialize with parameters
engine = MerPCR(wordsize=11, margin=50, mismatches=1, threads=4)

# Load data and search
engine.load_sts_file("primers.sts")
records = engine.load_fasta_file("genome.fa")
hit_count = engine.search(records, "results.txt")

print(f"Found {hit_count} hits")

📖 More examples: docs/EXAMPLES.md


Key Features

Computational Features

  • Multithreaded Processing - Automatic thread scaling for large files
  • Hash-Based Search - O(1) STS lookup with 2-bit encoding
  • IUPAC Support - Optional ambiguity code handling
  • Flexible Parameters - Configurable margins, mismatches, and word sizes
  • 3' Protection - Prevents mismatches in primer 3' regions
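To make the hash-based search concrete, here is a minimal sketch of 2-bit encoding: each base takes two bits, so an 11-base word fits in a 22-bit integer that can serve as a dictionary key. The function names are hypothetical, and hashing the first `wordsize` bases of each primer is an assumption made for illustration; the README does not say which part of the primer merPCR hashes.

```python
from collections import defaultdict

# Two bits per base: A=00, C=01, G=10, T=11.
_BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_word(word):
    """Pack an unambiguous DNA word into an integer, 2 bits per base.

    Returns None if the word contains anything other than A, C, G, or T.
    """
    value = 0
    for base in word.upper():
        code = _BASE_CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_index(primers, wordsize=11):
    """Map each encoded word to the STS IDs whose primer starts with it.

    `primers` is a dict of {sts_id: primer_sequence}.
    """
    index = defaultdict(list)
    for sts_id, primer in primers.items():
        if len(primer) < wordsize:
            continue  # too short to hash at this word size
        key = encode_word(primer[:wordsize])
        if key is not None:
            index[key].append(sts_id)
    return index
```

Looking up a genomic word is then a single dictionary access, `index.get(encode_word(word), [])`, which is where the constant-time lookup comes from.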

Software Features

  • Modern Architecture - Type-safe Python with comprehensive error handling
  • Better Diagnostics - Clear error messages with context
  • Debug Mode - Detailed logging for troubleshooting
  • Extensive Testing - 277 tests covering edge cases and real data
  • CI/CD Pipeline - Automated testing on multiple platforms

Documentation

For Users

For Developers

Quick Reference

Input Format (STS file):

STS_ID	Forward_Primer	Reverse_Primer	PCR_Size	[Optional_Alias]
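A line in this format can be read with a few lines of Python. This is a sketch rather than merPCR's own parser: `STSRecord` and `parse_sts_line` are hypothetical names, and accepting a `low-high` size range in the PCR_Size column is an assumption based on the mention of size ranges earlier in this README.

```python
from typing import NamedTuple, Optional, Tuple


class STSRecord(NamedTuple):
    sts_id: str
    primer1: str
    primer2: str
    size_range: Tuple[int, int]  # (min, max); equal when a single size is given
    alias: Optional[str]


def parse_sts_line(line):
    """Parse one tab-separated STS line into an STSRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated fields, got %d" % len(fields))
    sts_id, primer1, primer2, size = fields[:4]
    # Assumed syntax: either "150" or a range such as "100-200".
    low, _, high = size.partition("-")
    size_range = (int(low), int(high or low))
    alias = fields[4] if len(fields) > 4 and fields[4] else None
    return STSRecord(sts_id, primer1.upper(), primer2.upper(), size_range, alias)
```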

Output Format:

Sequence_ID	pos1..pos2	STS_ID	Alias	(+/-)
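For downstream processing, a result line in this layout can be split back into fields. The sketch below assumes the five columns are tab-separated, that positions are integers joined by `..`, and that the strand column is written as `(+)` or `(-)`; `parse_hit_line` is a hypothetical helper, not part of the merPCR API.

```python
def parse_hit_line(line):
    """Split one output line into a dict of typed fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    sequence_id, span, sts_id, alias, strand = fields
    start, end = (int(part) for part in span.split(".."))
    return {
        "sequence_id": sequence_id,
        "start": start,
        "end": end,
        "sts_id": sts_id,
        "alias": alias,
        "strand": strand.strip("()"),  # "(+)" -> "+"
    }
```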

Common Parameters:

  • -M, --margin - Search margin in bp (default: 50)
  • -N, --mismatches - Allowed mismatches (default: 0)
  • -W, --wordsize - Hash word size (default: 11)
  • -T, --threads - Number of threads (default: 1)
  • -O, --output - Output file (default: stdout)
  • --debug - Enable debug logging
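The margin and mismatch parameters can be pictured as two simple acceptance tests. The sketch below is an interpretation, not merPCR's implementation: it assumes the margin is the allowed deviation of the observed product size from the expected PCR size, and it models 3' protection with a hypothetical `protected_3prime` count of bases that must match exactly.

```python
def size_within_margin(observed, expected, margin=50):
    """True when the observed product size is within +/- margin of expected."""
    return abs(observed - expected) <= margin


def primer_matches(primer, target, max_mismatches=0, protected_3prime=0):
    """Compare a primer to an equally long target site.

    The last `protected_3prime` bases must match exactly; elsewhere up to
    `max_mismatches` substitutions are tolerated. No IUPAC handling.
    """
    if len(primer) != len(target):
        return False
    primer, target = primer.upper(), target.upper()
    if protected_3prime and primer[-protected_3prime:] != target[-protected_3prime:]:
        return False
    mismatches = sum(a != b for a, b in zip(primer, target))
    return mismatches <= max_mismatches
```

For example, with an expected size of 150 and the default margin of 50, products from 100 to 200 bp pass the size test.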

📖 Full parameter list: docs/USER_GUIDE.md#parameters


Testing

# Run all tests
make test

# Run specific test categories
make test-unit          # Unit tests only
make test-integration   # Integration tests
make test-performance   # Performance benchmarks

# Generate coverage report
make coverage

# Run compatibility tests
python test_compatibility.py

Current Status:

  • 277 tests (all passing)
  • 94% code coverage for engine.py (critical component)
  • Real genomic data validation on 42 genomes

Performance

merPCR now matches or exceeds me-PCR performance with Cython optimization:

| Dataset Size | me-PCR | merPCR (Pure Python) | merPCR (Cython) | Speedup |
|---|---|---|---|---|
| Small (<2MB) | ~0.5s | ~0.5s | ~0.4s | 2.1x |
| Medium (2-4MB) | ~0.8s | ~0.8s | ~0.3s | 2.6-2.9x |
| Large (>4MB) | Scales linearly | Scales linearly | Scales linearly (2.9x faster) | 2.9x+ |

Real Genomic Data (Francisellaceae genomes):

  • F. tularensis (1.8 MB): 0.24s with Cython vs 0.51s pure Python (2.1x speedup)
  • C. litorale (3.1 MB): 0.30s with Cython vs 0.86s pure Python (2.8x speedup)
  • F. hongkongensis (2.8 MB): 0.27s with Cython vs 0.80s pure Python (2.9x speedup)

Average speedup: 2.65x faster than pure Python!
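The average can be rechecked from the three timings listed above. The short calculation below uses only those numbers; it suggests the per-genome figures are the ratios truncated (not rounded) to one decimal place, and that the average is the mean of the untruncated ratios.

```python
import math

# (pure Python seconds, Cython seconds), taken from the list above.
timings = {
    "F. tularensis": (0.51, 0.24),
    "C. litorale": (0.86, 0.30),
    "F. hongkongensis": (0.80, 0.27),
}

# Speedup is pure-Python time divided by Cython time.
speedups = {name: pure / cython for name, (pure, cython) in timings.items()}

# Truncate to one decimal place, matching the per-genome figures quoted above.
truncated = {name: math.floor(ratio * 10) / 10 for name, ratio in speedups.items()}

mean_speedup = sum(speedups.values()) / len(speedups)

for name in timings:
    print(f"{name}: {truncated[name]}x")
print(f"Average: {mean_speedup:.2f}x")
```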

🚀 Performance Features:

  • Automatic Cython optimization (if available)
  • Seamless fallback to pure Python
  • Multithreading support for large files
  • NumPy-accelerated lookup tables

📊 Full performance details: docs/PERFORMANCE.md


Project Structure

merpcr/
├── src/merpcr/          # Main package
│   ├── cli.py           # Command-line interface
│   ├── core/            # Core functionality
│   │   ├── engine.py    # Search engine
│   │   ├── models.py    # Data models
│   │   └── utils.py     # Utilities
│   └── io/              # Input/output
│       ├── fasta.py     # FASTA handling
│       └── sts.py       # STS handling
├── tests/               # Comprehensive test suite (277 tests)
├── docs/                # Documentation
│   ├── USER_GUIDE.md    # Usage documentation
│   ├── API.md           # API reference
│   ├── EXAMPLES.md      # Practical examples
│   ├── VERIFICATION.md  # Compatibility verification
│   ├── MIGRATION.md     # Migration from me-PCR
│   └── CI_CD.md         # CI/CD documentation
├── pyproject.toml       # Modern Python packaging
└── Makefile             # Development commands

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Ensure all tests pass (make test)
  5. Format code (make format)
  6. Submit a pull request

Development Setup:

git clone https://github.com/FOI-Bioinformatics/merpcr.git
cd merpcr
make dev-install  # Install with development dependencies
make test         # Verify installation

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.


Acknowledgments

merPCR builds upon the pioneering work of:

  • Gregory D. Schuler (NCBI) - Original e-PCR algorithm development
  • Kevin Murphy (Children's Hospital of Philadelphia) - me-PCR multithreading enhancements

References

  1. Schuler, G.D. (1997) "Sequence mapping by electronic PCR." Genome Research 7: 541-550. doi:10.1101/gr.7.5.541

  2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool." Journal of Molecular Biology 215: 403-410. doi:10.1016/S0022-2836(05)80360-2
