MDSearch is a Python tool designed to identify minimal sets of Single Nucleotide Polymorphisms (SNPs) that can discriminate between samples in a VCF (Variant Call Format) file based on a specified minimum Hamming distance. The tool is particularly useful for applications such as genetic barcoding, sample identification, and quality control in genomic studies.
- Robust genotype handling: Correctly processes missing genotypes and polyploid SNPs
- Customizable discrimination: Set minimum Hamming distance between samples
- Multiple SNP sets: Generate several distinct (complitely disjoint) discriminatory SNP sets
- Advanced scoring: Customizable entropy and Minor Allele Frequency (MAF) weights
- Comprehensive output: Human-readable SNP profiles, statistics, and VCF files
- Memory monitoring: Dynamic memory usage tracking with system-aware thresholds
The codebase implements several sophisticated algorithms:
- Two-phase optimization: Forward selection followed by backward elimination for minimal SNP sets
- Vectorized Hamming distance calculation: Optimized pairwise distance computation using NumPy arrays
- Entropy-based SNP scoring: Combines Shannon entropy and MAF for optimal SNP selection
- Incremental distance updates: Efficient backward elimination without full set reconstruction
- Memory-aware processing: Dynamic memory monitoring with graceful degradation
- Deterministic SNP selection: Stable, reproducible results with tie-breaking by SNP ID
- pysam: High-performance VCF/BCF file reading and writing
- NumPy: Vectorized numerical operations for genotype matrices
- bcftools: VCF/BCF indexing and format support
- psutil: Cross-platform system and process monitoring
- pytest: Comprehensive testing framework with 29+ test cases
MDSearch/
├── src/ # Core application code
│ ├── core/ # Core algorithms and data structures
│ │ ├── vcf_parser.py # VCF file parsing and validation
│ │ ├── distance_calculator.py # Optimized Hamming distance computation
│ │ ├── genotype_utils.py # Genotype conversion and utility functions
│ │ └── snp_selector.py # SNP selection and optimization algorithms
│ ├── io/ # Input/output operations
│ │ ├── vcf_writer.py # VCF file output with format support
│ │ ├── summary_writer.py # Statistical summary generation
│ │ ├── run_info_writer.py # Execution information and metadata
│ │ ├── snp_profile_writer.py # Human-readable SNP profiles
│ │ └── structure_info_writer.py # Output directory documentation
│ └── utils/ # Utility functions
│ ├── memory_monitor.py # Memory usage monitoring and warnings
│ ├── logging_setup.py # Structured logging configuration
│ ├── indexing.py # VCF indexing utilities
│ └── validation.py # Input validation functions
├── tests/ # Comprehensive test suite
├── environment.yml # Conda environment specification
└── mdsearch.py # Main command-line interface
Key directories:
src/core/
: Contains the core algorithms for SNP selection, distance calculation, and genotype processingsrc/io/
: Handles all output generation including VCF files, summaries, and human-readable profilestests/
: Comprehensive test suite covering genotype handling, distance calculations, and CLI validation
git clone https://github.com/alermol/MDSearch.git
cd MDSearch
conda env create -f environment.yml
conda activate mdsearch
python mdsearch.py input.vcf results -md 2 -ns 3
python mdsearch.py input.vcf output_folder -pl 2 -md 2 -ns 3
This will:
- Assume that input VCF file
input.vcf
is diploid - Ensure that minimum Hamming distance of 2 between samples in all final discriminatory VCF sets
- Generate 3 distinct discriminatory SNP sets
- Create comprehensive output in the
output_folder
python mdsearch.py IVCF OUTPUT_FOLDER [options]
- IVCF: Input VCF file (bi-allelic, with SNP IDs)
- OUTPUT_FOLDER: Path to output folder (will be created if absent)
-pl, --ploidy PLOIDY
(default: 2) — VCF ploidy-md, --min-distance MIN_DIST
(default: 1) — Minimal Hamming distance between samples-ns, --num-sets N_SETS
(default: 0) — Number of distinct SNP sets in output (0 or 'unlimited' for all possible sets)-we, --weight-entropy WEIGHT_ENTROPY
(default: 0.5) — Weight for entropy component in SNP scoring (0.0 to 1.0)-wm, --weight-maf WEIGHT_MAF
(default: 0.5) — Weight for MAF component in SNP scoring (0.0 to 1.0)
-ch, --convert-het
— Convert heterozygous calls into NA
-I, --input-format {auto,v,z,u,b}
(default: auto) — Input format (VCF/VCF.gz/BCF uncompressed/BCF compressed)-O, --output-format {v,z,u,b}
(default: v) — Output format (VCF/VCF.gz/BCF uncompressed/BCF compressed)
-q, --quiet
— Suppress progress output-L, --log-level {DEBUG,INFO,WARNING,ERROR}
— Logging level-F, --log-format {text,json}
(default: text) — Logging format
-V, --version
— Show program version, commit hash, and Python version and exit-h, --help
— Show help message and exit
python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns 3
python mdsearch.py data.vcf results -pl 2 -md 2 -we 0.8 -wm 0.2
python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns 0
# or
python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns unlimited
python mdsearch.py data.vcf results -md 2 -ns 3 -L INFO
python mdsearch.py data.vcf results -F json -q
python mdsearch.py input.bcf results -I b -O b
The tool creates a well-organized output directory with comprehensive analysis results:
results/
├── mdss/ # SNP set VCF files
│ ├── minimal_set_1.vcf # First minimal discriminative set
│ ├── minimal_set_2.vcf # Alternative minimal set
│ └── minimal_set_3.vcf # Third alternative set
├── snp_profiles/ # Human-readable SNP profiles
│ ├── set_1_profile.txt # Profile for set 1
│ ├── set_2_profile.txt # Profile for set 2
│ └── set_3_profile.txt # Profile for set 3
├── summary.tsv # Comprehensive statistics for each SNP set
├── run_info.txt # Detailed execution information
├── output_structure.txt # Output organization guide
├── best_set.vcf # Optimal set with highest entropy
└── best_set_profile.txt # Profile for the optimal SNP set
A tab-separated values file containing comprehensive statistics for each generated SNP set:
Column | Description |
---|---|
set_index |
1-based index of the SNP set (1, 2, 3, ...) |
output_vcf |
Path to the corresponding VCF file (e.g., "minimal_set_1.vcf") |
num_snps |
Number of SNPs in the set |
min_distance |
Achieved minimum Hamming distance between all sample pairs |
shannon_entropy |
Shannon entropy score measuring chromosome distribution balance |
The Shannon entropy indicates how well SNPs are distributed across chromosomes - higher values indicate more balanced distribution, which is often desirable for genetic studies.
Each profile file (set_N_profile.txt
) provides a compact, human-readable representation of SNP sets:
Header Section:
- SNP set identification and chromosome distribution summary
- Format:
#SNP Set N Profile
and#Chromosome Distribution: Chr1(2); Chr2(3); Chr3(1)
Sample Profiles: Each sample is represented as a line showing all SNP genotypes in the format:
Sample_name: Chr1:Position1(Genotype1) Chr2:Position2(Genotype2) ...
Genotype Format:
- Diploid:
AA
,AT
,TT
(homozygous reference, heterozygous, homozygous alternate) - Triploid:
AAA
,AAT
,TTT
(no separators between alleles) - Missing:
--
(for diploid) or---
(for triploid) - Polyploid: Concatenated alleles without separators
Example Profile Content:
#SNP Set 1 Profile
#Chromosome Distribution: Chr1(2); Chr2(1); Chr3(2)
--------------------------------------------------
Sample_001: Chr1:12345(AA) Chr1:67890(TT) Chr2:11111(AT) Chr3:22222(AA) Chr3:33333(TT)
Sample_002: Chr1:12345(AT) Chr1:67890(AT) Chr2:11111(AA) Chr3:22222(AT) Chr3:33333(AT)
The optimal SNP set automatically selected based on highest Shannon entropy:
best_set.vcf
: Standard VCF format of the optimal SNP setbest_set_profile.txt
: Human-readable profile of the optimal set (same format as other profiles)
Comprehensive execution information including:
- Program version and build details
- Input file specifications and parameters
- Output statistics and timing information
- System memory usage and thresholds
- Complete configuration used for the run
Detailed explanation of all output files and their purposes, serving as a reference for understanding the results.
python mdsearch.py input.vcf results -we 0.8 -wm 0.2
Control the balance between entropy (-we
) and MAF (-wm
) components in SNP selection.
python mdsearch.py input.vcf results -ns unlimited
Find all possible disjoint discriminatory SNP sets.
python mdsearch.py input.vcf results -F json -q
Generate JSON-formatted logs for automated pipeline integration.
Run the comprehensive test suite:
pytest -v
The test suite covers genotype handling, distance calculations, CLI validation, and VCF format compliance.
This project is open-source and available under the MIT License.
Note: MDSearch today represents a major advancement over the version described in the original publication, featuring extensive improvements and new capabilities. If you use MDSearch in your work, please cite the original article.