Automated benchmarking of recombination detection methods using simulated and empirical datasets.
Associated publication: https://academic.oup.com/ve/article/9/2/vead066/7444193
- sim.nf: Nextflow (DSL1) pipeline to simulate datasets with SANTA-SIM and benchmark multiple recombination detection methods (RDMs).
- empirical.nf: Nextflow pipeline to run RDMs on empirical FASTA datasets.
- processes.nf: Shared Nextflow processes for OpenRDP methods (rdp, maxchi, chimaera).
- nextflow.config: Default parameters, executors, and reporting/tracing configuration.
- bin/: Binaries and helper scripts for simulation parsing, conditions, and plotting inputs.
- src/: Rmd analysis notebooks, rendered HTML reports, and helper shell scripts for post-processing.
- data/: Input FASTA and SANTA-SIM XML templates plus parameter-sweep XMLs.
- figs/: Generated figures from the analysis notebooks.
- tests/: Python tests validating parsing and condition-calculation utilities.
- log.sh: A runnable "lab notebook" that documents the steps used for simulation, analysis, and empirical runs.
- Nextflow 22.10 (DSL1-compatible)
- Conda environment defined in environment.yml
External tools used by the pipelines:
- SANTA-SIM (custom bin/santa_bp.jar included)
- PhiPack (Profile)
- 3SEQ (bin/3seq_elf included)
- GENECONV (bin/geneconv included)
- UCHIME via vsearch
- GMOS (bin/gmos included)
- OpenRDP (external; not included)
Install the conda environment if Nextflow does not create it automatically:
conda env create -f environment.yml
conda activate fredjaya-rec-bench-0.1.0

Generate the 3SEQ p-value table once:

bin/3seq_elf -gen-p bin/p700 700

sim.nf simulates FASTA datasets over parameter sweeps and runs multiple RDMs on them.
Key inputs:
- --mode: performance or scalability
- --seq: input FASTA for SANTA-SIM seeding
- --xml: SANTA-SIM XML template
- --out: output directory
Example:
nextflow run sim.nf \
--mode scalability \
--seq data/FP7_patient_037_allseqs.fasta \
--xml data/neutral.xml \
--out "$(pwd)"

High-level steps in sim.nf:
- S1_*: filter FASTA and prepare SANTA-SIM inputs
- S3_*: parameter sweep of XMLs
- S4_*: run SANTA-SIM
- B*: run RDMs (PhiPack Profile, 3SEQ, GENECONV, UCHIME, GMOS)
- rdp/maxchi/chimaera (OpenRDP) are included at the end of the file
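The parameter-sweep step can be pictured as filling one XML template with every combination of a few values. The following is a minimal Python sketch of that idea, not the pipeline's own code: the `sweep` function, the toy `TEMPLATE`, and the element names and values in the grid are assumptions for illustration, and the real templates in data/ are considerably richer.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified template (assumption); not a real SANTA-SIM config.
TEMPLATE = """<santa>
  <mutationRate>PLACEHOLDER</mutationRate>
  <recombinationProbability>PLACEHOLDER</recombinationProbability>
</santa>"""


def sweep(template_xml, grid):
    """Yield (label, xml_string) for every combination of values in grid.

    grid maps an element tag to the list of values to try for that element.
    """
    root = ET.fromstring(template_xml)
    tags = sorted(grid)
    for values in itertools.product(*(grid[tag] for tag in tags)):
        variant = copy.deepcopy(root)  # never mutate the shared template
        for tag, value in zip(tags, values):
            node = variant.find(f".//{tag}")
            if node is None:
                raise KeyError(f"template has no <{tag}> element")
            node.text = str(value)
        label = "_".join(f"{tag}{value}" for tag, value in zip(tags, values))
        yield label, ET.tostring(variant, encoding="unicode")


# Illustrative grid only; these are not the published settings.
configs = dict(sweep(TEMPLATE, {
    "mutationRate": [1e-5, 1e-4],
    "recombinationProbability": [0.0, 0.01, 0.1],
}))
```

Each entry in `configs` could then be written to its own file and passed to one simulation run, which keeps the sweep reproducible from a single template.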
empirical.nf runs RDMs on empirical datasets, producing raw tool outputs for downstream parsing and visualization.
Notes:
- Uses a hardcoded input path glob and output directory inside empirical.nf. Update these before running on a new system.
processes.nf contains reusable Nextflow processes for rdp, maxchi, and chimaera that operate on an input FASTA channel (params.fa).
- src/1_sim_stats.sh: generates simulation stats and derived CSVs
- src/2_conditions.sh: computes TP/FP/TN/FN conditions for tools
- src/*.Rmd: analysis notebooks for performance, scaling, and empirical results
- src/*.html: rendered notebooks
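To make the TP/FP/TN/FN conditions concrete, here is a minimal Python sketch of how one detection result could be scored against the simulated truth. It is an illustration under assumptions, not the logic of src/2_conditions.sh or the bin/ helpers: the function names, the dataset-level granularity, and the 0.05 threshold are all hypothetical.

```python
from collections import Counter


def classify(has_true_recombination, p_value, alpha=0.05):
    """Return 'TP', 'FP', 'TN' or 'FN' for one simulated dataset.

    p_value is the tool's reported significance, or None if it reported nothing.
    alpha is an assumed significance threshold.
    """
    detected = p_value is not None and p_value < alpha
    if has_true_recombination:
        return "TP" if detected else "FN"
    return "FP" if detected else "TN"


def summarise(conditions):
    """Compute power and false positive rate from a list of condition labels."""
    counts = Counter(conditions)
    tp, fp, tn, fn = (counts[key] for key in ("TP", "FP", "TN", "FN"))
    return {
        "power": tp / (tp + fn) if tp + fn else None,
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
    }
```

For example, `classify(True, 0.01)` gives `"TP"`, while a tool that reports nothing on a recombinant dataset, `classify(True, None)`, gives `"FN"`.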
The log.sh file documents a full run from simulation through plots, including manual steps and troubleshooting notes.
Key helpers in bin/ include:
- S1_filter_fasta.py: removes gappy sequences for SANTA-SIM
- V1_santa_stats.py, V2_santa_bp.py, V3_sim_bp.R: derive true breakpoints and stats
- F1_addCondition_phiProfile.py, F2_addCondition_3SEQ.py, F3_addCondition_geneconv2.py: compute detection conditions
- F3_concat_gc_outputs.py, F3_separate_seq_pairs.R: GENECONV parsing
- V4_sim_distances.R, V5_fasta_to_bpcounts.py: additional simulation metrics
- Binaries: 3seq_elf, geneconv, gmos, santa_bp.jar
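As an illustration of what removing gappy sequences can look like, the Python sketch below drops FASTA records whose gap fraction exceeds a cutoff. It is not the code of S1_filter_fasta.py: the function names, the 10% cutoff, and the choice of "-" as the only gap character are assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_gappy(records, max_gap_fraction=0.1, gap_chars="-"):
    """Keep records whose fraction of gap characters is at most max_gap_fraction."""
    for header, seq in records:
        if not seq:
            continue  # skip empty records rather than divide by zero
        gaps = sum(seq.count(char) for char in gap_chars)
        if gaps / len(seq) <= max_gap_fraction:
            yield header, seq


example = [">a", "ACGT", "ACGT", ">b", "A---", "----", ">c", "ACG-ACGTAC"]
kept = list(drop_gappy(read_fasta(example)))
```

Here `kept` retains records `a` and `c` and drops `b`, which is almost entirely gaps.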
tests/ contains tests for parsing and condition-calculation utilities (e.g., 3SEQ and PhiPack helpers, trace parsing).
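For the trace-parsing side, the sketch below shows one way to total runtime per process from a tab-separated Nextflow trace file. It assumes the trace has `name` and `realtime` columns with human-readable durations such as `1m 30s` or `350ms`. The function names and the example process names are hypothetical, and this is not the repository's own parser.

```python
import csv
import re
from collections import defaultdict

# Seconds per unit for human-readable durations (assumed trace format).
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)")


def duration_to_seconds(text):
    """Convert a duration such as '1h 2m 3s' or '350ms' to seconds."""
    total = 0.0
    for token in text.split():
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised duration token: {token!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
    return total


def realtime_by_process(trace_lines):
    """Sum the realtime column per process, ignoring the '(tag)' suffix in names."""
    totals = defaultdict(float)
    for row in csv.DictReader(trace_lines, delimiter="\t"):
        process = row["name"].split(" (")[0]
        totals[process] += duration_to_seconds(row["realtime"])
    return dict(totals)
```

Passing an open trace file handle to `realtime_by_process` yields a dictionary of total seconds per process, which is one possible input for scalability plots.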
- Many paths are hardcoded and may need to be tweaked for future runs
- Configuration is not optimised for efficiency
- Start with log.sh to see the exact commands and order used in the original runs.
- Use the provided data/ inputs and data/xml/ parameter-sweep templates to match published settings.
- Verify tool versions in environment.yml and in any external installations (OpenRDP, IQ-TREE, etc.).