ChoCallate (Chorus of Callers) is a high-performance, automated pipeline for consensus-based variant calling that combines multiple variant callers to produce robust, high-confidence single-nucleotide variants (SNVs) and insertions/deletions (indels).
ChoCallate addresses a critical challenge in variant calling: individual variant callers can produce different results for the same genomic data, leading to uncertainty in variant identification. By implementing a consensus-driven approach, ChoCallate combines results from multiple state-of-the-art variant callers and applies configurable consensus rules to generate reliable, high-quality variant calls.
- Consensus-driven approach: Combines multiple variant callers using configurable consensus rules
- Ploidy flexibility: Supports both diploid and polyploid species with automatic caller selection
- Multiple consensus types: Majority rule, n-1 consensus, and full consensus options
- Dual input support: Processes both FASTQ (raw reads) and BAM (pre-aligned) files, allowing flexible integration of sequencing data at different analysis stages
- Flexible input compatibility: Works with GBS (Genotyping-by-Sequencing) and WGS data
- Parallel processing: Efficient parallel execution for optimal performance
- Configurable quality filtering: Multiple filtering steps based on coverage, base quality, and SNP quality
- Comprehensive logging: Structured JSON and text logging with detailed execution tracking and performance monitoring
- Smart cleanup: Configurable cleanup options with debug mode preservation
- BCF-native processing: Uses compressed BCF format throughout the pipeline for optimal performance
- Optional single-file output: Merge per-sample results into single multi-sample BCF
```bash
# Clone the repository
git clone https://github.com/alermol/ChoCallate.git
cd ChoCallate

# Set up the Conda environment
conda env create -f environment.yaml
conda activate ChoCallate

# Run the pipeline on test data
bash run_test.sh

# Optional: Clean up test output
bash cleanup.sh
```
Note: The test script expects test data in the `test_data/` directory with the following files:

- `arth_chr1.fasta.gz` — reference genome (compressed FASTA)
- `test_reads_R1.fq.gz` — paired-end read 1 (FASTQ)
- `test_reads_R2.fq.gz` — paired-end read 2 (FASTQ)
- `test_reads_SE.fq.gz` — single-end reads (FASTQ)
- `sample1.bam` — example BAM file for BAM input mode

The directory structure should look like:

```
test_data/
├── arth_chr1.fasta.gz
├── test_reads_R1.fq.gz
├── test_reads_R2.fq.gz
├── test_reads_SE.fq.gz
└── sample1.bam
```
FASTQ input example (default):

```bash
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv
```
BAM input example (no Bowtie2 index required):

```bash
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --input_format bam \
    --reads_type pe \
    --samples_tsv /path/to/samples_bam.tsv
```
Get help and version information:

```bash
# Show version information
nextflow run main.nf --version

# Show help
nextflow run main.nf --help
```
Caller | Diploid Support | Polyploid Support |
---|---|---|
bcftools | ✅ | ❌ |
GATK4 | ✅ | ✅ |
FreeBayes | ✅ | ✅ |
SNVer | ✅ | ✅ |
VarDict | ✅ | ❌ |
- Alignment: Bowtie2-based read alignment with quality filtering and BAM preparation
- Coverage Analysis: Generate coverage information for targeted variant calling
- Zero BCF Generation: Create position-template (zero) BCF with all covered positions
- Variant Calling: Parallel execution of selected variant callers
- Consensus Generation: Merges results using configurable consensus rules with Python-based SQLite processing
- Optional Merge Step: When enabled, merges all samples' BCFs into single final SNP/INDEL BCFs
- Output: Final compressed BCF files for SNPs and INDELs
Parameter | Required | Default | Description |
---|---|---|---|
`--reference_genome` | ✅ | - | Reference genome in FASTA format (gzipped supported) |
`--reference_index` | ✅ for `input_format=fastq` | - | Bowtie2 index prefix for the reference genome (not required for BAM input) |
`--samples_tsv` | ✅ | `input.tsv` | TSV file with sample information |
`--input_format` | ✅ | `fastq` | Input file format: `fastq` or `bam` |
Parameter | Default | Description |
---|---|---|
`--outdir` | `ChoCallate_output` | Output directory for results |
Parameter | Default | Description |
---|---|---|
`--min_coverage` | `5` | Minimum position coverage depth for variant calling |
`--min_base_quality` | `5` | Minimum base quality for variant calling |
`--min_map_qual` | `5` | Minimum mapping quality for read filtering |
`--min_snp_qual` | `5` | Minimum variant quality threshold |
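Taken together, these thresholds act as a conjunctive filter: a site must clear all four to be considered. A minimal sketch of that logic (illustrative only — in the pipeline the thresholds are passed to the callers and bcftools as options, not applied by a function like this):

```python
def site_passes(depth: int, base_qual: float, map_qual: float, variant_qual: float,
                min_coverage: int = 5, min_base_quality: int = 5,
                min_map_qual: int = 5, min_snp_qual: int = 5) -> bool:
    """A site is kept only if every threshold is met (defaults mirror the table above)."""
    return (depth >= min_coverage
            and base_qual >= min_base_quality
            and map_qual >= min_map_qual
            and variant_qual >= min_snp_qual)
```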
Parameter | Default | Choices | Description |
---|---|---|---|
`--input_format` | `fastq` | `fastq`, `bam` | Selects whether the samples TSV lists FASTQ reads or BAM files |
`--reads_type` | `pe` | `pe`, `se`, `mx` | Read type: paired-end, single-end, or mixed |
`--reads_source` | `gbs` | `gbs`, `wgs` | Data source: GBS or whole-genome sequencing |
`--ploidy` | `2` | ≥2 | Ploidy level of the organism |
Parameter | Default | Description |
---|---|---|
`--effective_callers` | `-` | Comma-separated list of variant callers to use (case-insensitive). Use `-` for automatic selection based on ploidy. |
`--cons_type` | `mj` | Consensus type: `mj` (majority), `n1` (n-1), `fc` (full consensus) |
Parameter | Default | Description |
---|---|---|
`--bowtie2_cpu` | `10` | Number of threads for Bowtie2 alignment |
`--bowtie2_forks` | `1` | Number of parallel Bowtie2 processes |
`--calling_forks` | `1` | Number of parallel variant calling processes |
`--zero_bcf_cpu` | `1` | Number of threads for zero BCF generation |
`--zero_bcf_forks` | `1` | Number of parallel zero BCF processes |
`--cons_cpus` | `5` | Number of threads for consensus generation |
`--cons_forks` | `1` | Number of parallel consensus processes |
`--bcftools_cpu` | `1` | Number of threads for bcftools |
`--vardict_cpu` | `1` | Number of threads for VarDict |
`--merge_bcfs_cpus` | `1` | Number of threads for the BCF merge step |
Parameter | Default | Description |
---|---|---|
`--win_size` | `1000000` | Window size (in bp) for parallel consensus generation |
`--debug` | `false` | Keep the working directory after pipeline completion |
`--bowtie2_extra_args` | `""` | Extra arguments passed directly to Bowtie2 during alignment (used as-is) |
`--bcftools_mpileup_extra_args` | `""` | Extra arguments appended to `bcftools mpileup` |
`--bcftools_call_extra_args` | `""` | Extra arguments appended to `bcftools call` |
`--freebayes_extra_args` | `""` | Extra arguments appended to `freebayes` |
`--gatk4_extra_args` | `""` | Extra arguments appended to `gatk HaplotypeCaller` |
`--snver_extra_args` | `""` | Extra arguments appended to `snver` |
`--vardict_extra_args` | `""` | Extra arguments appended to `vardict-java` |
`--bcftools_merge_extra_args` | `""` | Extra arguments appended to `bcftools merge` |
`--merge_bcfs_forks` | `1` | Number of parallel merge processes |
`--single_file` | `false` | If `true`, output one merged pair of final BCFs |
Parameter | Default | Description |
---|---|---|
`--enable_sample_cleanup` | `true` | Enable/disable sample-specific cleanup (`false` in debug mode) |
`--cleanup_intermediate_bam` | `true` | Remove intermediate BAM files (`false` in debug mode) |
`--cleanup_intermediate_bcf` | `true` | Remove intermediate BCF files (`false` in debug mode) |
`--cleanup_intermediate_subfolders` | `true` | Remove intermediate subfolders (`false` in debug mode) |
`--cleanup_input_symlinks` | `true` | Remove symlinks to input files (`false` in debug mode) |
Note: The actual default values are set dynamically based on debug mode. When `--debug` is false (production mode), cleanup is enabled; when `--debug` is true, cleanup is disabled to preserve intermediate files for analysis.
Parameter | Default | Choices | Description |
---|---|---|---|
`--log_level` | `INFO` | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` | Logging level for pipeline execution |
`--log_format` | `json` | `json`, `text`, `both` | Log output format |
`--log_timestamp` | `true` | `true`, `false` | Include timestamps in logs |
`--log_process` | `true` | `true`, `false` | Include process names in logs |
`--log_sample` | `true` | `true`, `false` | Include sample IDs in logs |
`--log_file` | `ChoCallate.log` | - | Main log file path |
`--log_error_file` | `ChoCallate_errors.log` | - | Error log file path |
Parameter | Default | Description |
---|---|---|
`--help` | `false` | Show help message and exit |
`--version` | `false` | Show version information and exit |
- `mj` (Majority Rule): a variant is called if a majority of callers identify it
- `n1` (N-1 Consensus): a variant is called if at least n-1 callers identify it (where n is the total number of callers)
- `fc` (Full Consensus): a variant is called only if all callers identify it
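Each rule reduces to a minimum-vote threshold over the set of callers. A small sketch of that arithmetic (hypothetical helper, not the pipeline's actual `process_snps.py` code):

```python
def consensus_threshold(cons_type: str, n_callers: int) -> int:
    """Minimum number of agreeing callers for a variant to be accepted.

    mj: strict majority; n1: all but one; fc: every caller.
    """
    if cons_type == "mj":
        return n_callers // 2 + 1
    if cons_type == "n1":
        return max(n_callers - 1, 1)
    if cons_type == "fc":
        return n_callers
    raise ValueError(f"unknown consensus type: {cons_type}")


def passes_consensus(votes: int, cons_type: str, n_callers: int) -> bool:
    """True if `votes` callers agreeing is enough under the chosen rule."""
    return votes >= consensus_threshold(cons_type, n_callers)
```

With the five diploid callers, `mj` requires 3 agreeing callers, `n1` requires 4, and `fc` requires all 5.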
The consensus generation uses a sophisticated approach:
- Zero BCF Integration: All covered positions from the zero BCF are included in the final output
- SQLite Processing: Python scripts use SQLite databases for efficient variant comparison and consensus calculation
- Window-based Processing: Genomic regions are processed in parallel using configurable window sizes
- Quality Filtering: Variants are filtered based on quality scores and caller agreement
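The window-based step can be pictured as slicing each contig into fixed-size intervals that are processed independently. A sketch assuming 0-based half-open coordinates (illustrative names, not the pipeline's internal code):

```python
def make_windows(chrom_lengths: dict[str, int], win_size: int = 1_000_000):
    """Yield (chrom, start, end) half-open intervals covering each contig.

    The last window on a contig is truncated to the contig length.
    """
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, win_size):
            yield chrom, start, min(start + win_size, length)
```

Each yielded window can then be handed to a separate consensus worker, which is what makes `--win_size` and `--cons_forks` effective levers for throughput.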
When `--effective_callers` is set to `-` (default), ChoCallate automatically selects appropriate callers from those available:

- Diploid (ploidy=2): uses `bcftools,gatk,freebayes,snver,vardict`
- Polyploid (ploidy>2): uses `gatk,freebayes,snver` (polyploid-compatible callers only)
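The selection rule can be summarized in a few lines. This is a sketch of the documented behavior (hypothetical function, not the pipeline's actual validation code):

```python
DIPLOID_CALLERS = ["bcftools", "gatk", "freebayes", "snver", "vardict"]
POLYPLOID_CALLERS = ["gatk", "freebayes", "snver"]  # polyploid-compatible subset


def select_callers(ploidy: int, effective_callers: str = "-") -> list[str]:
    """'-' means automatic selection by ploidy; otherwise parse the
    comma-separated, case-insensitive list the user supplied."""
    if effective_callers != "-":
        return [c.strip().lower() for c in effective_callers.split(",")]
    return DIPLOID_CALLERS if ploidy == 2 else POLYPLOID_CALLERS
```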
```bash
# Diploid species (default)
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --ploidy 2 \
    --cons_type mj
```
```bash
# Polyploid species
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --ploidy 4 \
    --cons_type n1 \
    --effective_callers gatk,freebayes,snver
```
The structure of `--samples_tsv` depends on two parameters:

- `--input_format`: `fastq` (raw reads) or `bam` (pre-aligned)
- `--reads_type`: `pe`, `se`, or `mx` (applies to FASTQ mode)
Notes (applies to all modes):
- No header line is expected; do not include a header row.
- Fields must be separated by a single TAB character (TSV), not spaces or commas.
Provide 4 columns per sample: `sample_id`, `R1`, `R2`, `SE`.

Required columns by `--reads_type`:

- `pe`: columns 1–3 required (`sample_id`, `R1`, `R2`); column 4 can be `-`
- `se`: columns 1 and 4 required (`sample_id`, `SE`); columns 2 and 3 can be `-`
- `mx`: all 4 columns required (`R1`, `R2`, `SE`)
Examples:

```
# reads_type=pe
sample1	/path/R1.fq.gz	/path/R2.fq.gz	-

# reads_type=se
sample2	-	-	/path/SE.fq.gz

# reads_type=mx
sample3	/path/R1.fq.gz	/path/R2.fq.gz	/path/SE.fq.gz
```
Accepted read formats: `.fq.gz`, `.fastq.gz`, `.fq`, `.fastq`.
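The per-`reads_type` column rules above can be expressed as a small validator. A sketch under the stated TSV conventions (hypothetical helper, not part of the pipeline's codebase):

```python
def validate_fastq_row(fields: list[str], reads_type: str) -> bool:
    """Check a 4-column FASTQ-mode row against the reads_type rules.

    '-' marks an intentionally empty column; required columns must not be '-'.
    """
    if len(fields) != 4:
        return False
    sample_id, r1, r2, se = fields
    if not sample_id or sample_id == "-":
        return False
    # Which of (R1, R2, SE) must be present for each reads_type.
    required = {"pe": (True, True, False),
                "se": (False, False, True),
                "mx": (True, True, True)}[reads_type]
    return all(value != "-" for value, needed in zip((r1, r2, se), required) if needed)
```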
Provide at least 2 columns per sample: `sample_id`, `bam_path`. Columns 3–4 are ignored. `--reads_type` is accepted but does not affect BAM-mode parsing.

Example:

```
sample1	/abs/path/sample1.bam	x	x
```

Notes:

- Column 2 must be a valid `.bam` file.
- Input reads (FASTQ mode): `.fq.gz`, `.fastq.gz`, `.fq`, `.fastq`
- Reference genome: `.fasta`, `.fa`, `.fna` (gzipped or uncompressed)
- Variant caller output: `.bcf` (compressed BCF format)
- Final output: `.bcf` (compressed BCF format)
- Format: FASTA (supports both compressed and uncompressed)
- Index: pre-built Bowtie2 index (required only for `--input_format fastq`)
- Path: absolute paths required
Default (per-sample outputs):

```
ChoCallate_output/
├── sample1/
│   ├── sample1.snps.bcf      # Final SNPs BCF (compressed)
│   └── sample1.indels.bcf    # Final INDELs BCF (compressed)
├── sample2/
│   ├── sample2.snps.bcf
│   └── sample2.indels.bcf
├── ChoCallate_errors.log     # Error log for the entire pipeline
├── ChoCallate.log            # Main log file for the pipeline
├── pipeline_report.html      # Pipeline summary report (HTML)
├── timeline_report.html      # Timeline of process execution (HTML)
└── trace.txt                 # Detailed process trace file
```
Single-file mode (`--single_file`):

```
ChoCallate_output/
├── final.snps.bcf            # Merged SNPs across all samples
├── final.indels.bcf          # Merged INDELs across all samples
├── ChoCallate_errors.log
├── ChoCallate.log
├── pipeline_report.html
├── timeline_report.html
└── trace.txt
```
```bash
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --min_coverage 10 \
    --min_base_quality 30 \
    --min_map_qual 20 \
    --min_snp_qual 30
```
```bash
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --bowtie2_cpu 16 \
    --cons_cpus 8 \
    --win_size 2000000
```
For large genomes or high read counts, increase the Java heap allocated to SNVer and VarDict by patching their launcher scripts:

```bash
# Replace N with the desired RAM in GB
sed -i 's/-Xmx1g/-XmxNg/' $CONDA_PREFIX/bin/snver
sed -i 's/-Xmx8g/-XmxNg/' $CONDA_PREFIX/bin/vardict-java
```
```
ChoCallate/
├── main.nf                          # Main Nextflow pipeline script
├── nextflow.config                  # Pipeline configuration
├── environment.yaml                 # Conda environment specification
├── LICENSE                          # MIT License file
├── functions/                       # Utility functions
│   ├── utils.nf                     # Parameter validation functions
│   ├── logging.nf                   # Logging utilities
│   ├── help_version.nf              # Help and version display module
│   ├── calling.nf                   # Variant calling workflow
│   ├── prepare_bam.nf               # BAM preparation workflow
│   ├── coverage_generation.nf       # Coverage analysis workflow
│   ├── create_fai_index.nf          # FASTA index creation
│   ├── create_seq_dict.nf           # Sequence dictionary creation
│   ├── generate_zero_bcf.nf         # Zero BCF generation workflow
│   ├── generate_consensus.nf        # Consensus generation workflow
│   ├── merge_bcfs.nf                # Merge per-sample BCFs into single outputs
│   └── cleanup_sample_temp.nf       # Sample cleanup workflow
├── bin/                             # Pipeline scripts and variant caller wrappers
│   ├── bcftools_caller.sh           # BCFtools variant calling
│   ├── gatk4_caller.sh              # GATK4 variant calling
│   ├── freebayes_caller.sh          # FreeBayes variant calling
│   ├── snver_caller.sh              # SNVer variant calling
│   ├── vardict_caller.sh            # VarDict variant calling
│   ├── consensus_generation.sh      # Consensus generation script
│   ├── prepare_bam.sh               # BAM preparation and alignment script
│   ├── process_snps.py              # Python script for SNPs consensus
│   └── process_indels.py            # Python script for indels consensus
├── run_test.sh                      # Test execution script
├── cleanup.sh                       # Test cleanup script
└── README.md                        # This file
```
All dependencies are managed via Conda:

```yaml
# Core variant callers
- freebayes>=1.3.9
- gatk4=4.6.*
- snver=0.5.3
- vardict-java=1.8.3
- bcftools>=1.20

# Alignment and processing
- bowtie2
- samtools>=1.21
- bedtools
- bedops>=2.4.42

# Pipeline framework
- nextflow
- python
- tabix>=1.11
- parallel
```
- Memory errors: Increase memory allocation for SNVer/VarDict
- Disk space: Monitor available disk space for intermediate files
- Path issues: Use absolute paths for input files
```bash
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --debug \
    --log_level DEBUG
```
Debug mode preserves all intermediate files for analysis.
```bash
# Disable cleanup for debugging
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --enable_sample_cleanup false \
    --debug

# Custom cleanup configuration
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv samples.tsv \
    --cleanup_intermediate_bam false \
    --cleanup_intermediate_bcf true
```
APA Style:
Ermolaev, A. (2025). ChoCallate: Consensus variant calling pipeline [Computer software]. GitHub. https://github.com/alermol/ChoCallate
BibTeX:

```bibtex
@software{ChoCallate,
  author = {Ermolaev, A.},
  title  = {ChoCallate: Consensus variant calling pipeline},
  url    = {https://github.com/alermol/ChoCallate},
  year   = {2025}
}
```
ChoCallate is actively developed with a clear vision for future enhancements. Here's a roadmap for upcoming versions:
- Add New Germline Variant Callers
- Add New Short Read Mapping Tools
- Add Somatic Variant Callers
- Add Long-Read Variant Callers
- Add Long-Read Mapping Tools
- Add AI-Powered Features
- ML-based automatic consensus generation
- AI-powered variant quality assessment
- Add Containerized Solution
- Performance Optimization: Implement advanced strategies to significantly reduce pipeline runtime
- Error Handling: Improved error recovery and user feedback
- New Variant Callers: Integration of cutting-edge tools
- Quality Metrics: Enhanced quality assessment and reporting
- Format Support: Additional input/output format compatibility
We welcome contributions from the community! Here's how you can help:
- Core Pipeline: Nextflow workflow optimization
- Variant Callers: Integration of new variant calling tools
- Consensus Algorithms: Improved consensus generation methods
- Quality Control: Enhanced quality assessment tools
- Documentation: User guides and technical documentation
- Fork the repository
- Create a feature branch
- Implement your changes
- Add documentation
- Submit a pull request
MIT License - see LICENSE file for details.
Need help? Open an issue on GitHub or check our troubleshooting guide above.