ChoCallate 🍫

ChoCallate (Chorus of Callers) is a high-performance, automated pipeline for consensus-based variant calling that combines multiple variant callers to produce robust, high-confidence single-nucleotide variants (SNVs) and insertions/deletions (INDELs).

What is ChoCallate?

ChoCallate addresses a critical challenge in variant calling: individual variant callers can produce different results for the same genomic data, leading to uncertainty in variant identification. By implementing a consensus-driven approach, ChoCallate combines results from multiple state-of-the-art variant callers and applies configurable consensus rules to generate reliable, high-quality variant calls.

Key Features

Consensus-driven approach: Combines multiple variant callers using configurable consensus rules
Ploidy flexibility: Supports both diploid and polyploid species with automatic caller selection
Multiple consensus types: Majority rule, n-1 consensus, and full consensus options
Dual input support: Processes both FASTQ (raw reads) and BAM (pre-aligned) files, allowing flexible integration of sequencing data at different analysis stages
Flexible input compatibility: Works with GBS (Genotyping-by-Sequencing) and WGS data
Parallel processing: Efficient parallel execution for optimal performance
Configurable quality filtering: Multiple filtering steps based on coverage, base quality, and SNP quality
Comprehensive logging: Structured JSON and text logging with detailed execution tracking and performance monitoring
Smart cleanup: Configurable cleanup options with debug mode preservation
BCF-native processing: Uses compressed BCF format throughout the pipeline for optimal performance
Optional single-file output: Merges per-sample results into one pair of multi-sample BCFs (SNPs and INDELs)

Quick Start

1. Installation

# Clone the repository
git clone https://github.com/alermol/ChoCallate.git
cd ChoCallate

# Set up the Conda environment
conda env create -f environment.yaml
conda activate ChoCallate

2. Test Run

# Run the pipeline on test data
bash run_test.sh

# Optional: Clean up test output
bash cleanup.sh

Note: The test script expects test data in the test_data/ directory with the following files:

arth_chr1.fasta.gz — Reference genome (compressed FASTA)
test_reads_R1.fq.gz — Paired-end read 1 (FASTQ)
test_reads_R2.fq.gz — Paired-end read 2 (FASTQ)
test_reads_SE.fq.gz — Single-end reads (FASTQ)
sample1.bam — Example BAM file for BAM input mode

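The directory structure should look like:

test_data/
├── arth_chr1.fasta.gz
├── test_reads_R1.fq.gz
├── test_reads_R2.fq.gz
├── test_reads_SE.fq.gz
└── sample1.bam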

3. Basic Usage

nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv

BAM input example (no Bowtie2 index required):

nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --input_format bam \
    --reads_type pe \
    --samples_tsv /path/to/samples_bam.tsv

4. Command-Line Help

Get help and version information:

# Show version information
nextflow run main.nf --version

# Show help
nextflow run main.nf --help

Pipeline Architecture

Supported Variant Callers

Caller	Diploid Support	Polyploid Support
bcftools	✅	❌
GATK4	✅	✅
FreeBayes	✅	✅
SNVer	✅	✅
VarDict	✅	❌

Workflow Scheme

Alignment: Bowtie2-based read alignment with quality filtering and BAM preparation
Coverage Analysis: Generate coverage information for targeted variant calling
Zero BCF Generation: Create position-template (zero) BCF with all covered positions
Variant Calling: Parallel execution of selected variant callers
Consensus Generation: Merges results using configurable consensus rules with Python-based SQLite processing
Optional Merge Step: When enabled, merges all samples' BCFs into single final SNP/INDEL BCFs
Output: Final compressed BCF files for SNPs and INDELs

Configuration

Essential Parameters

Parameter	Required	Default	Description
`--reference_genome`	✅	-	Reference genome in FASTA format (supports gzipped)
`--reference_index`	✅ for `input_format=fastq`	-	Bowtie2 index prefix for the reference genome (not required for BAM input)
`--samples_tsv`	✅	`input.tsv`	TSV file with sample information
`--input_format`	✅	`fastq`	Input files format: `fastq` or `bam`

Input/Output Parameters

Parameter	Default	Description
`--outdir`	`ChoCallate_output`	Output directory for results

Quality and Filtering Parameters

Parameter	Default	Description
`--min_coverage`	`5`	Minimum position coverage depth for variant calling
`--min_base_quality`	`5`	Minimum base quality for variant calling
`--min_map_qual`	`5`	Minimum mapping quality for read filtering
`--min_snp_qual`	`5`	Minimum variant quality threshold

Data Type Parameters

Parameter	Default	Choices	Description
`--input_format`	`fastq`	`fastq`, `bam`	Selects whether samples TSV lists FASTQ reads or BAM files
`--reads_type`	`pe`	`pe`, `se`, `mx`	Read type: paired-end, single-end, or mixed
`--reads_source`	`gbs`	`gbs`, `wgs`	Data source: GBS or whole genome sequencing
`--ploidy`	`2`	`≥2`	Ploidy level of the organism

Variant Calling Parameters

Parameter	Default	Description
`--effective_callers`	`-`	Comma-separated list of variant callers to use (case-insensitive). Use `-` for automatic selection based on ploidy.
`--cons_type`	`mj`	Consensus type: `mj` (majority), `n1` (n-1), `fc` (full consensus)

Resource Allocation Parameters

Parameter	Default	Description
`--bowtie2_cpu`	`10`	Number of threads for Bowtie2 alignment
`--bowtie2_forks`	`1`	Number of parallel Bowtie2 processes
`--calling_forks`	`1`	Number of parallel variant calling processes
`--zero_bcf_cpu`	`1`	Number of threads for zero BCF generation
`--zero_bcf_forks`	`1`	Number of parallel zero BCF processes
`--cons_cpus`	`5`	Number of threads for consensus generation
`--cons_forks`	`1`	Number of parallel consensus processes
`--bcftools_cpu`	`1`	Number of threads for bcftools
`--vardict_cpu`	`1`	Number of threads for VarDict
`--merge_bcfs_cpus`	`1`	Number of threads for BCF merge step

Processing Parameters

Parameter	Default	Description
`--win_size`	`1000000`	Window size (in bp) for parallel consensus generation
`--debug`	`false`	Keep working directory after pipeline completion
`--bowtie2_extra_args`	`""`	Extra arguments passed directly to Bowtie2 during alignment (use as is)
`--bcftools_mpileup_extra_args`	`""`	Extra arguments appended to `bcftools mpileup`
`--bcftools_call_extra_args`	`""`	Extra arguments appended to `bcftools call`
`--freebayes_extra_args`	`""`	Extra arguments appended to `freebayes`
`--gatk4_extra_args`	`""`	Extra arguments appended to `gatk HaplotypeCaller`
`--snver_extra_args`	`""`	Extra arguments appended to `snver`
`--vardict_extra_args`	`""`	Extra arguments appended to `vardict-java`
`--bcftools_merge_extra_args`	`""`	Extra arguments appended to `bcftools merge`
`--merge_bcfs_forks`	`1`	Number of parallel merge processes
`--single_file`	`false`	If `true`, output one merged pair of final BCFs

Cleanup Configuration Parameters

Parameter	Default	Description
`--enable_sample_cleanup`	`true`	Enable/disable sample-specific cleanup (false in debug mode)
`--cleanup_intermediate_bam`	`true`	Remove intermediate BAM files (false in debug mode)
`--cleanup_intermediate_bcf`	`true`	Remove intermediate BCF files (false in debug mode)
`--cleanup_intermediate_subfolders`	`true`	Remove intermediate subfolders (false in debug mode)
`--cleanup_input_symlinks`	`true`	Remove symlinks to input files (false in debug mode)

Note: The actual default values are dynamically set based on debug mode. When --debug is false (production mode), cleanup is enabled. When --debug is true, cleanup is disabled to preserve intermediate files for analysis.
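
The interaction between --debug and the cleanup flags can be sketched as follows. This is only an illustration of the rule described in the note above: the function name `resolve_cleanup` is hypothetical, and the assumption that an explicitly passed flag overrides the debug-derived default is not confirmed by the pipeline source.

```python
from typing import Dict, Optional

# Flag names copied from the cleanup parameter table.
CLEANUP_FLAGS = (
    "enable_sample_cleanup",
    "cleanup_intermediate_bam",
    "cleanup_intermediate_bcf",
    "cleanup_intermediate_subfolders",
    "cleanup_input_symlinks",
)


def resolve_cleanup(debug: bool, overrides: Optional[Dict[str, bool]] = None) -> Dict[str, bool]:
    """Return the effective cleanup settings for a run.

    Defaults are True in production mode and False in debug mode.
    Assumption: explicitly supplied values take precedence over defaults.
    """
    settings = {flag: not debug for flag in CLEANUP_FLAGS}
    for flag, value in (overrides or {}).items():
        if flag not in settings:
            raise ValueError(f"unknown cleanup flag: {flag}")
        settings[flag] = bool(value)
    return settings
```

For example, `resolve_cleanup(False, {"cleanup_intermediate_bam": False})` keeps intermediate BAM files while leaving every other cleanup step enabled.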

Logging Parameters

Parameter	Default	Choices	Description
`--log_level`	`INFO`	`DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL`	Logging level for pipeline execution
`--log_format`	`json`	`json`, `text`, `both`	Log output format
`--log_timestamp`	`true`	`true`, `false`	Include timestamps in logs
`--log_process`	`true`	`true`, `false`	Include process names in logs
`--log_sample`	`true`	`true`, `false`	Include sample IDs in logs
`--log_file`	`ChoCallate.log`	-	Main log file path
`--log_error_file`	`ChoCallate_errors.log`	-	Error log file path

Help and Version Parameters

Parameter	Default	Description
`--help`	`false`	Show help message and exit
`--version`	`false`	Show version information and exit

Consensus Types

mj (Majority Rule): A variant is called if a majority of callers identify it
n1 (N-1 Consensus): A variant is called if at least n-1 callers identify it (where n is the total number of callers)
fc (Full Consensus): A variant is called only if all callers identify it
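
Each rule boils down to a minimum number of supporting callers. The sketch below is illustrative and not taken from the pipeline's scripts: the function names are hypothetical, and it assumes that "majority" means strictly more than half of the callers and that n-1 means at least n-1. The pipeline may treat ties differently.

```python
def min_supporting_callers(cons_type: str, n_callers: int) -> int:
    """Return the minimum number of callers that must report a variant.

    Assumptions (not confirmed by the pipeline source):
    - "mj" requires strictly more than half of the callers;
    - "n1" requires at least n-1 callers.
    """
    if n_callers < 1:
        raise ValueError("at least one caller is required")
    if cons_type == "mj":
        return n_callers // 2 + 1
    if cons_type == "n1":
        return max(n_callers - 1, 1)
    if cons_type == "fc":
        return n_callers
    raise ValueError(f"unknown consensus type: {cons_type!r}")


def is_called(n_supporting: int, cons_type: str, n_callers: int) -> bool:
    """Decide whether a variant seen by n_supporting callers passes the rule."""
    return n_supporting >= min_supporting_callers(cons_type, n_callers)
```

Under these assumptions, five callers give thresholds of 3 (mj), 4 (n1), and 5 (fc), while with three callers mj and n1 both require 2.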

Consensus Implementation

Consensus generation combines four mechanisms:

Zero BCF Integration: All covered positions from the zero BCF are included in the final output
SQLite Processing: Python scripts use SQLite databases for efficient variant comparison and consensus calculation
Window-based Processing: Genomic regions are processed in parallel using configurable window sizes
Quality Filtering: Variants are filtered based on quality scores and caller agreement
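
To make the SQLite step more concrete, here is a minimal sketch of counting caller agreement per variant with Python's built-in `sqlite3` module. The table layout, column names, and the function `consensus_variants` are assumptions made for illustration; they are not the schema used by process_snps.py or process_indels.py, and zero BCF integration and quality filtering are left out.

```python
import sqlite3


def consensus_variants(calls, min_support):
    """Return variants reported by at least min_support distinct callers.

    calls: iterable of (chrom, pos, ref, alt, caller) tuples.
    Result rows are (chrom, pos, ref, alt, support), sorted by position.
    """
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        con.execute(
            "CREATE TABLE calls "
            "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, caller TEXT)"
        )
        con.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", calls)
        rows = con.execute(
            """
            SELECT chrom, pos, ref, alt, COUNT(DISTINCT caller) AS support
            FROM calls
            GROUP BY chrom, pos, ref, alt
            HAVING COUNT(DISTINCT caller) >= ?
            ORDER BY chrom, pos
            """,
            (min_support,),
        ).fetchall()
    finally:
        con.close()
    return rows


# Hypothetical calls from three callers on one sample.
example_calls = [
    ("chr1", 100, "A", "G", "gatk"),
    ("chr1", 100, "A", "G", "freebayes"),
    ("chr1", 100, "A", "G", "snver"),
    ("chr1", 250, "C", "T", "gatk"),
]
```

Calling `consensus_variants(example_calls, 2)` keeps only the chr1:100 A>G variant, which all three callers reported; the chr1:250 call is dropped because a single caller supports it.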

Automatic Caller Selection

When --effective_callers is set to - (the default), ChoCallate automatically selects the callers that support the requested ploidy:

Diploid (ploidy=2): Uses bcftools,gatk,freebayes,snver,vardict
Polyploid (ploidy>2): Uses gatk,freebayes,snver (polyploid-compatible callers only)
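
The selection rule above can be written out as a short function. The two caller lists come from this README; the function name `select_callers` and the choice to reject a caller that does not support the requested ploidy are assumptions for illustration, and the pipeline may respond differently to such input.

```python
from __future__ import annotations

# Caller lists as given in this README.
DIPLOID_CALLERS = ["bcftools", "gatk", "freebayes", "snver", "vardict"]
POLYPLOID_CALLERS = ["gatk", "freebayes", "snver"]


def select_callers(effective_callers: str = "-", ploidy: int = 2) -> list[str]:
    """Resolve the caller list for a run.

    "-" selects every caller compatible with the ploidy; otherwise the
    comma-separated list is parsed case-insensitively.
    """
    if ploidy < 2:
        raise ValueError("ploidy must be >= 2")
    supported = DIPLOID_CALLERS if ploidy == 2 else POLYPLOID_CALLERS
    if effective_callers.strip() == "-":
        return list(supported)
    requested = [c.strip().lower() for c in effective_callers.split(",") if c.strip()]
    # Assumption: a caller without support for this ploidy is an error.
    unsupported = [c for c in requested if c not in supported]
    if unsupported:
        raise ValueError(f"callers not available for ploidy {ploidy}: {unsupported}")
    return requested
```

With the defaults, `select_callers()` returns all five diploid callers, and `select_callers("-", ploidy=4)` returns only gatk, freebayes, and snver.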

Ploidy and Caller Selection Examples

# Diploid species (default)
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --ploidy 2 \
    --cons_type mj

# Polyploid species
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --ploidy 4 \
    --cons_type n1 \
    --effective_callers gatk,freebayes,snver

Input Data Structure

Samples TSV Format

The structure of --samples_tsv depends on two parameters:

--input_format: fastq (raw reads) or bam (pre-aligned)
--reads_type: pe, se, or mx (applies to FASTQ mode)

Notes (apply to all modes):

No header line is expected; do not include a header row.
Fields must be separated by a single TAB character (TSV), not spaces or commas.

FASTQ mode (`--input_format fastq`)

Provide 4 columns per sample: sample_id, R1, R2, SE.

Required columns by --reads_type:
- pe: columns 1, 2, and 3 are required (sample_id, R1, R2). Column 4 can be -.
- se: columns 1 and 4 are required (sample_id, SE). Columns 2 and 3 can be -.
- mx: all 4 columns are required (sample_id, R1, R2, SE).

Examples:

# reads_type=pe
sample1    /path/R1.fq.gz    /path/R2.fq.gz    -

# reads_type=se
sample2    -                 -                 /path/SE.fq.gz

# reads_type=mx
sample3    /path/R1.fq.gz    /path/R2.fq.gz    /path/SE.fq.gz

Accepted read formats: .fq.gz, .fastq.gz, .fq, .fastq.
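
Because a malformed TSV is an easy mistake to make, it can help to check the file before launching a run. The following is a minimal sketch of such a check for FASTQ mode, based only on the rules listed above; `check_fastq_row` is a hypothetical helper, not part of ChoCallate, and it does not verify that the files exist.

```python
from __future__ import annotations

READ_SUFFIXES = (".fq.gz", ".fastq.gz", ".fq", ".fastq")
# 0-based indices of the read columns required for each --reads_type.
REQUIRED_COLUMNS = {"pe": (1, 2), "se": (3,), "mx": (1, 2, 3)}


def check_fastq_row(line: str, reads_type: str) -> list[str]:
    """Return the problems found in one samples TSV line (empty if none)."""
    if reads_type not in REQUIRED_COLUMNS:
        raise ValueError(f"unknown reads_type: {reads_type!r}")
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 tab-separated columns, found {len(fields)}"]
    problems = []
    if fields[0] in ("", "-"):
        problems.append("sample_id is missing")
    for idx in REQUIRED_COLUMNS[reads_type]:
        value = fields[idx]
        if value in ("", "-"):
            problems.append(f"column {idx + 1} is required for reads_type={reads_type}")
        elif not value.endswith(READ_SUFFIXES):
            problems.append(f"column {idx + 1} has an unsupported extension: {value}")
    return problems
```

A line separated by spaces instead of tabs is reported as having the wrong number of columns, which is the most common formatting error this check catches.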

BAM mode (`--input_format bam`)

Provide at least 2 columns per sample: sample_id, bam_path. Columns 3–4 are ignored.

--reads_type is accepted but does not affect BAM mode parsing.

Example:

sample1    /abs/path/sample1.bam    x    x

Notes:

Column 2 must be a valid .bam file.
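
BAM-mode lines can be parsed in the same spirit. The helper below is a hypothetical sketch that follows the rules above (at least two tab-separated columns, extra columns ignored); it checks only the .bam extension and does not open the file, so it cannot confirm that the BAM itself is valid.

```python
from __future__ import annotations


def parse_bam_row(line: str) -> tuple[str, str]:
    """Return (sample_id, bam_path) from one BAM-mode samples TSV line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least 2 tab-separated columns")
    sample_id, bam_path = fields[0], fields[1]  # columns 3 and 4 are ignored
    if not sample_id:
        raise ValueError("sample_id is missing")
    if not bam_path.endswith(".bam"):
        raise ValueError(f"column 2 is not a .bam file: {bam_path}")
    return sample_id, bam_path
```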

File Format Support

Input reads (FASTQ mode): .fq.gz, .fastq.gz, .fq, .fastq
Reference genome: .fasta, .fa, .fna (gzipped or ungzipped)
Variant caller output: .bcf (compressed BCF format)
Final output: .bcf (compressed BCF format)

Reference Requirements

Format: FASTA (supports both compressed and uncompressed)
Index: Pre-built Bowtie2 index (required only for --input_format fastq)
Path: Absolute paths required

Output Structure

Default (per-sample outputs):

ChoCallate_output/
├── sample1/
│   ├── sample1.snps.bcf      # Final SNPs BCF (compressed)
│   └── sample1.indels.bcf    # Final INDELs BCF (compressed)
├── sample2/
│   ├── sample2.snps.bcf
│   └── sample2.indels.bcf
├── ChoCallate_errors.log         # Error log for the entire pipeline
├── ChoCallate.log                # Main log file for the pipeline
├── pipeline_report.html          # Pipeline summary report (HTML)
├── timeline_report.html          # Timeline of process execution (HTML)
└── trace.txt                     # Detailed process trace file

Single-file mode (--single_file):

ChoCallate_output/
├── final.snps.bcf               # Merged SNPs across all samples
├── final.indels.bcf             # Merged INDELs across all samples
├── ChoCallate_errors.log
├── ChoCallate.log
├── pipeline_report.html
├── timeline_report.html
└── trace.txt
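
Downstream scripts often need to know which BCF files a run should have produced. The sketch below derives those paths from the two layouts shown above; `expected_bcfs` is a hypothetical helper, and it only builds paths without checking that the files exist.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def expected_bcfs(outdir: str, samples: list[str], single_file: bool = False) -> list[str]:
    """List the final BCF paths for a run, following the layouts above."""
    root = PurePosixPath(outdir)
    if single_file:
        # --single_file: one merged pair for all samples.
        return [str(root / "final.snps.bcf"), str(root / "final.indels.bcf")]
    paths = []
    for sample in samples:
        # Default: one SNP BCF and one INDEL BCF per sample directory.
        paths.append(str(root / sample / f"{sample}.snps.bcf"))
        paths.append(str(root / sample / f"{sample}.indels.bcf"))
    return paths
```

For two samples in the default layout this yields four paths, two under each sample directory; with `single_file=True` it yields the two merged files regardless of the number of samples.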

Advanced Configuration

Quality Filtering

nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --min_coverage 10 \
    --min_base_quality 30 \
    --min_map_qual 20 \
    --min_snp_qual 30

Resource Allocation

nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --bowtie2_cpu 16 \
    --cons_cpus 8 \
    --win_size 2000000
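
To give a sense of what --win_size controls, the sketch below tiles a contig into fixed-size windows that could then be processed in parallel. The function name and the 1-based, inclusive coordinates are assumptions for illustration; the pipeline's own windowing may differ in its details.

```python
from __future__ import annotations


def split_into_windows(contig_length: int, win_size: int = 1_000_000) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) windows covering a contig."""
    if contig_length < 0:
        raise ValueError("contig_length must be non-negative")
    if win_size < 1:
        raise ValueError("win_size must be at least 1")
    return [
        (start, min(start + win_size - 1, contig_length))
        for start in range(1, contig_length + 1, win_size)
    ]
```

A hypothetical 2.5 Mb contig with the default window size produces three windows, the last one shorter than the others. Larger windows mean fewer, longer consensus tasks; smaller windows mean more tasks that can be spread across --cons_cpus threads.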

Memory Optimization

For large genomes or high read counts, adjust memory allocation:

# Replace N with the desired RAM in GB before running (for example, -Xmx16g)
sed -i 's/-Xmx1g/-XmxNg/' "$CONDA_PREFIX/bin/snver"
sed -i 's/-Xmx8g/-XmxNg/' "$CONDA_PREFIX/bin/vardict-java"

Project Structure

ChoCallate/
├── main.nf                      # Main Nextflow pipeline script
├── nextflow.config              # Pipeline configuration
├── environment.yaml             # Conda environment specification
├── LICENSE                      # MIT License file
├── functions/                   # Utility functions
│   ├── utils.nf                 # Parameter validation functions
│   ├── logging.nf               # Logging utilities
│   ├── help_version.nf          # Help and version display module
│   ├── calling.nf               # Variant calling workflow
│   ├── prepare_bam.nf           # BAM preparation workflow
│   ├── coverage_generation.nf   # Coverage analysis workflow
│   ├── create_fai_index.nf      # FASTA index creation
│   ├── create_seq_dict.nf       # Sequence dictionary creation
│   ├── generate_zero_bcf.nf     # Zero BCF generation workflow
│   ├── generate_consensus.nf    # Consensus generation workflow
│   ├── merge_bcfs.nf            # Merge per-sample BCFs into single outputs
│   └── cleanup_sample_temp.nf   # Sample cleanup workflow
├── bin/                         # Pipeline scripts and variant caller wrappers
│   ├── bcftools_caller.sh       # BCFtools variant calling
│   ├── gatk4_caller.sh          # GATK4 variant calling
│   ├── freebayes_caller.sh      # FreeBayes variant calling
│   ├── snver_caller.sh          # SNVer variant calling
│   ├── vardict_caller.sh        # VarDict variant calling
│   ├── consensus_generation.sh  # Consensus generation script
│   ├── prepare_bam.sh           # BAM preparation and alignment script
│   ├── process_snps.py          # Python script for SNPs consensus
│   └── process_indels.py        # Python script for indels consensus
├── run_test.sh                  # Test execution script
├── cleanup.sh                   # Test cleanup script
└── README.md                    # This file

Dependencies

All dependencies are managed via Conda:

# Core variant callers
- freebayes>=1.3.9
- gatk4=4.6.*
- snver=0.5.3
- vardict-java=1.8.3
- bcftools>=1.20

# Alignment and processing
- bowtie2
- samtools>=1.21
- bedtools
- bedops>=2.4.42

# Pipeline framework
- nextflow
- python
- tabix>=1.11
- parallel

Troubleshooting

Common Issues

Memory errors: Increase memory allocation for SNVer/VarDict
Disk space: Monitor available disk space for intermediate files
Path issues: Use absolute paths for input files

Debug Mode

nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --debug \
    --log_level DEBUG

Debug mode preserves all intermediate files for analysis.

Cleanup Options

# Disable cleanup for debugging
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --enable_sample_cleanup false \
    --debug

# Custom cleanup configuration
nextflow run main.nf \
    --reference_genome /path/to/reference.fasta \
    --reference_index /path/to/reference_index \
    --samples_tsv /path/to/samples.tsv \
    --cleanup_intermediate_bam false \
    --cleanup_intermediate_bcf true

Citation

APA Style:
Ermolaev, A. (2025). ChoCallate: Consensus variant calling pipeline [Computer software]. GitHub. https://github.com/alermol/ChoCallate

BibTeX:

@software{ChoCallate,
  author = {Ermolaev, A.},
  title = {ChoCallate: Consensus variant calling pipeline},
  url = {https://github.com/alermol/ChoCallate},
  year = {2025}
}

Development Roadmap

ChoCallate is actively developed with a clear vision for future enhancements. Here's a roadmap for upcoming versions:

Add New Germline Variant Callers
Add New Short Read Mapping Tools
Add Somatic Variant Callers
Add Long-Read Variant Callers
Add Long-Read Mapping Tools
Add AI-Powered Features
- ML-based automatic consensus generation
- AI-powered variant quality assessment
Add Containerized Solution

Development Priorities

Performance Optimization: Implement advanced strategies to significantly reduce pipeline runtime
Error Handling: Improved error recovery and user feedback
New Variant Callers: Integration of cutting-edge tools
Quality Metrics: Enhanced quality assessment and reporting
Format Support: Additional input/output format compatibility

Contributing to Development

We welcome contributions from the community! Here's how you can help:

Development Areas

Core Pipeline: Nextflow workflow optimization
Variant Callers: Integration of new variant calling tools
Consensus Algorithms: Improved consensus generation methods
Quality Control: Enhanced quality assessment tools
Documentation: User guides and technical documentation

Getting Started

Fork the repository
Create a feature branch
Implement your changes
Add documentation
Submit a pull request

License

MIT License - see LICENSE file for details.

Need help? Open an issue on GitHub or check our troubleshooting guide above.
