A framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.

BiodataAnalysisGroup/synth4bench

 
 

Abstract

Somatic variant calling algorithms are essential for detecting genomic alterations associated with cancer. However, evaluating their performance can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic genomics data generation framework for benchmarking tumor-only somatic variant calling algorithms. We generated synthetic datasets based on the TP53 gene using the NEAT v3.3 simulator. Subsequently, we thoroughly evaluated the performance of variant calling algorithms including GATK-Mutect2, Freebayes, VarDict, VarScan, and LoFreq on these datasets, comparing results against the ground truth produced by NEAT. Synthetic datasets provide an excellent ground truth for studying the performance and behavior of somatic variant calling algorithms, enabling researchers to evaluate and improve their accuracy for cancer genomics applications.

Motivation

Variant calling plays a critical role in identifying genetic lesions. Low-frequency variants (allele frequency ≤10%) are particularly hard to identify, and the absence of ground truth datasets makes it difficult to benchmark callers on them reliably and consistently.

Description of Framework

[Figure: synth4bench schematic]

Our framework targets variant calling for low-frequency variants (≤10%), where the lack of ground truth datasets complicates benchmarking and evaluation. It aims to provide a reliable and consistent basis for assessing how well callers identify cancer-associated genomic alterations. To this end, the framework includes:
  1. Synthetic Data Generation: Using the NEAT v3.3 simulator, we generate synthetic genomics data that mimics real genome sequences, serving as a ground truth.

  2. Benchmarking Variant Callers: We evaluate five somatic variant callers (GATK-Mutect2, Freebayes, VarDict, VarScan2, and LoFreq) using these synthetic datasets.
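
To make the benchmarking idea concrete, the sketch below counts true positives, false positives, and false negatives for one caller against a ground truth VCF. It is an illustration under stated assumptions, not the comparison implemented in synth4bench: the function names `vcf_keys` and `compare_calls` are hypothetical, and a call is treated as correct only when CHROM, POS, REF, and ALT match the ground truth exactly (multi-allelic ALT fields are not split).

```shell
# Hypothetical helper: reduce a VCF to sorted, unique CHROM:POS:REF:ALT keys.
vcf_keys() {
    # Skip header lines (starting with '#'); VCF columns 1, 2, 4, 5 are
    # CHROM, POS, REF, ALT.
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Hypothetical helper: print "TP FP FN" for a ground truth VCF ($1)
# and a caller VCF ($2), using exact key matches (an assumption).
compare_calls() {
    truth_keys=$(mktemp)
    call_keys=$(mktemp)
    vcf_keys "$1" > "$truth_keys"
    vcf_keys "$2" > "$call_keys"
    # comm -12: keys in both files; -13: only in calls; -23: only in truth.
    tp=$(LC_ALL=C comm -12 "$truth_keys" "$call_keys" | wc -l | tr -d ' ')
    fp=$(LC_ALL=C comm -13 "$truth_keys" "$call_keys" | wc -l | tr -d ' ')
    fn=$(LC_ALL=C comm -23 "$truth_keys" "$call_keys" | wc -l | tr -d ' ')
    rm -f "$truth_keys" "$call_keys"
    echo "$tp $fp $fn"
}
```

A call such as `compare_calls ground_truth.vcf caller_output.vcf` (both file names are placeholders) prints the three counts on one line, from which recall and precision can be derived.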


Data Download

All data are openly available on Zenodo. For specific instructions, refer to our User Guide.


Installation

  1. Create the Conda environment:

    conda env create -f environment.yml
    conda activate synth4bench
  2. Install NEAT v3.3:

    Download NEAT version v3.3.
    To check that the main script runs:

    python gen_reads.py --help

    For further details, see the NEAT README included in the download.

  3. Install bam-readcount:

    Follow their installation instructions.
    After building, verify installation:

    build/bin/bam-readcount --help

    If you encounter issues during the make process, you can alternatively use the executable available here and place it in the bam-readcount/build/bin folder.

  4. Download the VarScan extra script:

    The extra script vscan_pileup2cns2vcf.py for VarScan is available here.
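
After completing the steps above, a quick pre-flight check can confirm that the required executables are reachable. The following is a minimal sketch, not part of synth4bench: `missing_tools` is a hypothetical helper, and which names or paths to pass depends on where you placed NEAT and bam-readcount.

```shell
# Hypothetical helper: report which of the given commands or executable
# paths cannot be found. Returns 0 if all are present, 1 otherwise.
missing_tools() {
    missing=0
    for tool in "$@"; do
        # Accept either a command on PATH or a path to an executable file.
        if ! command -v "$tool" >/dev/null 2>&1 && [ ! -x "$tool" ]; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `missing_tools conda python build/bin/bam-readcount` would list any of the three that are not available; the relative bam-readcount path assumes you run it from the bam-readcount source directory.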


Execution

Simply configure your parameters in the parameters.yaml file, then execute:

bash s4b_run.sh

This single command generates synthetic data, runs variant calling for all selected tools, and performs downstream analysis and plotting.
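
If you launch the pipeline from scripts, a small wrapper can catch a missing configuration file before anything runs. This is a sketch under assumptions: `run_s4b` is a hypothetical name, and it assumes that `parameters.yaml` and `s4b_run.sh` sit in the same directory.

```shell
# Hypothetical wrapper: run the pipeline from a given directory
# (default: current directory) after checking the expected files exist.
run_s4b() {
    workdir=${1:-.}
    for required in parameters.yaml s4b_run.sh; do
        if [ ! -f "$workdir/$required" ]; then
            echo "error: $workdir/$required not found" >&2
            return 1
        fi
    done
    # Use a subshell so the caller's working directory is left unchanged.
    ( cd "$workdir" && bash s4b_run.sh )
}
```

Calling `run_s4b /path/to/synth4bench` (a placeholder path) returns a non-zero status without starting the run if either file is absent.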

For full execution instructions, see our User Guide.


Documentation

For further documentation, visit the documentation page.


Contribute

We welcome and greatly appreciate any feedback or contributions!

If you have questions, please open an issue here or email sfragkoul@certh.gr.


Citation

Our work has been submitted to the bioRxiv preprint repository. If you use synth4bench, or any of our scripts/code, please cite:

S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.” 2024, doi:10.1101/2024.03.07.582313.


Related Publications

  • S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “synth4bench: Benchmarking Somatic Variant Callers – A Tale Unfolding In The Synthetic Genomics Feature Space”, 23rd European Conference On Computational Biology (ECCB24), Sep 2024, Turku, Finland, doi:10.5281/zenodo.14186509
  • S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “Exploring Somatic Variant Callers' Behavior: A Synthetic Genomics Feature Space Approach”, ELIXIR AHM24, Jun 2024, Uppsala, Sweden, doi:10.7490/f1000research.1119793.1
  • S.-C. Fragkouli, N. Pechlivanis, A. Orfanou, A. Anastasiadou, A. Agathangelidis, and F. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of somatic variant calling algorithms”, 17th Conference of Hellenic Society for Computational Biology and Bioinformatics (HSCBB), Oct 2023, Thessaloniki, Greece, doi:10.5281/zenodo.8432060
  • S.-C. Fragkouli, N. Pechlivanis, A. Agathangelidis, and F. Psomopoulos, “Synthetic Genomics Data Generation and Evaluation for the Use Case of Benchmarking Somatic Variant Calling Algorithms”, 31st Conference in Intelligent Systems For Molecular Biology and the 22nd European Conference On Computational Biology (ISMB-ECCB23), Jul 2023, Lyon, France, doi:10.7490/f1000research.1119575.1
