This script analyzes Variant Call Format (VCF) files to identify genetic variants shared across multiple samples. Given a list of VCF paths, it reports the variants present in at least 10% through 100% of samples (in 10% increments), showing chromosome, position, alleles, sample count, and percentage. It processes genotype data to ensure accurate detection of variant presence.

gmboowa/shared_variant_analyzer

Shared Variant Analyzer

Shared Variant Analyzer is a Perl script that identifies genetic variants consistently present across multiple samples in cohort studies or population-level analyses. It processes Variant Call Format (VCF) files to detect single-nucleotide variants (SNVs) and indels shared at defined frequency thresholds (10% to 100%, in 10% increments). Unlike simple presence/absence tools, it parses genotypes: heterozygous calls (e.g., "0/1") count as carrying the variant, and missing data ("./." entries) are filtered out. The report lists each variant's position (chromosome, coordinate), its reference and alternate alleles, and both the absolute count and the percentage of samples containing it. This threshold-based approach helps researchers identify core genomic elements in pathogen populations, conserved mutations in study cohorts, or transmission clusters in outbreak investigations.
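The genotype-aware check described above can be illustrated with a short Python sketch. This is not the script's Perl code; the function name `carries_alt` is hypothetical, and the rule it applies (a sample carries a variant if its GT field contains at least one non-reference allele) is an assumption based on the description.

```python
import re

def carries_alt(sample_field: str) -> bool:
    """Return True if a VCF sample field has at least one non-reference allele.

    Hypothetical helper: assumes the GT value comes first in the sample
    field, as in "0/1:35:99", which is where VCF places it when present.
    """
    # Keep only the GT part, dropping any other FORMAT values.
    gt = sample_field.split(":", 1)[0]
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    # "." is a missing allele and "0" is the reference allele.
    return any(a not in (".", "0", "") for a in alleles)
```

Under this assumption, `"0/1"` and `"1|1"` count as present, while `"0/0"` and `"./."` do not.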

The script validates its input, verifying file paths and VCF integrity before analysis. Output is sorted by genomic position and prevalence, which eases downstream interpretation in tools such as Excel or R. It tallies variants with hash-based lookups, so it can handle large variant sets while keeping memory use low. Applications range from identifying vaccine targets in viral quasispecies to detecting founder mutations in genetic epidemiology studies. The command-line interface supports integration into automated pipelines, and the tab-separated output can be loaded directly into genomic databases or visualization platforms. The tool is particularly useful for studies that prioritize variants by how widely they are shared, bridging the gap between raw variant calling and population-level biological interpretation.
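The hash-based counting and threshold reporting can be sketched in Python as follows. This is an illustration only, not the script's implementation: the names `shared_variants` and `by_threshold` are hypothetical, variants are assumed to be keyed by (chromosome, position, ref, alt), and chromosomes are sorted as plain strings.

```python
from collections import Counter

def shared_variants(samples):
    """Count how many samples carry each variant.

    samples: dict mapping sample name -> set of (chrom, pos, ref, alt) keys.
    Returns rows of (chrom, pos, ref, alt, sample_count, percentage).
    """
    counts = Counter()
    for variants in samples.values():
        counts.update(set(variants))  # each variant counts once per sample
    total = len(samples)
    rows = [
        (chrom, pos, ref, alt, n, 100.0 * n / total)
        for (chrom, pos, ref, alt), n in counts.items()
    ]
    # Assumed ordering: chromosome, then position, then most prevalent first.
    rows.sort(key=lambda r: (r[0], r[1], -r[5]))
    return rows

def by_threshold(rows, thresholds=range(10, 101, 10)):
    """Group rows by each percentage threshold they meet or exceed."""
    return {t: [r for r in rows if r[5] >= t] for t in thresholds}
```

For example, with two samples where only one variant is present in both, that variant appears under every threshold up to 100, while a variant found in one sample appears only under thresholds up to 50.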

1. Features

  1. Multiple VCF file analysis

  2. Genotype-aware variant counting

  3. Threshold reporting (10-100% sample sharing)

  4. Comprehensive output with percentages

  5. Input validation and error checking

2. Requirements

  1. Perl 5.20+

  2. Perl modules: Getopt::Long, List::Util

3. Usage

perl shared_variant_analyzer.pl -i vcf_list.txt > variant_report.tsv

4. Input format


**vcf_list.txt**:


/path/to/sample1.vcf
/path/to/sample2.vcf
/path/to/sample3.vcf
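A list file in this format can be checked before analysis. The Python sketch below is illustrative and not the script's own validation: the helpers `read_vcf_list` and `looks_like_vcf` are hypothetical, and the sketch assumes one uncompressed VCF path per line. It relies on the VCF convention that the first line of a file begins with `##fileformat=VCF`.

```python
import os

def read_vcf_list(list_path):
    """Read one path per line, skipping blank lines.

    Returns (paths, missing), where missing holds paths that are not files.
    """
    paths = []
    with open(list_path) as handle:
        for line in handle:
            path = line.strip()
            if path:
                paths.append(path)
    missing = [p for p in paths if not os.path.isfile(p)]
    return paths, missing

def looks_like_vcf(path):
    """Check that the first line is a VCF fileformat header.

    Assumes a plain-text (not gzip-compressed) file.
    """
    with open(path) as handle:
        return handle.readline().startswith("##fileformat=VCF")
```

Reporting every missing path at once, rather than stopping at the first, lets a user fix the whole list in one pass.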


5. Output columns

**Chrom | Position | Ref | Alt | SampleCount | Percentage**

License

MIT License.
