HaploHyped VarAwareML

High-performance genomic data processing pipeline for machine learning. Converts VCF files to optimized HDF5 format with C++ acceleration and Blosc2 compression.

Python 3.8+ · MIT License

Features

  • 559K variants/sec - C++ VCF parsing with vcfpp
  • 6.5x compression - Blosc2 with LZ4/Zstandard
  • PyTorch integration - Custom Dataset classes with on-the-fly haplotype encoding (sketched below)
  • GPU support - CUDA-accelerated processing
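
On-the-fly haplotype encoding (third bullet) amounts to turning each haplotype sequence into a numeric tensor. Below is a minimal sketch of the idea, assuming a simple A/C/G/T one-hot scheme; the channel order and the handling of unknown bases are illustrative assumptions, not the package's actual implementation.

import numpy as np

# Minimal one-hot sketch: A/C/G/T -> 4 channels; unknown bases stay all-zero.
# Channel order and N handling are assumptions for illustration only.
BASE_TO_IDX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seq):
    out = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        idx = BASE_TO_IDX.get(base)
        if idx is not None:
            out[idx, i] = 1.0
    return out

hap = one_hot("ACGT" * 250)  # shape (4, 1000), matching seq_length=1000 below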

Installation

Prerequisites

  • Linux (tested on Ubuntu 20.04+)
  • Conda or Mamba
  • GCC with C++11 support

Setup

# Clone repository
git clone https://github.com/Jaureguy760/HaploHyped-VarAwareML.git
cd HaploHyped-VarAwareML

# Create conda environment
conda env create -f environment.yml
conda activate HaploHyped-VarAwareML

# Build C++ components and install
chmod +x build.sh
./build.sh

# Verify installation
pytest tests/ -v

Manual Installation

If you prefer manual setup:

# Create environment
conda env create -f environment.yml
conda activate HaploHyped-VarAwareML

# Build C++ VCF parser
cd cpp
mkdir -p build && cd build
cmake .. -DCMAKE_PREFIX_PATH="$CONDA_PREFIX"
make -j$(nproc)
cd ../..

# Build Python bindings
cd cpp
g++ -O3 -Wall -shared -fPIC -std=c++11 \
    $(python3 -m pybind11 --includes) parse_vcf.cpp \
    -o parse_vcf$(python3-config --extension-suffix) \
    -I"${CONDA_PREFIX}/include" \
    -L"${CONDA_PREFIX}/lib" \
    -lhts -lz -Wl,-rpath,"${CONDA_PREFIX}/lib"
cd ..

# Install package
pip install -e .
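
To confirm the bindings built correctly, try importing the extension. The module name parse_vcf comes from the -o flag in the g++ step above; the .so is written into cpp/, so run this from that directory (or wherever the extension ends up on your path after pip install -e .).

# Smoke test for the compiled extension built by the g++ command above.
import parse_vcf
print(parse_vcf.__file__)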

Usage

VCF to HDF5 Conversion

vcf_to_h5 \
    --cohort_name my_study \
    --vcf /path/to/vcf_files \
    --outdir /path/to/output \
    --sample_list samples.txt \
    --cores 10
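
To sanity-check a conversion, you can walk the output HDF5 hierarchy with h5py. The filename below is a guess derived from --cohort_name; check your --outdir for the actual name, and note that the dataset layout is whatever vcf_to_h5 writes.

import h5py

# 'my_study.h5' is a guessed filename based on --cohort_name my_study.
with h5py.File('/path/to/output/my_study.h5', 'r') as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))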

Reference Genome Encoding

fasta_encoder \
    --fasta reference.fasta \
    --outdir /path/to/output \
    --cores 22
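
The resulting reference HDF5 (reference_genome.h5 in the API example below) can be inspected the same way as the cohort file shown above.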

Python API

from datasets import RandomHaplotypeDataset
from torch.utils.data import DataLoader

dataset = RandomHaplotypeDataset(
    bed_file='regions.bed',
    hdf5_genotype_file='cohort.h5',
    hdf5_reference_file='reference_genome.h5',
    samples_file='samples.txt',
    seq_length=1000
)

dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

for hap1, hap2 in dataloader:        # paired phased-haplotype tensors per batch
    predictions = model(hap1, hap2)  # model: your own two-haplotype network
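
To use the GPU support mentioned under Features, move the model and each batch onto the CUDA device inside the loop. TwoHapModel below is a hypothetical stand-in for whatever two-haplotype network you actually train.

import torch

# Hypothetical stand-in for your two-haplotype network; replace with your model.
class TwoHapModel(torch.nn.Module):
    def forward(self, hap1, hap2):
        return hap1 + hap2  # placeholder op

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TwoHapModel().to(device)

for hap1, hap2 in dataloader:  # dataloader from the example above
    hap1, hap2 = hap1.to(device), hap2.to(device)
    predictions = model(hap1, hap2)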

Performance

Operation      Speed
VCF Parsing    559K variants/sec
HDF5 Write     256K records/sec
HDF5 Read      342K records/sec
Compression    6.5x ratio

Benchmarks: a whole genome (~3M variants) parses in ~6 s and writes in ~12 s on an Intel Xeon with an NVMe SSD, consistent with the per-record rates above (3M / 559K ≈ 5.4 s parse; 3M / 256K ≈ 11.7 s write).

Architecture

VCF Input → C++ Parser (vcfpp) → NumPy Arrays → HDF5 (Blosc2) → PyTorch Dataset
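
The HDF5 (Blosc2) stage relies on the Blosc2 HDF5 filter. Here is a minimal sketch of writing a Blosc2-compressed dataset with h5py and the hdf5plugin package, assuming an int8 genotype matrix and the zstd codec; the pipeline's own writer, chunking, and dataset layout may differ.

import h5py
import hdf5plugin
import numpy as np

# Toy genotype matrix: 1000 variants x 64 samples x 2 haplotypes (int8).
geno = np.random.randint(0, 2, size=(1000, 64, 2), dtype=np.int8)

with h5py.File('demo.h5', 'w') as f:
    f.create_dataset(
        'genotypes', data=geno, chunks=True,
        **hdf5plugin.Blosc2(cname='zstd', clevel=5),  # LZ4 is the other codec noted above
    )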

Project Structure

├── cpp/              # C++ VCF parser (vcfpp, pybind11)
│   ├── parse_vcf.cpp
│   ├── vcfpp.h
│   └── CMakeLists.txt
├── src/
│   ├── haplohyped/   # Core package
│   └── datasets/     # PyTorch datasets
├── tests/            # Test suite
├── docs/             # Documentation
└── environment.yml   # Conda dependencies

Testing

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=html

# Integration tests only
pytest tests/ -m integration -v

Documentation

Full documentation lives in the docs/ directory.

License

MIT - see LICENSE
