at-cg/HALE

Introduction

HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool designed for long reads. It supports correction of both ONT Simplex and PacBio HiFi reads. The algorithm implemented in HALE builds on the Minimum Error Correction (MEC) optimization framework, commonly used in read phasing tools such as WhatsHap.

Dependencies

  • Linux OS (tested on CentOS 8 and Ubuntu 22.04)

  • rustup (see the official rustup installation instructions)

  • Python 3.1 or above and conda for data preprocessing

  • Make sure the following system packages are installed (Linux):

    • build-essential, autoconf, libtool, pkg-config, time
      • To install these on Ubuntu/Debian, run: apt-get install build-essential autoconf libtool pkg-config time

    Note: If you're using a Linux system, there's a good chance these system packages are already installed, especially if development tools have been previously set up.
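
A quick way to confirm the prerequisites before building is to check that each tool is on the PATH. The sketch below is illustrative only: check_tools is a hypothetical helper (not part of HALE), and the tool names are inferred from the dependency list above, with gcc and make standing in for build-essential and libtoolize standing in for libtool.

```shell
#!/bin/sh
# Hypothetical helper (not part of HALE): report which tools are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names are assumptions based on the dependency list:
# gcc and make ship with build-essential, libtoolize with libtool.
check_tools cargo conda gcc make autoconf libtoolize pkg-config \
    || echo "Install the tools reported above before building HALE."
```

If nothing is reported as missing, the build and conda steps should at least find the programs they need.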

Try HALE on Small Test Data

The entire test workflow below will take about 5-6 minutes. Users can either run the commands one by one or copy the commands into an executable script.

# Install HALE 
git clone https://github.com/at-cg/HALE.git
cd HALE
cargo build -q --release

# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale

mkdir -p test_run && cd test_run/

# download small test dataset
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz

# Run hale correction
../target/release/hale correct --reads chr11-2M.fa.gz --threads 8 --depth 40 --ploidy 2 --tech hifi

Note: If you encounter compilation issues, try deactivating the conda environment (conda deactivate) before running cargo build --release.

Visualize Corrections

After the test run above, users can visualize the error reduction by aligning both the raw reads and the corrected reads to the HG002 diploid genome assembly, then loading the resulting alignments (BAM) in IGV.

IGV comparison of raw vs corrected reads

Figure: Read alignments indicating the reduction in sequencing errors (colored bars) after HALE correction.
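
One possible way to produce the BAM files for IGV is sketched below, using minimap2 and samtools from the conda environment. This is a sketch, not the authors' exact procedure: align_for_igv is a hypothetical helper, all file names are placeholders, and the map-hifi preset assumes HiFi reads (map-ont would be the usual choice for ONT reads). Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: align reads to a reference and write a sorted, indexed BAM for IGV.
# align_for_igv is a hypothetical helper; file names are placeholders.
# Usage: align_for_igv <reference.fa> <reads> <minimap2 preset> <out.bam>
align_for_igv() {
    ref=$1; reads=$2; preset=$3; out=$4
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the commands only, without running them.
        echo "minimap2 -ax $preset -t 8 $ref $reads | samtools sort -o $out -"
        echo "samtools index $out"
        return 0
    fi
    minimap2 -ax "$preset" -t 8 "$ref" "$reads" | samtools sort -o "$out" -
    samtools index "$out"
}

# Example invocations (placeholder file names, adjust to your data):
# align_for_igv hg002_diploid.fa chr11-2M.fa.gz map-hifi raw.bam
# align_for_igv hg002_diploid.fa corrected_reads.fastq.gz map-hifi corrected.bam
```

Loading raw.bam and corrected.bam as two tracks in IGV then allows a side-by-side comparison of mismatches at the same locus.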

Installation

  1. Clone the repository:
git clone https://github.com/at-cg/HALE.git
  2. Compile the source code:
cd HALE
cargo build -q --release
  3. Create and activate the conda environment:
conda env create --file scripts/hale-env.yml
conda activate hale

Usage

# Print the help page
$ /path/to/HALE/target/release/hale -h
HALE — Haplotype-Aware Long-read Error Correction

Usage:
  hale correct [OPTIONS]

Required options:
  --reads <file>        Input fastq or fastq.gz
  --threads <int>       Number of threads
  --depth <int>         Sequencing depth
  --ploidy <int>        Genome ploidy
  --tech <ont|hifi>     Sequencing technology

Example:
  hale correct \
      --reads /path/to/sample.fastq.gz \
      --threads 32 \
      --depth 60 \
      --ploidy 2 \
      --tech ont

Output file : hale_corrected_sample.fastq.gz will be created in the same directory.

Note:

  • The --depth flag specifies the sequencing depth of the dataset (default: 60x).
  • The --ploidy flag specifies the ploidy of the genome (default: 2).
  • Temporary alignment files are removed automatically upon successful completion.
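
If the sequencing depth of a dataset is not known, it is commonly approximated as total read bases divided by genome size. The sketch below assumes this standard definition; the source does not state how HALE expects the value to be derived, and estimate_depth is a hypothetical helper rather than a HALE command.

```shell
#!/bin/sh
# Sketch: estimate sequencing depth as total read bases / genome size,
# rounded to the nearest integer. estimate_depth is a hypothetical helper.
estimate_depth() {
    total_bases=$1
    genome_size=$2
    awk -v b="$total_bases" -v g="$genome_size" \
        'BEGIN { printf "%d\n", b / g + 0.5 }'
}

# Example: 186 Gbp of reads over a 3.1 Gbp genome.
estimate_depth 186000000000 3100000000   # prints 60

# The total base count can be taken from seqkit (installed in the conda env),
# where sum_len is the fifth tab-separated column of the stats table:
# seqkit stats -T reads.fastq.gz | awk -F '\t' 'NR == 2 { print $5 }'
```

The resulting integer can then be passed to --depth.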

Implementation Notes:

Automatic pipeline selection

HALE automatically selects the appropriate correction pipeline based on the sequencing technology:

  • PacBio HiFi

    • Single round of all-vs-all overlap
    • One HALE correction step
  • ONT Simplex

    • Three rounds of all-vs-all overlap
    • Two pre-correction rounds (pih mode)
    • One final correction round (hale mode)

No additional flags are required beyond --tech.
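
The selection logic described above can be pictured as a simple dispatch on the --tech value. The sketch below only prints the steps for each technology; it is not HALE's source code, pipeline_plan is a hypothetical name, and the interleaving of overlap and correction rounds for ONT is an assumption (the list above gives only the number of rounds).

```shell
#!/bin/sh
# Illustrative only: print the steps listed above for each --tech value.
# Not HALE's implementation; the ONT step ordering is an assumption.
pipeline_plan() {
    case "$1" in
        hifi)
            echo "overlap round 1"
            echo "correction (hale mode)"
            ;;
        ont)
            for round in 1 2; do
                echo "overlap round $round"
                echo "pre-correction round $round (pih mode)"
            done
            echo "overlap round 3"
            echo "final correction (hale mode)"
            ;;
        *)
            echo "unknown technology: $1" >&2
            return 1
            ;;
    esac
}

pipeline_plan ont
```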

Other dependencies (handled via conda)

The conda environment installs:

  • minimap2
  • seqkit
  • samtools
  • Python dependencies for preprocessing

Acknowledgement

HALE uses components of HERRO, developed by Dominik Stanojevic, for preprocessing reads (e.g., computing all-vs-all read overlaps and multiple sequence alignments).
