HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool designed for long reads. It supports correction of both ONT Simplex and PacBio HiFi reads. The algorithm implemented in HALE builds on the Minimum Error Correction (MEC) optimization framework, commonly used in read phasing tools such as WhatsHap.
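For readers unfamiliar with MEC, the standard diploid formulation from the phasing literature is sketched below. The symbols (fragment matrix M, haplotypes h_1 and h_2) are introduced here for illustration only; this is the textbook objective, and the exact variant optimized inside HALE may differ.

```latex
% M is an n x m fragment matrix: n reads, m variant sites,
% M_{ij} \in \{0, 1, -\}, where "-" means read i does not cover site j.
d(M_i, h) = \bigl|\{\, j : M_{ij} \neq -,\; M_{ij} \neq h_j \,\}\bigr|
\qquad
\mathrm{MEC}(M) = \min_{h_1, h_2 \in \{0,1\}^m} \sum_{i=1}^{n} \min\bigl\{ d(M_i, h_1),\, d(M_i, h_2) \bigr\}
```

Each read is assigned to the haplotype it disagrees with least, and the objective counts the entries that would have to be flipped to make every read consistent with its assigned haplotype.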
- Linux OS (tested on CentOS 8 and Ubuntu 22.04)
- Python 3.1 or above and conda for data preprocessing
- Make sure the following system packages are installed (Linux): build-essential, autoconf, libtool, pkg-config, time
  - To install these on Ubuntu/Debian, you can use the command:
    apt-get install build-essential autoconf libtool pkg-config time
Note: If you're using a Linux system, there's a good chance these system packages are already installed, especially if development tools have been previously set up.
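To check which of these packages are actually missing before installing anything, a small helper along the following lines may be useful. This is a minimal sketch that assumes a Debian/Ubuntu system where dpkg-query is available; the function name missing_packages is hypothetical and not part of HALE.

```shell
#!/usr/bin/env bash
# Print the name of every package in the argument list that is not installed.
# Prints nothing when all of them are present.
missing_packages() {
    local pkg
    for pkg in "$@"; do
        # dpkg-query reports "install ok installed" for fully installed packages;
        # anything else (including an unknown package) is treated as missing.
        if ! dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null \
                | grep -q "install ok installed"; then
            echo "$pkg"
        fi
    done
}
```

For example, `missing_packages build-essential autoconf libtool pkg-config time` lists only the packages that still need to be installed.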
The entire test workflow below will take about 5-6 minutes. Users can either run the commands one by one or copy the commands into an executable script.
# Install HALE
git clone https://github.com/at-cg/HALE.git
cd HALE
cargo build -q --release
# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale
mkdir -p test_run && cd test_run/
# download small test dataset
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
# Run hale correction
../target/release/hale correct --reads chr11-2M.fa.gz --threads 8 --depth 40 --ploidy 2 --tech hifi
Note: If you encounter compilation issues, try deactivating Conda (conda deactivate) before running cargo build --release.
After the above test run, users can also visualize the error reduction by aligning the raw reads and the corrected reads to the HG002 diploid genome assembly and then loading the alignments (BAM) in IGV.
Figure: Read alignments indicating the reduction in sequencing errors (colored bars) after HALE correction.
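The alignment step for this visualization could look roughly like the sketch below, which uses minimap2 and samtools from the conda environment. The helper names preset_for_tech and align_to_bam are hypothetical, the choice of minimap2 presets is an assumption rather than something HALE prescribes, and the map-hifi preset requires minimap2 2.19 or newer.

```shell
#!/usr/bin/env bash
# Map a HALE --tech value to a minimap2 preset (assumed mapping).
preset_for_tech() {
    case "$1" in
        hifi) echo "map-hifi" ;;
        ont)  echo "map-ont" ;;
        *)    echo "unknown tech: $1" >&2; return 1 ;;
    esac
}

# Align reads to a reference and write a sorted, indexed BAM for IGV.
# Arguments: <reference.fa> <reads> <out.bam> <tech> [threads]
align_to_bam() (
    # Runs in a subshell so that pipefail does not leak into the caller.
    set -o pipefail
    ref="$1"; reads="$2"; out="$3"; tech="$4"; threads="${5:-8}"
    preset="$(preset_for_tech "$tech")" || exit 1
    minimap2 -a -x "$preset" -t "$threads" "$ref" "$reads" \
        | samtools sort -@ "$threads" -o "$out" - \
        && samtools index "$out"
)
```

For the test run this might be invoked as `align_to_bam hg002.fa chr11-2M.fa.gz raw.bam hifi`, and once more with the corrected reads file written by HALE. Here hg002.fa is a placeholder path for the HG002 assembly.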
- Clone the repository:
git clone https://github.com/at-cg/HALE.git
- Compile the source code:
cd HALE
cargo build -q --release
- Create conda env:
conda env create --file scripts/hale-env.yml
conda activate hale
# Produce help page
$ /path/to/HALE/target/release/hale -h
HALE — Haplotype-Aware Long-read Error Correction
Usage:
hale correct [OPTIONS]
Required options:
--reads <file> Input fastq or fastq.gz
--threads <int> Number of threads
--depth <int> Sequencing depth
--ploidy <int> Genome ploidy
--tech <ont|hifi> Sequencing technology
Example:
hale correct \
--reads /path/to/sample.fastq.gz \
--threads 32 \
--depth 60 \
--ploidy 2 \
--tech ont
Output file: hale_corrected_sample.fastq.gz will be created in the same directory.
Note:
- Flag --depth represents the dataset depth (default 60x)
- Flag --ploidy represents the ploidy of the genome (default 2)
- Temporary alignment files are removed automatically upon successful completion.
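If the depth of a dataset is not known in advance, it can be approximated as total sequenced bases divided by genome size. The sketch below assumes that this is the quantity --depth expects, which the text above does not state explicitly. The helper names are hypothetical, and the genome size must be supplied by the user.

```shell
#!/usr/bin/env bash
# Estimate depth as total bases / genome size, rounded to the nearest integer.
estimate_depth() {
    local bases="$1" genome_size="$2"
    echo $(( (bases + genome_size / 2) / genome_size ))
}

# Sum of read lengths in a FASTA/FASTQ file (optionally gzipped), via seqkit.
# With -T, seqkit stats prints a tab-separated table whose fifth column is sum_len.
sum_read_lengths() {
    seqkit stats -T "$1" | awk -F'\t' 'NR == 2 { print $5 }'
}
```

For a human sample one might run `estimate_depth "$(sum_read_lengths sample.fastq.gz)" 3100000000`, where 3.1 Gbp is an assumed haploid genome size.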
HALE automatically selects the appropriate correction pipeline based on the sequencing technology:
- PacBio HiFi
  - Single round of all-vs-all overlap
  - One HALE correction step
- ONT Simplex
  - Three rounds of all-vs-all overlap
  - Two pre-correction rounds (pih mode)
  - One final correction round (hale mode)
No additional flags are required beyond --tech.
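The selection described above can be pictured as a simple dispatch on the --tech value. The following is only an illustrative sketch, not HALE's actual implementation: the function name pipeline_steps and the stage labels are hypothetical, and the interleaving of overlap and correction rounds for ONT is an assumption.

```shell
#!/usr/bin/env bash
# Print the stages for a given --tech value, one per line.
pipeline_steps() {
    case "$1" in
        hifi)
            echo "overlap"
            echo "correct:hale"
            ;;
        ont)
            # Assumed order: each overlap round feeds the correction round after it.
            echo "overlap"
            echo "correct:pih"
            echo "overlap"
            echo "correct:pih"
            echo "overlap"
            echo "correct:hale"
            ;;
        *)
            echo "unsupported --tech value: $1" >&2
            return 1
            ;;
    esac
}
```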
The conda environment installs:
- minimap2
- seqkit
- samtools
- Python dependencies for preprocessing
HALE uses components of HERRO, developed by Dominik Stanojevic, for preprocessing reads (e.g., computing all-vs-all read overlaps and multiple sequence alignment).
