Extract sequences from BAM, PAF, or FASTA files using BED coordinates. A fast, reliable tool for sequence extraction that handles structural variants, insertions, and complex alignments.
bedpull extracts sequences from alignment files (BAM) or assemblies (FASTA via PAF alignments) based on BED region coordinates. Unlike traditional coordinate lifting tools like liftOver, bedpull uses CIGAR-aware extraction to correctly handle:
- Large insertions and deletions
- Structural variants
- Complex rearrangements
- Phased haplotype assemblies (coming soon!)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
git clone https://github.com/Psy-Fer/bedpull
cd bedpull
cargo build --release
./target/release/bedpull --helpAdd /<top dir>/bedpull/target/release/ to your $PATH for easy use of the bedpull binary
rustup target add x86_64-unknown-linux-musl
RUSTFLAGS='-C link-arg=-s'
cargo build --release --target x86_64-unknown-linux-musl
./target/x86_64-unknown-linux-musl/release/bedpull --help
You can then copy that binary across and use it.
- Rust 1.70 or higher
- For BAM extraction: indexed BAM file (.bai)
- For PAF extraction: indexed FASTA file (.fai)
bedpull --bam alignments.bam \
--bed regions.bed \
--output sequences.fastabedpull --paf assembly_to_reference.paf \
--query_ref assembly.fasta \
--bed regions.bed \
--output sequences.fastaOptions:
-b, --bam <FILE> Input BAM file (for extracting aligned reads)
--paf <FILE> Input PAF file (for assembly-to-reference extraction)
-q, --query_ref <FILE> Query FASTA file (required with --paf)
-r, --bed <FILE> BED file with regions to extract
-o, --output <FILE> Output file (fasta/fastq)
--fastq Output FASTQ format (BAM only)
--mapq <INT> Minimum mapping quality [default: 0]
-h, --help Print help
-V, --version Print version
Extract repeat regions from phased assemblies to create ground truth genotypes for benchmarking:
# Extract STR loci from HG002 paternal haplotype
bedpull --paf hg002pat_to_hs1.paf \
--query_ref hg002_paternal.fasta \
--bed clinical_str_sites.bed \
--output hg002_str_sequences.fastaCorrectly extract sequences spanning large insertions or deletions:
Example: RFC1 locus with 520bp insertion
Using liftover to get regions
liftOver ./rfc1.bed hg002pat_to_hs1.chain hg002_paternal_regions_rfc1.bed unmapped_pat_rfc1.bed
Results:
Reference (hs1): chr4:39318077-39318136 (59 bp)
HG002 paternal (liftover): chr4_PATERNAL:39438551-39438610 (59bp)
HG002 paternal (bedpull): chr4_PATERNAL:39438031-39438610 (579 bp)
Insertion captured by bedpull: 520 bp
liftover misses the 520bp insertion, but bedpull picks it up
Extract similar regions from maternal and paternal haplotypes:
# Maternal haplotype
bedpull --paf hg002mat_to_ref.paf \
--query_ref hg002_maternal.fasta \
--bed regions.bed \
--output maternal_sequences.fasta
# Paternal haplotype
bedpull --paf hg002pat_to_ref.paf \
--query_ref hg002_paternal.fasta \
--bed regions.bed \
--output paternal_sequences.fasta- For each BED region, find overlapping alignments
- Use CIGAR string to calculate exact query positions
- Extract the aligned portion of the read sequence
- Handles insertions, deletions, and clipping
- Builds an index of the PAF file (
*.paf.idx) - For each BED region, query the index for overlapping alignments
- Use CIGAR string to calculate exact query positions
- Extract sequence from the query FASTA file using calculated positions
- Return sequences with both reference and query coordinates in header
- Write bed file with query coordinates
Traditional coordinate conversion tools (like liftOver) fail when:
- Large insertions exist in one assembly
- Structural rearrangements disrupt alignment impacting chain file generation
- Multiple alignment blocks complicate coordinate mapping
bedpull solves this by:
- parsing CIGAR operations of full alignments to get exact coordinates
Standard 3-column BED format (0-based):
chr1 1000 2000
chr2 5000 5500
Optional 4th column for region names:
chr1 1000 2000 region1
chr4 39318077 39318136 RFC1
Must be coordinate-sorted and indexed (.bai file in same directory)
samtools view -bS example.sam | samtools sort -o example.bam
samtools index example.bam
Standard PAF format from minimap2 or similar aligners. Must include CIGAR string (cg:Z: tag).
Example alignment command:
minimap2 -cx asm5 --cs=long -t 16 reference.fasta query.fasta > alignment.pafMust be indexed (.fai file). Create index with:
samtools faidx assembly.fastaBAM extraction:
>read_name|reference_region|alignment_info
PAF extraction:
>query_name|ref_region:start-end|query_region:start-end
>chr4_PATERNAL|ref_RFC1:39318077-39318136|query_chr4_PATERNAL:39438031-39438610
- map quality filtering
- fastq mean qscore filtering (for the subqual string)
- secondary and sup alignment handling
- phased haplotype handling
- make a consensus from reads of same region/haplotype
- minimum number of reads to use for consensus building
- reorganisation of some functions to house similar things together (paf and bam are a shambles)
- better handle 4th column in bed file for naming
- rethink header structure, make some standard spec for it across all outputs
- add threading for very large bams/pafs
- add more checks for files, write permissions, arg combination limits (like --fastq only with --bam), etc
- add some example files for people to test/play with
If you use bedpull in your research, please cite:
GitHub: https://github.com/Psy-Fer/bedpull
MIT License - see LICENSE file for details
James Ferguson (@Psy-Fer)
Garvan Institute of Medical Research
Issues and pull requests welcome!
- bladerunner - STR detection and genotyping
- samtools - SAM/BAM manipulation
- bedtools - Genome arithmetic
- minimap2 - Sequence alignment
