Feature Request: Faster Single-Base Access for Long Reads #1346

Open
@Crispy13

Problem:

Right now, getting a single base from a pysam read (e.g., read.query_sequence[i]) makes pysam decode the entire read sequence into a Python string first. For very long reads (thousands of bases or more), this is slow if you only need one or a few bases.

Why it's needed:

When working with long-read sequencing data, if you just want a base at a specific spot on a read (e.g., for a barcode or a known variant position), decoding the whole lengthy sequence just for that one base wastes a lot of time and memory. Other tools (like rust-htslib) can grab just that one base much more efficiently.

Proposed Solution:

Add a new method to AlignedSegment, like read.query_base(index). This method would:

- Directly fetch and decode only the requested base from the raw BAM data.
- Avoid the overhead of decoding and creating a Python string for the entire read.
- Return the single base as a bytes object (e.g., b'A').

Benefit:

This would make tasks that only need a few specific bases from long reads much faster, saving considerable processing time for large datasets.
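
To illustrate the idea, here is a minimal pure-Python sketch of single-base decoding. It relies on the BAM format storing the sequence as 4-bit codes, two bases per byte with the earlier base in the high nibble, using the lookup table "=ACMGRSVTWYHKDBN". The functions `decode_base` and `pack_sequence` are hypothetical helpers written for this example; they are not part of pysam, and a real implementation would read the packed bytes from the underlying record rather than from a Python bytes object.

```python
# Hypothetical sketch, not pysam API: decode one base from a
# BAM-style 4-bit packed sequence without decoding the whole read.

# BAM nibble code -> base character (code 0 is '=', code 15 is 'N').
BAM_NIBBLE_TO_BASE = b"=ACMGRSVTWYHKDBN"


def decode_base(packed_seq: bytes, index: int, length: int) -> bytes:
    """Return the base at `index` as a 1-byte bytes object (e.g. b'A')."""
    if not 0 <= index < length:
        raise IndexError("base index out of range")
    byte = packed_seq[index >> 1]  # two bases are stored per byte
    # Even indices live in the high nibble, odd indices in the low nibble.
    nibble = byte >> 4 if index % 2 == 0 else byte & 0x0F
    return BAM_NIBBLE_TO_BASE[nibble:nibble + 1]


def pack_sequence(seq: str) -> bytes:
    """Pack a sequence into 4-bit codes (only used to build demo input)."""
    codes = [BAM_NIBBLE_TO_BASE.index(b) for b in seq.encode("ascii")]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nibble
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


packed = pack_sequence("ACGTN")
print(decode_base(packed, 3, 5))
```

The lookup touches a single byte regardless of read length, which is the property the proposed read.query_base(index) would need in order to beat read.query_sequence[i] on long reads.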
