Feature Request: Faster Single-Base Access for Long Reads

Problem:

Right now, getting a single base from a pysam read (e.g. read.query_sequence[i]) forces pysam to decode the entire read sequence into a Python string first. For very long reads (thousands of bases or more), this is slow when you only need one or a few bases.

Why it's needed:

With long-read sequencing data, you often want only the base at a specific position on a read (e.g., for a barcode or a known variant position). Decoding the whole sequence to get that one base wastes time and memory. Other tools (like rust-htslib) can fetch a single base much more efficiently.

Proposed Solution:

Add a new method to AlignedSegment, like read.query_base(index). This method would:

- Directly fetch and decode only the requested base from the raw BAM data.
- Avoid the overhead of decoding and creating a Python string for the entire read.
- Return the single base as a bytes object (e.g., b'A').

Benefit:

This would make tasks that only need a few specific bases from long reads much faster, saving considerable processing time for large datasets.
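
To illustrate what the proposed method could do internally, here is a rough pure-Python sketch of decoding one base from BAM's packed sequence field. BAM stores two bases per byte as 4-bit codes (high nibble first), so base i lives in byte i // 2. The names query_base and pack_sequence below are hypothetical helpers for this illustration, not existing pysam functions, and a real implementation would presumably live in pysam's Cython layer rather than in Python.

```python
# 4-bit base codes used by the BAM format (code 0 is '=', code 15 is 'N').
BAM_BASE_CODES = "=ACMGRSVTWYHKDBN"


def pack_sequence(seq: str) -> bytes:
    """Pack a base string the way BAM does: two 4-bit codes per byte.

    Only here so the sketch is self-contained; in practice the packed
    bytes would come straight from the BAM record.
    """
    codes = [BAM_BASE_CODES.index(base) for base in seq]
    if len(codes) % 2:
        codes.append(0)  # pad the low nibble of the last byte
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def query_base(packed: bytes, length: int, index: int) -> bytes:
    """Hypothetical: decode only the base at `index`, without touching the rest."""
    if index < 0:
        index += length  # allow negative indexing, like a Python sequence
    if not 0 <= index < length:
        raise IndexError("base index out of range")
    byte = packed[index >> 1]
    # Even positions sit in the high nibble, odd positions in the low nibble.
    code = byte & 0x0F if index & 1 else byte >> 4
    return BAM_BASE_CODES[code].encode("ascii")


packed = pack_sequence("ACGTN")
print(query_base(packed, 5, 3))
```

The lookup reads a single byte regardless of read length, which is the behaviour being asked for here.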

Feature Request: Faster Single-Base Access for Long Reads #1346

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature Request: Faster Single-Base Access for Long Reads #1346

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions