Description
Problem:
Currently, reading a single base from a pysam read (e.g. read.query_sequence[i]) makes pysam decode the entire read sequence first. For long reads (thousands of bases or more), this is slow when only one or a few bases are needed.
Why it's needed:
With long-read sequencing data, you often need only the base at a specific position on a read (e.g., inside a barcode or at a known variant position). Decoding the full sequence to get that one base wastes time and memory. Other libraries (like rust-htslib) can return a single base without decoding the rest of the read.
Proposed Solution:
Add a new method to AlignedSegment, like read.query_base(index). This method would:
Directly fetch and decode only the requested base from the raw BAM data.
Avoid the overhead of decoding and creating a Python string for the entire read.
Return the single base as a bytes object (e.g., b'A').
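As a rough illustration of the idea, here is a minimal pure-Python sketch of single-base decoding. It relies on the BAM layout from the SAM/BAM specification: two bases per byte, 4 bits each, first base in the high nibble, with codes indexing into "=ACMGRSVTWYHKDBN". The names query_base and pack_sequence are hypothetical and are not part of pysam. The sketch works on a plain bytes buffer rather than an AlignedSegment, and a real implementation would presumably live in pysam's Cython layer.

```python
# 4-bit base codes as defined by the SAM/BAM specification.
BAM_BASE_CODES = b"=ACMGRSVTWYHKDBN"


def pack_sequence(seq: bytes) -> bytes:
    """Hypothetical demo helper: pack bases into BAM's 4-bit layout."""
    packed = bytearray((len(seq) + 1) // 2)
    for i, base in enumerate(seq):
        code = BAM_BASE_CODES.index(base)
        # Even positions use the high nibble, odd positions the low nibble.
        shift = 4 if i % 2 == 0 else 0
        packed[i // 2] |= code << shift
    return bytes(packed)


def query_base(packed: bytes, length: int, index: int) -> bytes:
    """Hypothetical sketch: decode one base without decoding the whole read.

    packed -- the 4-bit encoded sequence bytes
    length -- number of bases in the read
    index  -- 0-based position; negative values count from the end
    """
    if index < 0:
        index += length
    if not 0 <= index < length:
        raise IndexError("base index out of range")
    byte = packed[index >> 1]
    code = (byte >> 4) if index % 2 == 0 else (byte & 0x0F)
    # Slice rather than index so the result is bytes, not int.
    return BAM_BASE_CODES[code:code + 1]


packed = pack_sequence(b"ACGTN")
first_base = query_base(packed, 5, 0)
last_base = query_base(packed, 5, -1)
```

The lookup touches a single byte, so its cost does not grow with read length, and no full-length string is allocated.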
Benefit:
Fetching a base would no longer cost time proportional to the read length. Tasks that only need a few specific bases from long reads would run much faster, which adds up to considerable time savings on large datasets.