Subsetting sequences by coordinates

Hello!

I'm trying to use GVL to efficiently construct personalized coding region sequences, and then translate them to amino acids.
I've put together a use case and was hoping you could confirm whether I'm doing this correctly.

Partially related to #19 

## Reprex

### Supporting files
BED file with coding transcript regions:
[mane_transcripts.txt](https://github.com/user-attachments/files/18364199/mane_transcripts.txt)

### Prepare input data

```python
import pandas as pd
import polars as pl
import genvarloader as gvl
import seqpro as sp
import pooch
import os

# Reference
reference = pooch.retrieve(
        url="https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz",
        known_hash="sha256:974f97ac8ef7ffae971b63b47608feda327403be40c27e391ee4a1a78b800df5",
        progressbar=True,
    )
bgzip_exec = "bgzip"
os.system(f"gzip -dc {reference} | {bgzip_exec} > {reference[:-3]}.bgz")
reference = reference[:-3] + ".bgz"

# Variants
variants = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.pgen",
    known_hash="md5:31aba970e35f816701b2b99118dfc2aa",
    progressbar=True,
    fname="1kGP.chr22.pgen",
)

# BED
bed_path = "mane_transcripts.txt"
bed = gvl.read_bedlike(bed_path)
bed
```
![Screenshot 2025-01-09 at 11 16 21 AM](https://github.com/user-attachments/assets/f79da36b-a654-4460-8093-bb29cbca34d8)

### Construct database
```python
chrom = "chr22"
save_path = f"gvl/{chrom}.gvl"
bed_chr = bed.filter(pl.col("chrom")==chrom.replace("chr",""))
if not os.path.exists(save_path):
    print("Creating GVL database ==>",save_path)
    gvl.write(
        path=save_path,
        bed=bed_chr,
        variants=variants,#vcf_dict[chrom],
        length=2**15, # <-- required to select sequence subsets afterwards
        max_mem=16*2**30,
        overwrite=False
    )
print("Loading GVL database")
ds = gvl.Dataset.open(save_path, 
                       reference=reference)
ds
```

### Get coordinates for transcript

The point of confusion of me is that `get_bed()` returns a list of regions as defined by the window size (`length` arg in `gvl.write`). I was expecting this function to return the BED file that was my original input. As far as I can tell, the input BED file is not recorded anywhere within the GVL database. Nor can I find a mapping between which windows/regions contain each of the ranges defined within my input BED file. Some utility functions to make this easy would super helpful, but here's my current workaround.

Get the ranges for one transcript from my input BED file:
```python
# [Region, Sample, Phase]
region_idx = 0
sample_idx = 1 
bed_tx = bed_chr[0]
print(bed_tx)
```
![Screenshot 2025-01-09 at 11 25 57 AM](https://github.com/user-attachments/assets/a5dced33-bba6-4589-9098-50a2b0e83d90)

Merge the original input BED file (transcript ranges) with the database BED file (windows/regions), so we know which regions to extract when trying to query a specific transcript. In this example, only one region needs to be extracted to cover the whole transcript, but in theory a sufficiently large transcript could span multiple regions (or one that happens to span the border of two windows).

```python
regions = ds.get_bed().with_row_index(name="region_idx")
regions_tx = regions.join(bed_tx, how="cross", suffix="_tx")
print(regions_tx)
regions_tx = regions_tx.filter(
    (pl.col("chromStart") <= pl.col("chromEnd_tx")) & 
    (pl.col("chromEnd") >= pl.col("chromStart_tx"))
)
print(regions_tx)
```
![Screenshot 2025-01-09 at 11 26 53 AM](https://github.com/user-attachments/assets/d049ccf3-f263-4abc-912c-57554371349c)

### Extract the sequence subset

First we grab the sequence data from the entire window that contains our transcript:

```python
seq_block = ds.isel(regions=regions_tx['region_idx'], samples=0)
seq_block
```

Here we want to grab the subset of the sequence specified by the input BED file, i.e. the transcript's sequence.
We need to get the offset from the window's start/end positions so we can extract our relevant subset sequence.
```python
tx_len = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart_tx'][0]
tx_start = regions_tx['chromStart_tx'][0] - regions_tx['chromStart'][0]
tx_end = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart'][0] 
tx_start, tx_end, tx_len
```
![Screenshot 2025-01-09 at 11 31 25 AM](https://github.com/user-attachments/assets/8f4228b8-b27f-4098-991e-adb86dc1416c)

Next we extract the sequence and confirm it is indeed the correct length.
```python
seqs = seq_block[:,:,tx_start:tx_end]
print(seqs.shape)
print("sequence length matches:",seqs.shape[2]==tx_len)
seqs
```
![Screenshot 2025-01-09 at 11 33 07 AM](https://github.com/user-attachments/assets/3faec62f-bca3-45b4-976d-dabc6b43d753)

### Splicing
Now we need to subset the sequence to assemble the transcript isoform (excluding introns and unused exons).

I use the [`pysensembl` package](https://github.com/openvax/pyensembl) to extract these coordinates:

```python
def get_db(release=111, species="homo_sapiens"):
    from pyensembl import EnsemblRelease
    db = EnsemblRelease(release=release, species=species) 
    return db
db = get_db()
transcript_id = regions_tx['name'][0].split(".")[0]
tx = db.transcript_by_id(transcript_id)
```

Now we can subset the full transcript sequence to get the spliced sequence. 

We can confirm that it has valid codons (total length divisible by 3).
```python
start = tx.first_start_codon_spliced_offset
end = tx.last_stop_codon_spliced_offset
print(start, end)
seqs_spliced = seqs[:,:,start:end+1]
print(seqs_spliced.shape)
print("Sequence divisible by 3:",seqs_spliced.shape[2]%3==0)
seqs_spliced
```
![Screenshot 2025-01-09 at 11 55 35 AM](https://github.com/user-attachments/assets/fb3ad0b3-9c8c-4023-a866-2cc25d51a01b)

We can also confirm that the spliced sequence is the same length as the original sequence, which `pysensembl` conveniently provides for us. Though this may not always be the case when the sample contains indels that change the length.
and that the resulting spliced sequence is of the same length of the reference spliced sequence 
```python
print(len(tx.coding_sequence), seqs_spliced.shape[2])
print("Sequence is the expected length:", seqs_spliced.shape[2]==len(tx.coding_sequence))
```
![Screenshot 2025-01-09 at 11 44 39 AM](https://github.com/user-attachments/assets/447e31bd-f8b6-4894-a30a-240d10791b3a)



### Translating to amino acids

So far things are working as expected. But here's where I'm getting a bit confused. I can't seem to find the docs for `seqpro` so I'm kind of guessing how this is supposed to work:

Feeding in sequences from both haplotypes, with the original object dimensions (3)
```python
sp.AA.translate(seqs_spliced, length_axis=-1)
```
```python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[70], line 1
----> 1 sp.AA.translate(seqs_spliced, length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)
```

Feeding in a single sequence:
```python
sp.AA.translate(seqs_spliced[0], length_axis=-1)
```
```python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[65], line 1
----> 1 sp.AA.translate(seqs_spliced[0], length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)
```

We also would like to translate the `pysensembl` sequence and compare the result with the AA sequence produced using their internal functions.
```python
ref_aa_pyensembl = tx.protein_sequence
print(ref_aa_pyensembl)
ref_aa_gvl = sp.AA.translate(tx.coding_sequence, length_axis=-1)
```

```python
MVEMLPTAILLVLAVSVVAKDNATCDGPCGLRFRQNPQGGVRIVGGKAAQHGAWPWMVSLQIFTYNSHRYHTCGGSLLNSRWVLTAAHCFVGKNNVHDWRLVFGAKEITYGNNKPVKAPLQERYVEKIIIHEKYNSATEGNDIALVEITPPISCGRFIGPGCLPHFKAGLPRGSQSCWVAGWGYIEEKAPRPSSILMEARVDLIDLDLCNSTQWYNGRVQPTNVCAGYPVGKIDTCQGDSGGPLMCKDSKESAYVVVGITSWGVGCARAKRPGIYTATWPYLNWIASKIGSNALRMIQSATPPPPTTRPPPIRPPFSHPISAHLPWYFQPPPRPLPPRPPAAQPRPPPSPPPPPPPPASPLPPPPPPPPPTPSSTTKLPQGLSFAKRLQQLIEVLKGKTYSDGKNHYDMETTELPELTSTS
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[72], line 3
      1 ref_aa_pyensembl = tx.protein_sequence
      2 print(ref_aa_pyensembl)
----> 3 sp.AA.translate(tx.coding_sequence, length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)
```


Thanks in advance for your help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Subsetting sequences by coordinates #24

Reprex

Supporting files

Prepare input data

Construct database

Get coordinates for transcript

Extract the sequence subset

Splicing

Translating to amino acids

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Subsetting sequences by coordinates #24

Description

Reprex

Supporting files

Prepare input data

Construct database

Get coordinates for transcript

Extract the sequence subset

Splicing

Translating to amino acids

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions