Skip to content

Subsetting sequences by coordinates #24

@bschilder

Description

@bschilder

Hello!

I'm trying to use GVL to efficiently construct personalized coding region sequences, and then translate them to amino acids.
I've put together a use case and was hoping you could confirm whether I'm doing this correctly.

Partially related to #19

Reprex

Supporting files

BED file with coding transcript regions:
mane_transcripts.txt

Prepare input data

import pandas as pd
import polars as pl
import genvarloader as gvl
import seqpro as sp
import pooch
import os

# Reference
reference = pooch.retrieve(
        url="https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz",
        known_hash="sha256:974f97ac8ef7ffae971b63b47608feda327403be40c27e391ee4a1a78b800df5",
        progressbar=True,
    )
bgzip_exec = "bgzip"
os.system(f"gzip -dc {reference} | {bgzip_exec} > {reference[:-3]}.bgz")
reference = reference[:-3] + ".bgz"

# Variants
variants = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.pgen",
    known_hash="md5:31aba970e35f816701b2b99118dfc2aa",
    progressbar=True,
    fname="1kGP.chr22.pgen",
)

# BED
bed_path = "mane_transcripts.txt"
bed = gvl.read_bedlike(bed_path)
bed

Screenshot 2025-01-09 at 11 16 21 AM

Construct database

chrom = "chr22"
save_path = f"gvl/{chrom}.gvl"
bed_chr = bed.filter(pl.col("chrom")==chrom.replace("chr",""))
if not os.path.exists(save_path):
    print("Creating GVL database ==>",save_path)
    gvl.write(
        path=save_path,
        bed=bed_chr,
        variants=variants,#vcf_dict[chrom],
        length=2**15, # <-- required to select sequence subsets afterwards
        max_mem=16*2**30,
        overwrite=False
    )
print("Loading GVL database")
ds = gvl.Dataset.open(save_path, 
                       reference=reference)
ds

Get coordinates for transcript

The point of confusion of me is that get_bed() returns a list of regions as defined by the window size (length arg in gvl.write). I was expecting this function to return the BED file that was my original input. As far as I can tell, the input BED file is not recorded anywhere within the GVL database. Nor can I find a mapping between which windows/regions contain each of the ranges defined within my input BED file. Some utility functions to make this easy would super helpful, but here's my current workaround.

Get the ranges for one transcript from my input BED file:

# [Region, Sample, Phase]
region_idx = 0
sample_idx = 1 
bed_tx = bed_chr[0]
print(bed_tx)

Screenshot 2025-01-09 at 11 25 57 AM

Merge the original input BED file (transcript ranges) with the database BED file (windows/regions), so we know which regions to extract when trying to query a specific transcript. In this example, only one region needs to be extracted to cover the whole transcript, but in theory a sufficiently large transcript could span multiple regions (or one that happens to span the border of two windows).

regions = ds.get_bed().with_row_index(name="region_idx")
regions_tx = regions.join(bed_tx, how="cross", suffix="_tx")
print(regions_tx)
regions_tx = regions_tx.filter(
    (pl.col("chromStart") <= pl.col("chromEnd_tx")) & 
    (pl.col("chromEnd") >= pl.col("chromStart_tx"))
)
print(regions_tx)

Screenshot 2025-01-09 at 11 26 53 AM

Extract the sequence subset

First we grab the sequence data from the entire window that contains our transcript:

seq_block = ds.isel(regions=regions_tx['region_idx'], samples=0)
seq_block

Here we want to grab the subset of the sequence specified by the input BED file, i.e. the transcript's sequence.
We need to get the offset from the window's start/end positions so we can extract our relevant subset sequence.

tx_len = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart_tx'][0]
tx_start = regions_tx['chromStart_tx'][0] - regions_tx['chromStart'][0]
tx_end = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart'][0] 
tx_start, tx_end, tx_len

Screenshot 2025-01-09 at 11 31 25 AM

Next we extract the sequence and confirm it is indeed the correct length.

seqs = seq_block[:,:,tx_start:tx_end]
print(seqs.shape)
print("sequence length matches:",seqs.shape[2]==tx_len)
seqs

Screenshot 2025-01-09 at 11 33 07 AM

Splicing

Now we need to subset the sequence to assemble the transcript isoform (excluding introns and unused exons).

I use the pysensembl package to extract these coordinates:

def get_db(release=111, species="homo_sapiens"):
    from pyensembl import EnsemblRelease
    db = EnsemblRelease(release=release, species=species) 
    return db
db = get_db()
transcript_id = regions_tx['name'][0].split(".")[0]
tx = db.transcript_by_id(transcript_id)

Now we can subset the full transcript sequence to get the spliced sequence.

We can confirm that it has valid codons (total length divisible by 3).

start = tx.first_start_codon_spliced_offset
end = tx.last_stop_codon_spliced_offset
print(start, end)
seqs_spliced = seqs[:,:,start:end+1]
print(seqs_spliced.shape)
print("Sequence divisible by 3:",seqs_spliced.shape[2]%3==0)
seqs_spliced

Screenshot 2025-01-09 at 11 55 35 AM

We can also confirm that the spliced sequence is the same length as the original sequence, which pysensembl conveniently provides for us. Though this may not always be the case when the sample contains indels that change the length.
and that the resulting spliced sequence is of the same length of the reference spliced sequence

print(len(tx.coding_sequence), seqs_spliced.shape[2])
print("Sequence is the expected length:", seqs_spliced.shape[2]==len(tx.coding_sequence))

Screenshot 2025-01-09 at 11 44 39 AM

Translating to amino acids

So far things are working as expected. But here's where I'm getting a bit confused. I can't seem to find the docs for seqpro so I'm kind of guessing how this is supposed to work:

Feeding in sequences from both haplotypes, with the original object dimensions (3)

sp.AA.translate(seqs_spliced, length_axis=-1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[70], line 1
----> 1 sp.AA.translate(seqs_spliced, length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)

Feeding in a single sequence:

sp.AA.translate(seqs_spliced[0], length_axis=-1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[65], line 1
----> 1 sp.AA.translate(seqs_spliced[0], length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)

We also would like to translate the pysensembl sequence and compare the result with the AA sequence produced using their internal functions.

ref_aa_pyensembl = tx.protein_sequence
print(ref_aa_pyensembl)
ref_aa_gvl = sp.AA.translate(tx.coding_sequence, length_axis=-1)
MVEMLPTAILLVLAVSVVAKDNATCDGPCGLRFRQNPQGGVRIVGGKAAQHGAWPWMVSLQIFTYNSHRYHTCGGSLLNSRWVLTAAHCFVGKNNVHDWRLVFGAKEITYGNNKPVKAPLQERYVEKIIIHEKYNSATEGNDIALVEITPPISCGRFIGPGCLPHFKAGLPRGSQSCWVAGWGYIEEKAPRPSSILMEARVDLIDLDLCNSTQWYNGRVQPTNVCAGYPVGKIDTCQGDSGGPLMCKDSKESAYVVVGITSWGVGCARAKRPGIYTATWPYLNWIASKIGSNALRMIQSATPPPPTTRPPPIRPPFSHPISAHLPWYFQPPPRPLPPRPPAAQPRPPPSPPPPPPPPASPLPPPPPPPPPTPSSTTKLPQGLSFAKRLQQLIEVLKGKTYSDGKNHYDMETTELPELTSTS
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[72], line 3
      1 ref_aa_pyensembl = tx.protein_sequence
      2 print(ref_aa_pyensembl)
----> 3 sp.AA.translate(tx.coding_sequence, length_axis=-1)

File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")

ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)

Thanks in advance for your help!

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions