- 
                Notifications
    You must be signed in to change notification settings 
- Fork 6
Description
Hello!
I'm trying to use GVL to efficiently construct personalized coding region sequences, and then translate them to amino acids.
I've put together a use case and was hoping you could confirm whether I'm doing this correctly.
Partially related to #19
Reprex
Supporting files
BED file with coding transcript regions:
mane_transcripts.txt
Prepare input data
import pandas as pd
import polars as pl
import genvarloader as gvl
import seqpro as sp
import pooch
import os
# Reference
reference = pooch.retrieve(
        url="https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz",
        known_hash="sha256:974f97ac8ef7ffae971b63b47608feda327403be40c27e391ee4a1a78b800df5",
        progressbar=True,
    )
bgzip_exec = "bgzip"
os.system(f"gzip -dc {reference} | {bgzip_exec} > {reference[:-3]}.bgz")
reference = reference[:-3] + ".bgz"
# Variants
variants = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.pgen",
    known_hash="md5:31aba970e35f816701b2b99118dfc2aa",
    progressbar=True,
    fname="1kGP.chr22.pgen",
)
# BED
bed_path = "mane_transcripts.txt"
bed = gvl.read_bedlike(bed_path)
bedConstruct database
chrom = "chr22"
save_path = f"gvl/{chrom}.gvl"
bed_chr = bed.filter(pl.col("chrom")==chrom.replace("chr",""))
if not os.path.exists(save_path):
    print("Creating GVL database ==>",save_path)
    gvl.write(
        path=save_path,
        bed=bed_chr,
        variants=variants,#vcf_dict[chrom],
        length=2**15, # <-- required to select sequence subsets afterwards
        max_mem=16*2**30,
        overwrite=False
    )
print("Loading GVL database")
ds = gvl.Dataset.open(save_path, 
                       reference=reference)
dsGet coordinates for transcript
The point of confusion of me is that get_bed() returns a list of regions as defined by the window size (length arg in gvl.write). I was expecting this function to return the BED file that was my original input. As far as I can tell, the input BED file is not recorded anywhere within the GVL database. Nor can I find a mapping between which windows/regions contain each of the ranges defined within my input BED file. Some utility functions to make this easy would super helpful, but here's my current workaround.
Get the ranges for one transcript from my input BED file:
# [Region, Sample, Phase]
region_idx = 0
sample_idx = 1 
bed_tx = bed_chr[0]
print(bed_tx)Merge the original input BED file (transcript ranges) with the database BED file (windows/regions), so we know which regions to extract when trying to query a specific transcript. In this example, only one region needs to be extracted to cover the whole transcript, but in theory a sufficiently large transcript could span multiple regions (or one that happens to span the border of two windows).
regions = ds.get_bed().with_row_index(name="region_idx")
regions_tx = regions.join(bed_tx, how="cross", suffix="_tx")
print(regions_tx)
regions_tx = regions_tx.filter(
    (pl.col("chromStart") <= pl.col("chromEnd_tx")) & 
    (pl.col("chromEnd") >= pl.col("chromStart_tx"))
)
print(regions_tx)Extract the sequence subset
First we grab the sequence data from the entire window that contains our transcript:
seq_block = ds.isel(regions=regions_tx['region_idx'], samples=0)
seq_blockHere we want to grab the subset of the sequence specified by the input BED file, i.e. the transcript's sequence.
We need to get the offset from the window's start/end positions so we can extract our relevant subset sequence.
tx_len = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart_tx'][0]
tx_start = regions_tx['chromStart_tx'][0] - regions_tx['chromStart'][0]
tx_end = regions_tx['chromEnd_tx'][0] - regions_tx['chromStart'][0] 
tx_start, tx_end, tx_lenNext we extract the sequence and confirm it is indeed the correct length.
seqs = seq_block[:,:,tx_start:tx_end]
print(seqs.shape)
print("sequence length matches:",seqs.shape[2]==tx_len)
seqsSplicing
Now we need to subset the sequence to assemble the transcript isoform (excluding introns and unused exons).
I use the pysensembl package to extract these coordinates:
def get_db(release=111, species="homo_sapiens"):
    from pyensembl import EnsemblRelease
    db = EnsemblRelease(release=release, species=species) 
    return db
db = get_db()
transcript_id = regions_tx['name'][0].split(".")[0]
tx = db.transcript_by_id(transcript_id)Now we can subset the full transcript sequence to get the spliced sequence.
We can confirm that it has valid codons (total length divisible by 3).
start = tx.first_start_codon_spliced_offset
end = tx.last_stop_codon_spliced_offset
print(start, end)
seqs_spliced = seqs[:,:,start:end+1]
print(seqs_spliced.shape)
print("Sequence divisible by 3:",seqs_spliced.shape[2]%3==0)
seqs_splicedWe can also confirm that the spliced sequence is the same length as the original sequence, which pysensembl conveniently provides for us. Though this may not always be the case when the sample contains indels that change the length.
and that the resulting spliced sequence is of the same length of the reference spliced sequence
print(len(tx.coding_sequence), seqs_spliced.shape[2])
print("Sequence is the expected length:", seqs_spliced.shape[2]==len(tx.coding_sequence))Translating to amino acids
So far things are working as expected. But here's where I'm getting a bit confused. I can't seem to find the docs for seqpro so I'm kind of guessing how this is supposed to work:
Feeding in sequences from both haplotypes, with the original object dimensions (3)
sp.AA.translate(seqs_spliced, length_axis=-1)---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[70], line 1
----> 1 sp.AA.translate(seqs_spliced, length_axis=-1)
File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")
ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)Feeding in a single sequence:
sp.AA.translate(seqs_spliced[0], length_axis=-1)---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[65], line 1
----> 1 sp.AA.translate(seqs_spliced[0], length_axis=-1)
File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")
ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)We also would like to translate the pysensembl sequence and compare the result with the AA sequence produced using their internal functions.
ref_aa_pyensembl = tx.protein_sequence
print(ref_aa_pyensembl)
ref_aa_gvl = sp.AA.translate(tx.coding_sequence, length_axis=-1)MVEMLPTAILLVLAVSVVAKDNATCDGPCGLRFRQNPQGGVRIVGGKAAQHGAWPWMVSLQIFTYNSHRYHTCGGSLLNSRWVLTAAHCFVGKNNVHDWRLVFGAKEITYGNNKPVKAPLQERYVEKIIIHEKYNSATEGNDIALVEITPPISCGRFIGPGCLPHFKAGLPRGSQSCWVAGWGYIEEKAPRPSSILMEARVDLIDLDLCNSTQWYNGRVQPTNVCAGYPVGKIDTCQGDSGGPLMCKDSKESAYVVVGITSWGVGCARAKRPGIYTATWPYLNWIASKIGSNALRMIQSATPPPPTTRPPPIRPPFSHPISAHLPWYFQPPPRPLPPRPPAAQPRPPPSPPPPPPPPASPLPPPPPPPPPTPSSTTKLPQGLSFAKRLQQLIEVLKGKTYSDGKNHYDMETTELPELTSTS
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[72], line 3
      1 ref_aa_pyensembl = tx.protein_sequence
      2 print(ref_aa_pyensembl)
----> 3 sp.AA.translate(tx.coding_sequence, length_axis=-1)
File ~/.conda/envs/genome-loader/lib/python3.12/site-packages/seqpro/alphabets/_alphabets.py:265, in AminoAlphabet.translate(self, seqs, length_axis)
    257 strides = (
    258     *seqs.strides[:length_axis],
    259     codon_size,
    260     seqs.strides[length_axis],
    261     *seqs.strides[length_axis + 1 :],
    262 )
    263 codons = np.lib.stride_tricks.as_strided(seqs, shape=shape, strides=strides)
--> 265 return gufunc_translate(
    266     codons.view(np.uint8),
    267     self.codon_array.view(np.uint8),
    268     self.aa_array.view(np.uint8),
    269     axes=[codon_axis, (-2, -1), (-1), ()],  # type: ignore
    270 ).view("S1")
ValueError: gufunc_translate: Input operand 2 has a mismatch in its core dimension 0, with gufunc signature (k),(j,k),(j)->() (size 21 is different from 64)Thanks in advance for your help!






