Skip to content

Conversation

@d-laub
Copy link
Collaborator

@d-laub d-laub commented Mar 11, 2025

Closes #24.

  • docs: fix version format to be vX.Y.Z

  • feat: initial prototype for splicing.

  • Splice regions together

  • Allow different definition of an overlapping variant to be fully exonic and not overlapping with splice sites a la Haplosaurus.

  • Update Dataset API (or maybe a new class) to reflect different shape and definition of a row.

  • Tests against Haplosaurus on 1kGP chr22 @bschilder

  • Performance issues, possibly from slow RC

@d-laub d-laub marked this pull request as ready for review April 6, 2025 01:17
@d-laub d-laub marked this pull request as draft April 6, 2025 01:18
@d-laub d-laub added the type: enhancement New feature or request label Apr 8, 2025
@bschilder bschilder mentioned this pull request Apr 15, 2025
@d-laub d-laub marked this pull request as ready for review April 30, 2025 03:56
@bschilder
Copy link
Collaborator

The .gvi file is an Apache Feather file that stores non-genotype info for every variant so we can do fast intersections with pyranges and pull up this info for GVL during writes too.

Got it, so it's doing something quite different from the .tbi index.
Since this is something GVL creates, would it be worth automatically deleting it when gvl.write(overwrite=True)? Or if you want more precise control, only overwrite the .gvi index file when overwrite>1.

The .gvi indices are implemented in genoray and correspond to the variant file, not a gvl.Dataset, so the .write method never needs to delete an index but can create one if needed. Similar to how we might use bcftools index on a VCF if it wasn't indexed already. However, we don't want to add a bcftools dependency at this time.

The only reason manual manipulation of the genoray index comes up in the example notebook is because of on-the-fly filtering for bi-allelic sites before the call to gvl.write. The tricky part for VCFs is that the variant filter has to accept cyvcf2.Variants, whereas the index filter has to a polars expression and the two have to semantically match. I can raise an error if the index filter is missing, but I can't programmatically check that the filters are the same without iterating over the entire VCF, at which point there is no performance/QOL advantage to passing a polars filter in the first place. If the filters don't match the GVL write could error out due to length mismatches or -- much worse -- silently contain corrupt data. At a minimum I plan to document this clearly before demonstrating and documenting on-the-fly filtering, but this still relies on users reading carefully.

Ok got it, that makes sense. Yeah I guess documentation is the best way to go for now. Thanks for the thorough explanation!

@bschilder
Copy link
Collaborator

Ok, so I tried taking your advice (also corrected chromEnd by subtracting 1).

I also skipped doing any reverse complement steps since that's already taken care of by GVL by default:

I think we've independently confirmed some of my findings in the example notebook:

  • GTF coordinates are 1-based, so they have to be adjusted to be 0-based to match the BED spec. As in the example notebook, this means using .with_columns(pl.col("chromStart") - 1) on the GTF.
  • "exon" features in the example GTF include the 5' and 3' UTRs, so stick to "CDS" as you've tried.
  • Make sure the ref genome build is the same between the GTF, the source of variants, and the reference genome used with Dataet.open. A mismatch across any of these could result in nonsense output.

Amino acid-level

Sequence similarity is definitely improved (25% --> 50%) but still not great.
image

I'm still noticing a lot of excess stop codons in the GVL sequences (in general and compared to the Haplosaurus seqs), which suggests something is going wrong there:
Screenshot 2025-05-12 at 3 18 42 PM

Sequence lengths between GVL and Haplosaurus seqs can also be quite different, which a median sequence length difference of 7 AAs (the range is from 3 to 1513 AAs).

Mean number of stop codons for GVL seqs is 40, ranging from 8 (min) to 75 (max). This is another indicator that something is up with the spliced GVL seqs.

Screenshot 2025-05-12 at 3 45 38 PM

Nucleotide-level

Nucleotide seq similarity is even lower.
image

@d-laub
Copy link
Collaborator Author

d-laub commented May 12, 2025

Hey Brian, this looks good! Thanks to @BradBalderson I just caught and fixed a bug in the reverse complement function. I think it would make sense that ~half of the sequences are way off with that bug. Can you try again with the latest commits?

@bschilder
Copy link
Collaborator

bschilder commented May 13, 2025

Hey Brian, this looks good! Thanks to @BradBalderson I just caught and fixed a bug in the reverse complement function. I think it would make sense that ~half of the sequences are way off with that bug. Can you try again with the latest commits?

Without VCF normalisation

Using the VCF directly to create the GVL db, without any normalisation with bcftoools.

Nucleotide-level

MUCH better seq sim for nuc level.

image

Amino acid-level

Unfortunately, still lots of stop codons:
image

And actually the AA seq sim is a bit lower than before:
image

With VCF normalisation

[placeholder]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type: enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Subsetting sequences by coordinates

3 participants