Splicing #38

d-laub · 2025-03-11T04:14:28Z

Closes #24.

docs: fix version format to be vX.Y.Z
feat: initial prototype for splicing.
Splice regions together
Allow different definition of an overlapping variant to be fully exonic and not overlapping with splice sites a la Haplosaurus.
Update Dataset API (or maybe a new class) to reflect different shape and definition of a row.
Tests against Haplosaurus on 1kGP chr22 @bschilder
Performance issues, possibly from slow RC
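
As a rough illustration of what "splice regions together" means here, the sketch below joins exon sequences from a toy reference and reverse complements the result for minus-strand transcripts. The `splice` function, its 0-based half-open `(start, end)` tuples, and the toy reference are all hypothetical; this is not GVL's API or its zero-copy implementation.

```python
# Hypothetical sketch: not GVL's implementation or API.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice(reference: str, exons: list[tuple[int, int]], strand: str = "+") -> str:
    """Concatenate exons given as 0-based, half-open (start, end) intervals.

    Exons are sorted by genomic start and joined; the joined sequence is
    reverse complemented for minus-strand transcripts.
    """
    joined = "".join(reference[start:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


reference = "AAACCCGGGTTT"
exons = [(0, 3), (6, 9)]  # "AAA" and "GGG"
plus = splice(reference, exons, "+")
minus = splice(reference, exons, "-")
```

Reverse complementing the joined sequence is equivalent to reverse complementing each exon and joining them in descending genomic order, which is why exon order and strand handling need to be decided in one place.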


bschilder · 2025-05-12T15:37:57Z

The .gvi file is an Apache Feather file that stores non-genotype info for every variant so we can do fast intersections with pyranges and pull up this info for GVL during writes too.

Got it, so it's doing something quite different from the .tbi index.
Since this is something GVL creates, would it be worth automatically deleting it when gvl.write(overwrite=True)? Or if you want more precise control, only overwrite the .gvi index file when overwrite>1.

The .gvi indices are implemented in genoray and correspond to the variant file, not a gvl.Dataset, so the .write method never needs to delete an index but can create one if needed. Similar to how we might use bcftools index on a VCF if it wasn't indexed already. However, we don't want to add a bcftools dependency at this time.

The only reason manual manipulation of the genoray index comes up in the example notebook is the on-the-fly filtering for bi-allelic sites before the call to gvl.write. The tricky part for VCFs is that the variant filter has to accept cyvcf2.Variants, whereas the index filter has to be a polars expression, and the two have to match semantically. I can raise an error if the index filter is missing, but I can't programmatically check that the filters are the same without iterating over the entire VCF, at which point there is no performance/QOL advantage to passing a polars filter in the first place. If the filters don't match, the GVL write could error out due to length mismatches or, much worse, silently contain corrupt data. At a minimum I plan to document this clearly before demonstrating and documenting on-the-fly filtering, but this still relies on users reading carefully.
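
To make the matching requirement concrete, here is a minimal sketch of two filters that are meant to express the same bi-allelic condition, plus a spot check over the first few records. Everything in it is a stand-in: `FakeVariant` mimics only the `ALT` attribute of a cyvcf2.Variant, the index rows are plain dicts rather than a polars frame, and `filters_agree` is a hypothetical helper, not part of GVL or genoray. A spot check on a prefix is cheap but cannot prove that the filters agree on the whole file.

```python
# Hypothetical stand-ins: FakeVariant mimics only the ALT attribute of a
# cyvcf2.Variant, and index rows are plain dicts rather than a polars frame.
from dataclasses import dataclass
from itertools import islice


@dataclass
class FakeVariant:
    ALT: list


def variant_filter(v) -> bool:
    """Record-level filter: keep bi-allelic sites (exactly one ALT allele)."""
    return len(v.ALT) == 1


def index_filter(row) -> bool:
    """Index-level filter that is meant to express the same condition."""
    return len(row["ALT"]) == 1


def filters_agree(variants, index_rows, n=1000) -> bool:
    """Spot-check the first n records; this cannot prove the filters match."""
    pairs = islice(zip(variants, index_rows), n)
    return all(variant_filter(v) == index_filter(r) for v, r in pairs)


alts = [["A"], ["A", "T"], ["G"]]
variants = [FakeVariant(a) for a in alts]
index_rows = [{"ALT": a} for a in alts]
agree = filters_agree(variants, index_rows)
```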

Ok got it, that makes sense. Yeah I guess documentation is the best way to go for now. Thanks for the thorough explanation!

bschilder · 2025-05-12T19:47:09Z

Ok, so I tried taking your advice (also corrected chromEnd by subtracting 1).

I also skipped doing any reverse complement steps since that's already taken care of by GVL by default:

Questions about strands #78 (comment)

I think we've independently confirmed some of my findings in the example notebook:

GTF coordinates are 1-based, so they have to be adjusted to be 0-based to match the BED spec. As in the example notebook, this means using .with_columns(pl.col("chromStart") - 1) on the GTF.

"exon" features in the example GTF include the 5' and 3' UTRs, so stick to "CDS" as you've tried.

Make sure the ref genome build is the same between the GTF, the source of variants, and the reference genome used with Dataset.open. A mismatch across any of these could result in nonsense output.
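
The first two points can be sketched in plain Python. GTF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open, so only the start coordinate shifts. The `gtf_to_bed` helper and the tuple layout of the toy rows are hypothetical and only for illustration; the example notebook does the same thing with polars.

```python
# Hypothetical helper; output field names follow the BED convention.
def gtf_to_bed(rows, feature="CDS"):
    """Keep rows of one feature type and convert 1-based inclusive GTF
    coordinates to 0-based half-open BED coordinates."""
    bed = []
    for chrom, feat, start, end, strand in rows:
        if feat != feature:
            continue
        # Only the start moves: a 1-based inclusive end equals a 0-based exclusive end.
        bed.append(
            {"chrom": chrom, "chromStart": start - 1, "chromEnd": end, "strand": strand}
        )
    return bed


gtf_rows = [
    ("chr22", "exon", 100, 200, "+"),  # "exon" features may include UTRs
    ("chr22", "CDS", 150, 200, "+"),
]
bed_rows = gtf_to_bed(gtf_rows)
```

After conversion, `chromEnd - chromStart` equals the feature length in bases, which is a quick sanity check on the coordinates.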

Amino acid-level

Sequence similarity is definitely improved (25% --> 50%) but still not great.

I'm still noticing a lot of excess stop codons in the GVL sequences (in general and compared to the Haplosaurus seqs), which suggests something is going wrong there:

Sequence lengths between GVL and Haplosaurus seqs can also be quite different, with a median sequence length difference of 7 AAs (the range is from 3 to 1513 AAs).

Mean number of stop codons for GVL seqs is 40, ranging from 8 (min) to 75 (max). This is another indicator that something is up with the spliced GVL seqs.
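
For reference, a minimal way to count such stop codons is sketched below using the standard genetic code. The helper names are hypothetical and this is not the code used to produce the numbers above. A well-formed CDS translated in frame would normally contain at most one stop codon, at the very end, so many internal stops usually point to a frame, strand, or coordinate problem.

```python
from itertools import product

# Standard genetic code with codons enumerated in TCAG order; "*" marks a stop.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate in frame 0; a trailing partial codon is dropped and
    codons with unknown bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i : i + 3], "X") for i in range(0, usable, 3))


def count_internal_stops(protein: str) -> int:
    """Count stop codons other than a single terminal one."""
    body = protein[:-1] if protein.endswith("*") else protein
    return body.count("*")


clean = translate("ATGGCCTAA")      # Met, Ala, stop
broken = translate("ATGTAAGCCTAA")  # premature stop after Met
```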

Nucleotide-level

Nucleotide seq similarity is even lower.

d-laub · 2025-05-12T20:45:41Z

Hey Brian, this looks good! Thanks to @BradBalderson I just caught and fixed a bug in the reverse complement function. I think it would make sense that ~half of the sequences are way off with that bug. Can you try again with the latest commits?

bschilder · 2025-05-13T02:16:34Z


Without VCF normalisation

Using the VCF directly to create the GVL db, without any normalisation with bcftools.

Nucleotide-level

MUCH better seq sim for nuc level.

Amino acid-level

Unfortunately, still lots of stop codons:

And actually the AA seq sim is a bit lower than before:

With VCF normalisation

[placeholder]


d-laub added 2 commits March 9, 2025 19:54

docs: fix version format to be vX.Y.Z

bbbbda9

feat: initial prototype for splicing.

2719433

d-laub self-assigned this Mar 11, 2025

d-laub modified the milestones: Sequences exactly corresponding to input regions, Spliced sequences Mar 11, 2025

d-laub added 5 commits March 11, 2025 15:08

feat(wip): testing spliced return values

9804e0b

Merge branch 'main' into dlaub/splice

ea039af

feat!: move indices and transformation to torch dataset/dataloader AP…

13229c9

…I since these are generally never needed outside that context. feat: fully functional zero-copy splicing mechanics. fix: bug in rev and rev comp causing garbage output.

test: update for breaking changes in API.

f169d4b

feat: add members to conveniently inspect dataset splicing info.

13dfad9

d-laub marked this pull request as ready for review April 6, 2025 01:17

d-laub marked this pull request as draft April 6, 2025 01:18

d-laub assigned bschilder Apr 8, 2025

d-laub added the type: enhancement New feature or request label Apr 8, 2025

d-laub added 2 commits April 8, 2025 11:38

fix: spliced i2d_map

3df050a

fix: __getitem__ type annotations for StrIdx

53716e2

bschilder mentioned this pull request Apr 15, 2025

Splicing tests #57

Open

d-laub added 11 commits April 18, 2025 09:20

Merge branch 'main' into dlaub/splice

db33714

fix: update spliced_bed in with_settings for splice_info

db849e0

fix: parsing splice info and returning single item instead of list

16cf149

chore: wip for fixing cat_length

353750b

chore: fix cat_helper for splicing

17050b9

chore: wip on svar support

bd5525c

feat: SVAR support passes all tests

d773a25

fix: add spanning dels to test and fix hap ilens for this case

c7b606b

Merge branch 'dlaub/svar' into dlaub/splice

cb8129a

fix: variant index -> variant info mapping

892ced2

build: update dependencies

f054895

d-laub marked this pull request as ready for review April 30, 2025 03:56

d-laub and others added 13 commits May 12, 2025 13:44

ci: update publish workflow

e924936

ci: update publish workflow

47ca825

bump: version 0.14.2 → 0.14.3

b2a295c

ci: update publish workflow name

9923cdd

ci: update publish workflow

0a566b5

ci: update publish workflow

b480940

ci: update workflows

c2b1e3e

ci: update workflows

d21485d

docs: test if py3.11 fixes pgenlib installation

14f6725

fix: data corruption when rc_helper is parallelized

fe7a2c9

bump: version 0.14.3 → 0.14.4

14a83a3

test: add tests for reverse complemented data

68c0c56

Merge branch 'main' into dlaub/splice

e231ff9

d-laub and others added 13 commits May 18, 2025 21:20

Merge branch 'main' into dlaub/splice

3c258b0

Merge branch 'main' into dlaub/splice

d8aa12d

fix: virtual indexing for splice indexer

85a5b3c

Merge branch 'main' into dlaub/splice

b55f181

fix: exons are already in reverse order for negative stranded genes

4f2ce16

Merge branch 'main' into dlaub/splice

854b5a1

[pre-commit.ci] auto fixes from pre-commit.com hooks

f01ab0c

for more information, see https://pre-commit.ci

fix: make sure exonic filter gets applied. style: adhere to pre-commit

ad8e486

Merge branch 'main' into dlaub/splice

1f85b60

Merge branch 'main' into dlaub/splice

857a86a

Merge branch 'main' into dlaub/splice

ae4c677

Merge branch 'main' into dlaub/splice

54b5c81

chore: sync lockfile

84505e4
