Changes from all commits
Commits
70 commits
bbbbda9
docs: fix version format to be vX.Y.Z
d-laub Mar 10, 2025
2719433
feat: initial prototype for splicing.
d-laub Mar 11, 2025
9804e0b
feat(wip): testing spliced return values
d-laub Mar 11, 2025
ea039af
Merge branch 'main' into dlaub/splice
d-laub Apr 2, 2025
13229c9
feat!: move indices and transformation to torch dataset/dataloader AP…
d-laub Apr 4, 2025
f169d4b
test: update for breaking changes in API.
d-laub Apr 4, 2025
13dfad9
feat: add members to conveniently inspect dataset splicing info.
d-laub Apr 6, 2025
3df050a
fix: spliced i2d_map
d-laub Apr 8, 2025
53716e2
fix: __getitem__ type annotations for StrIdx
d-laub Apr 8, 2025
db33714
Merge branch 'main' into dlaub/splice
d-laub Apr 18, 2025
db849e0
fix: update spliced_bed in with_settings for splice_info
d-laub Apr 18, 2025
16cf149
fix: parsing splice info and returning single item instead of list
d-laub Apr 18, 2025
353750b
chore: wip for fixing cat_length
d-laub Apr 20, 2025
17050b9
chore: fix cat_helper for splicing
d-laub Apr 21, 2025
bd5525c
chore: wip on svar support
d-laub Apr 21, 2025
d773a25
feat: SVAR support passes all tests
d-laub Apr 23, 2025
c7b606b
fix: add spanning dels to test and fix hap ilens for this case
d-laub Apr 23, 2025
cb8129a
Merge branch 'dlaub/svar' into dlaub/splice
d-laub Apr 24, 2025
892ced2
fix: variant index -> variant info mapping
d-laub Apr 24, 2025
f054895
build: update dependencies
d-laub Apr 25, 2025
3f06258
chore: wip on svar support
d-laub Apr 21, 2025
049e9a8
feat: SVAR support passes all tests
d-laub Apr 23, 2025
ba2d1c6
fix: add spanning dels to test and fix hap ilens for this case
d-laub Apr 23, 2025
e1e6f78
fix: continue migrating to seqpro Ragged, enable logger at module lev…
d-laub Apr 29, 2025
2650716
bump: version 0.12.0 → 0.13.0
github-actions[bot] Apr 30, 2025
222daef
build: change gh workflows to run on stable
d-laub Apr 30, 2025
d820b7a
build: change tag format
d-laub Apr 30, 2025
6630714
build: bump dependencies
d-laub Apr 30, 2025
6b8fa5d
docs: update requirements
d-laub Apr 30, 2025
7537162
docs: annotate doc requirements
d-laub Apr 30, 2025
d36be00
build: bump rust extension, python ABI compatibility
d-laub Apr 30, 2025
6fa4244
chore: merge conflicts
d-laub Apr 30, 2025
e9eeb0a
Merge branch 'main' into dlaub/splice
d-laub Apr 30, 2025
248d959
style: ignore type error on view using str argument
d-laub Apr 30, 2025
9c6988f
Merge branch 'main' into dlaub/splice
d-laub Apr 30, 2025
1d71a3d
Merge branch 'main' into dlaub/splice
d-laub Apr 30, 2025
5665fcb
Merge branch 'main' into dlaub/splice
d-laub May 9, 2025
d0aa8b9
ci: update lockfile
d-laub May 9, 2025
14e90b4
test: remove return indices option
d-laub May 9, 2025
86947f6
test: more precise types
d-laub May 9, 2025
6aee077
fix: map contig names appropriately for bounds checking on ds regions…
d-laub May 9, 2025
bf26df7
Merge branch 'main' into dlaub/splice
d-laub May 10, 2025
90b0172
style: ruff formatting
d-laub May 10, 2025
bbfbfd5
ci: update lockfile
d-laub May 10, 2025
e924936
ci: update publish workflow
d-laub May 10, 2025
47ca825
ci: update publish workflow
d-laub May 10, 2025
b2a295c
bump: version 0.14.2 → 0.14.3
github-actions[bot] May 10, 2025
9923cdd
ci: update publish workflow name
d-laub May 10, 2025
0a566b5
ci: update publish workflow
d-laub May 10, 2025
b480940
ci: update publish workflow
d-laub May 10, 2025
c2b1e3e
ci: update workflows
d-laub May 10, 2025
d21485d
ci: update workflows
d-laub May 10, 2025
14f6725
docs: test if py3.11 fixes pgenlib installation
d-laub May 11, 2025
fe7a2c9
fix: data corruption when rc_helper is parallelized
d-laub May 12, 2025
14a83a3
bump: version 0.14.3 → 0.14.4
github-actions[bot] May 12, 2025
68c0c56
test: add tests for reverse complemented data
d-laub May 12, 2025
e231ff9
Merge branch 'main' into dlaub/splice
d-laub May 12, 2025
3c258b0
Merge branch 'main' into dlaub/splice
d-laub May 19, 2025
d8aa12d
Merge branch 'main' into dlaub/splice
d-laub May 21, 2025
85a5b3c
fix: virtual indexing for splice indexer
d-laub May 22, 2025
b55f181
Merge branch 'main' into dlaub/splice
d-laub May 25, 2025
4f2ce16
fix: exons are already in reverse order for negative stranded genes
d-laub May 26, 2025
854b5a1
Merge branch 'main' into dlaub/splice
d-laub May 27, 2025
f01ab0c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 27, 2025
ad8e486
fix: make sure exonic filter gets applied. style: adhere to pre-commit
d-laub May 27, 2025
1f85b60
Merge branch 'main' into dlaub/splice
d-laub May 27, 2025
857a86a
Merge branch 'main' into dlaub/splice
d-laub May 27, 2025
ae4c677
Merge branch 'main' into dlaub/splice
d-laub May 27, 2025
54b5c81
Merge branch 'main' into dlaub/splice
d-laub May 27, 2025
84505e4
chore: sync lockfile
d-laub May 27, 2025
2 changes: 1 addition & 1 deletion .github/workflows/bump.yaml
@@ -22,6 +22,6 @@ jobs:
uses: softprops/action-gh-release@v2
with:
body_path: "body.md"
tag_name: ${{ env.REVISION }}
tag_name: v${{ env.REVISION }}
env:
GITHUB_TOKEN: ${{ secrets.COMMITIZEN }}
9 changes: 9 additions & 0 deletions CHANGELOG.md → docs/source/changelog.md
@@ -852,6 +852,15 @@

## v0.9.0 (2025-03-06)

This is a breaking change for GVL. Users should view the ["What's a `gvl.Dataset`?"](https://genvarloader.readthedocs.io/en/latest/dataset.html) page in the documentation for details, but major breaks include:

- removed the `length` argument from `gvl.write()`. Regions/BED files are now used as-is. If you want uniform length regions centered on inputs/peaks as before, preprocess your BED file with `gvl.with_length`.
- changed `Dataset.output_length` from a property to a dynamic setting with behavior described in the "What's a gvl.Dataset?" page.
- changed track output shape to have a track axis.
- Datasets are now deterministic by default.

As a result of these changes, GVL seamlessly supports ragged length output and also paves the way for on-the-fly splicing. Since many changes were made, I wouldn't be surprised if a few bugs crop up despite my best efforts -- please leave issues if so!
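
As an illustration of the `gvl.with_length` preprocessing mentioned above, here is a minimal editorial sketch (not part of this PR). It assumes `gvl.with_length` accepts a polars DataFrame of BED-like regions plus a target length, and the `gvl.write` parameter names are placeholders rather than the verified API:

```python
# Editorial sketch: recreate the old fixed-length behavior by expanding
# regions *before* calling gvl.write(). The with_length/write signatures
# below are assumptions inferred from the changelog, not the verified API.
import genvarloader as gvl
import polars as pl

peaks = pl.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "chromStart": [1_000, 5_000],
        "chromEnd": [1_200, 5_300],
    }
)

# Uniform 2,048 bp windows centered on the input peaks.
regions = gvl.with_length(peaks, 2_048)

# gvl.write() no longer takes a `length` argument; regions are used as-is.
gvl.write(
    path="peaks.gvl",          # hypothetical output store
    bed=regions,
    variants="cohort.vcf.gz",  # hypothetical variant source
)
```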

### Feat

- option to return ragged data from gvl.Dataset. output_length is set dynamically. fix: hap reconstruction matches bcftools. change default for Dataset.deterministic from False to True. change track output from a list of arrays to having a track dimension i.e. from shape (b [p] l) to (b t [p] l). docs: add dataset.md, faq.md and overhaul geuvadis.ipynb to be simpler and reflect changes in API.
21 changes: 21 additions & 0 deletions docs/source/changelog.md.j2
@@ -0,0 +1,21 @@
# Changelog

{% for entry in tree %}

## {{ entry.version }}{% if entry.date %} ({{ entry.date }}){% endif %}

{% for change_key, changes in entry.changes.items() %}

{% if change_key %}
### {{ change_key }}
{% endif %}

{% for change in changes %}
{% if change.scope %}
- **{{ change.scope }}**: {{ change.message }}
{% elif change.message %}
- {{ change.message }}
{% endif %}
{% endfor %}
{% endfor %}
{% endfor %}
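
To preview what this template produces, one could render it directly with jinja2 against a toy tree. The following editorial sketch assumes an entry structure with `version`, `date`, and `changes` keys that mirrors the fields the template iterates over, not the exact object commitizen passes in:

```python
# Editorial sketch: render the changelog template against a hand-built tree.
# The dict structure is an assumption that mirrors the template's field access.
from pathlib import Path

from jinja2 import Template

template = Template(Path("docs/source/changelog.md.j2").read_text())
tree = [
    {
        "version": "v0.15.0",  # toy version, not a real release
        "date": "2025-05-27",
        "changes": {
            "Feat": [{"scope": None, "message": "on-the-fly splicing for datasets"}],
            "Fix": [{"scope": "dataset", "message": "apply the exonic variant filter"}],
        },
    }
]
print(template.render(tree=tree))
```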
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -6,6 +6,7 @@ write
geuvadis
faq
api
changelog
```

# GenVarLoader
267 changes: 267 additions & 0 deletions docs/source/splicing.ipynb

Large diffs are not rendered by default.

4,084 changes: 1,359 additions & 2,725 deletions pixi.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions pixi.toml
@@ -58,6 +58,8 @@ ipywidgets = "*"
sphinx-book-theme = "*"
sphinx-autobuild = "*"
sphinx-autodoc-typehints = "*"
seaborn = "*"
fast-histogram = "*"

[feature.pytorch-cpu.dependencies]
pytorch-cpu = ">=2,<3"
@@ -97,6 +99,7 @@ gen = "python tests/data/generate_ground_truth.py"
test = { cmd = "pytest tests && cargo test --release", depends-on = ["gen"] }

[feature.docs.tasks]
install-e = "uv pip install -e /cellar/users/dlaub/projects/ML4GLand/SeqPro -e /cellar/users/dlaub/projects/genoray -e ."
i-kernel = "ipython kernel install --user --name 'gvl-docs' --display-name 'GVL Docs'"
i-kernel-gpu = "ipython kernel install --user --name 'gvl-docs-gpu' --display-name 'GVL Docs GPU'"
doc = "cd docs && make clean && make html"
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -66,6 +66,7 @@ reportUninitializedInstanceVariable = false
[tool.maturin]
python-source = "python"
features = ["pyo3/extension-module"]
# compatibility = "manylinux_2_28"

[tool.pytest.ini_options]
filterwarnings = [
@@ -83,6 +84,10 @@ legacy_tag_formats = ['v$version']
version_scheme = "semver2"
version_provider = "pep621"
update_changelog_on_bump = true
changelog_file = 'docs/source/changelog.md'
changelog_incremental = true
changelog_start_rev = "v0.9.1"
template = "docs/source/changelog.md.j2"
major_version_zero = true
allowed_prefixes = ["Merge", "Revert", "Pull request", "fixup!", "squash!", "[pre-commit.ci]"]

110 changes: 105 additions & 5 deletions python/genvarloader/_dataset/_genotypes.py
@@ -100,7 +100,7 @@ def get_diffs_sparse(
return diffs


@nb.njit(parallel=True, nogil=True, cache=True)
# @nb.njit(parallel=True, nogil=True, cache=True)
def reconstruct_haplotypes_from_sparse(
out: NDArray[np.uint8],
out_offsets: NDArray[np.integer],
@@ -117,9 +117,9 @@
ref_offsets: NDArray[np.integer],
pad_char: int,
keep: NDArray[np.bool_] | None = None,
keep_offsets: NDArray[np.int64] | None = None,
annot_v_idxs: NDArray[np.int32] | None = None,
annot_ref_pos: NDArray[np.int32] | None = None,
keep_offsets: NDArray[np.integer] | None = None,
annot_v_idxs: NDArray[np.integer] | None = None,
annot_ref_pos: NDArray[np.integer] | None = None,
):
"""Reconstruct haplotypes from reference sequence and variants.

@@ -211,7 +211,7 @@
)


@nb.njit(nogil=True, cache=True)
# @nb.njit(nogil=True, cache=True)
def reconstruct_haplotype_from_sparse(
offset_idx: int,
geno_v_idxs: NDArray[np.integer],
@@ -407,3 +407,103 @@ def reconstruct_haplotype_from_sparse(
annot_v_idxs[out_end_idx:] = -1
if annot_ref_pos is not None:
annot_ref_pos[out_end_idx:] = np.iinfo(np.int32).max


@nb.njit(parallel=True, nogil=True, cache=True)
def choose_exonic_variants(
starts: NDArray[np.integer],
ends: NDArray[np.integer],
geno_offset_idxs: NDArray[np.integer],
geno_v_idxs: NDArray[np.integer],
geno_offsets: NDArray[np.integer],
v_starts: NDArray[np.integer],
ilens: NDArray[np.integer],
) -> tuple[NDArray[np.bool_], NDArray[np.integer]]:
"""Mark variants to keep for each haplotype.

Parameters
----------
starts : NDArray[np.integer]
Shape = (n_regions) Start of each query region.
ends : NDArray[np.integer]
Shape = (n_regions) End of each query region.
geno_offset_idxs : NDArray[np.integer]
Shape = (n_regions, ploidy) Index of each (region, haplotype) into geno_offsets.
geno_v_idxs : NDArray[np.integer]
Sparse genotypes i.e. variant indices for ALT genotypes.
geno_offsets : NDArray[np.integer]
Offsets into geno_v_idxs, either as a 1D offsets array or as (start, end) pairs.
v_starts : NDArray[np.integer]
Start positions of variants.
ilens : NDArray[np.integer]
Indel lengths of variants.

Returns
-------
keep : NDArray[np.bool_]
Mask over each (region, haplotype)'s sparse genotypes marking variants to keep.
keep_offsets : NDArray[np.int64]
Shape = (n_regions * ploidy + 1) Offsets into keep.
"""
n_regions, ploidy = geno_offset_idxs.shape

lengths = np.empty((n_regions, ploidy), np.int64)
for query in nb.prange(n_regions):
for hap in range(ploidy):
o_idx = geno_offset_idxs[query, hap]
if geno_offsets.ndim == 1:
o_s, o_e = geno_offsets[o_idx], geno_offsets[o_idx + 1]
else:
o_s, o_e = geno_offsets[o_idx]
lengths[query, hap] = o_e - o_s
keep_offsets = np.empty(n_regions * ploidy + 1, np.int64)
keep_offsets[0] = 0
keep_offsets[1:] = lengths.cumsum()

n_variants = keep_offsets[-1]
keep = np.empty(n_variants, np.bool_)

for query in nb.prange(n_regions):
ref_start: int = starts[query]
ref_end: int = ends[query]
for hap in nb.prange(ploidy):
o_idx = geno_offset_idxs[query, hap]
o_s, o_e = geno_offsets[o_idx], geno_offsets[o_idx + 1]
qh_genos = geno_v_idxs[o_s:o_e]

k_idx = query * ploidy + hap
k_s, k_e = keep_offsets[k_idx], keep_offsets[k_idx + 1]
qh_keep = keep[k_s:k_e]

_choose_exonic_variants(
query_start=ref_start,
query_end=ref_end,
variant_idxs=qh_genos,
positions=v_starts,
sizes=ilens,
keep=qh_keep,
)

return keep, keep_offsets


@nb.njit(nogil=True, cache=True)
def _choose_exonic_variants(
query_start: int,
query_end: int,
variant_idxs: NDArray[np.integer], # (v)
positions: NDArray[np.integer], # (total variants)
sizes: NDArray[np.integer], # (total variants)
keep: NDArray[np.bool_], # (v)
):
"""Create a mask for variants that are fully contained within the query interval, which is
assumed to correspond to the exon boundaries."""
# no variants
if len(variant_idxs) == 0:
return

for v in range(len(variant_idxs)):
v_idx: int = variant_idxs[v]
v_pos = positions[v_idx]
# +1 for atomized
v_ref_end = v_pos - min(0, sizes[v_idx]) + 1

if v_pos >= query_start and v_ref_end <= query_end:
keep[v] = True
else:
keep[v] = False
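
For readers following `_choose_exonic_variants`, here is a small editorial NumPy sketch (outside Numba, with made-up positions and indel lengths) of the containment rule: a variant is kept only when its REF span, ending at `pos - min(0, ilen) + 1`, lies fully inside the query/exon interval.

```python
# Editorial sketch: the containment rule from _choose_exonic_variants in plain
# NumPy. Positions and indel lengths below are made up for illustration.
import numpy as np

def exonic_mask(query_start: int, query_end: int, v_starts, ilens):
    # Deletions (negative ilen) extend the REF span; +1 matches the atomized
    # variant convention used above.
    v_ref_ends = v_starts - np.minimum(0, ilens) + 1
    return (v_starts >= query_start) & (v_ref_ends <= query_end)

v_starts = np.array([50, 150, 195, 120], dtype=np.int32)  # variant starts
ilens = np.array([0, 0, -10, 2], dtype=np.int32)          # SNP, SNP, 10 bp del, 2 bp ins
# With a query/exon interval of 100-200: the upstream SNP and the deletion
# spanning past the exon end are excluded; the in-exon SNP and insertion are kept.
print(exonic_mask(100, 200, v_starts, ilens))  # [False  True False  True]
```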