Skip to content

Conversation

@benjeffery
Copy link
Contributor

@benjeffery benjeffery commented Apr 7, 2025

No description provided.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Big validation of your general VCZ code.

bio2zarr/ts.py Outdated
self.num_records = self.ts.num_sites
self.positions = self.ts.sites_position

def _make_sample_mapping(self, ploidy):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it belongs in tskit. Let's file an issue there to support once we have something working committed here.

bio2zarr/ts.py Outdated

def iter_field(self, field_name, shape, start, stop):
if field_name == "position":
for pos in self.ts.tables.sites.position[start:stop]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.ts.sites_position - you're creating a copy of the tables every time here!

bio2zarr/ts.py Outdated

self.samples = [vcz.Sample(id=f"tsk_{j}") for j in range(self.num_samples)]

def iter_alleles(self, start, stop, num_alleles):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, it's a bit unfortunate we have to do this. I can see ways to work around though.

@coveralls
Copy link
Collaborator

coveralls commented Apr 8, 2025

Coverage Status

coverage: 98.142% (+0.07%) from 98.074%
when pulling 95c5cf3 on benjeffery:tskit
into e3c31e5 on sgkit-dev:main.

@benjeffery benjeffery force-pushed the tskit branch 2 times, most recently from 868bc9d to 0c6fd5e Compare April 30, 2025 00:16
Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I think we just need an update and a bit more round-trip testing for the initial version. We can make the interface more flexible later, but this is the core of it.

@@ -0,0 +1,287 @@
import logging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about the long-term API, maybe this file should be called tskit.py? Then, clients can do stuff like

bio2zarr.tskit.convert()

I don't think the name-clash of the submodule is a big enough problem to make it worth muddying the interface for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in e0dc5f6

@benjeffery
Copy link
Contributor Author

I've added a load more testing - found some issues around mixed ploidy and sample nodes that aren't at the start of the nodes table. Would appreciate a check on those bits.

@jeromekelleher
Copy link
Contributor

Looks plausible to me, but I'd have to really dig in to make sure it's watertight.

I think a good way to go might be to identify the API that we want from tskit to do this murky data-model migration stuff, and then make it tskit's responsibility to implement it? So for the initial version we just raise errors for mixed ploidies and stuff, keeping this code reasonably simple, and then try to fix a long-term stable API where tskit becomes responsible for mapping stuff into the VCF-Zarr model.

Ultimately, we just want tskit to give us a (n, max_ploidy) array which defines the sample nodes for individuals and punt all the complexity over there, right?

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Apr 30, 2025

I think we can make the API here something like

ind_nodes = ts.individual_nodes()  # (n, ploidy) array, padded with -1s if mixed ploidy
sample_ids = ["x{j}" for j in range(len(ind_nodes))] # Or could look up metadata if you please
bio2zarr.tskit.convert(ts, ind_nodes, sample_ids=sample_ids)

The default in bio2zarr could be to call ts.individual_nodes() with no parameters if not specified and have the standard sample_ids, but the user then has the flexibility to call ts.individual_nodes() directly with various parameters if necessary, which would be responsible for doing all the quirky model translation.

For now, we can do the individual_nodes for the simple case where individuals are defined, and just error out for anything else while we get ts.individual_nodes() implemented and released in tskit (which should be easy). How does that sound?

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tthis is going in the right direction, but we need to tackle the efficiency issue up front. We need to

  • Generate a flattened list of nodes that corresponds to the non-missing elements of the individual_nodes array
  • A mapping from this array back into the output genotypes so that we can do something like
for variant in ts.variants(samples=node_map_flattened):
    for ploid in range(max_ploidy):
             genotypes[non_missing[ploid], ploid] = variant.genotypes[non_missing_map[ploid]]

self.node_ids_array = individual_nodes(ts)
self._num_samples = ts.num_individuals
self.max_ploidy = self.node_ids_array.shape[1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope of this is too wide, you're just covering the individual_nodes call

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, might as well make this general straight away:

self._num_samples = self.node_ids.shape[0]

gt[i, j] = genotypes[sample_index + j]
sample_index += ploidy
# For each individual, get genotypes for their nodes
for i in range(self.num_samples):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be some numpy-way of doings this without iterating over samples - otherwise this is going to be super slow

phased = np.ones(shape[:-1], dtype=bool)

for variant in self.ts.variants(
samples=self.sample_ids,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to pass samples somehow, otherwise we can't get data for non-sample nodes out.

@benjeffery
Copy link
Contributor Author

Sorry I pushed a bit early as I was moving individual_nodes to the caller. Good idea on the numpy assignment though, I have implemented that.

@benjeffery benjeffery marked this pull request as ready for review May 2, 2025 18:55
@benjeffery
Copy link
Contributor Author

Not sure if this one slipped the net - it's ready for another look over.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Spotted a few small things


# Determine max number of alleles
max_alleles = 0
for variant in self.ts.variants():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like overkill - how about we count the maximum number of distinct states? Like

max_alleles = 0
for site in ts.sites():
    states = {site.ancestral_state}
    for mut in site.mutations:
          states.add(mut.derived_state)
    max_alleles = max(len(states), max_alleles)

Could probably be done quicker with numpy, but this'll be fine.

It's a safe maximum, and it'll be correct in the vast majority of cases right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4d5a9b4

This seems right - it can higher if there are back mutations, but I can't see how it could be smaller.

vcz.ZarrArraySpec(
source="position",
name="variant_position",
dtype="i4",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially have position values that won't fit into int32, we should check for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 304a85d

I've made it use i8 if larger positions are found.

vcz.ZarrArraySpec(
source=None,
name="call_genotype",
dtype="i1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check that max alleles will fit into this type. There's a function like min_dtype or something for this somewhere in this repo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6eb4caf

def __init__(
self,
ts_path,
individual_nodes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best make this individuals_nodes for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 727e47d

max_position = 0
if self.ts.num_sites > 0:
max_position = np.max(self.ts.sites_position)
position_dtype = "i4" if max_position <= np.iinfo(np.int32).max else "i8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use min_int_dtype here also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done in 27a79ec

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. looks like ready to squash and merge?

@benjeffery
Copy link
Contributor Author

Squashed! Will merge when CI happy.

@benjeffery
Copy link
Contributor Author

On that note, shall I implement the github auto merge here? Seems to work well over at tskit.

@jeromekelleher
Copy link
Contributor

On that note, shall I implement the github auto merge here? Seems to work well over at tskit.

yes please!

Can you do vcztools also?

@benjeffery benjeffery added this pull request to the merge queue May 8, 2025
@benjeffery
Copy link
Contributor Author

Added to the auto queue, which you can see here: https://github.com/sgkit-dev/bio2zarr/queue/main

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 8, 2025
@benjeffery benjeffery added this pull request to the merge queue May 8, 2025
Merged via the queue into sgkit-dev:main with commit 33c66a3 May 8, 2025
16 checks passed
@benjeffery benjeffery deleted the tskit branch May 8, 2025 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants