Genome region query API #25

alimanfoo · 2020-07-08T16:36:52Z

Raising this issue to discuss API for selecting data from a given genome region, which could be either a whole contig or a contiguous region within a contig.

Breaking this out from discussion thread here.

alimanfoo · 2020-07-08T16:42:21Z

A few options that come to mind:

Using an accessor:
ds.genetics.sel_contig(slice('chr1'))
Accessor specific to functions on attributes:
ds.sel(variant=ds['variant/contig'] == ds.meta.contig_index('chr1'))
Use contig index directly:
ds.sel(variant=ds['variant/contig'] == 0) # for chr1
I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:
contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))
Any thoughts/preferences on those?

FWIW I don't know the best API but as a user I'd like to never have to use contig indices. That creates indirection that makes code harder to read and understand. Particularly where you have contig IDs that don't have a natural sequential order (e.g., in Anopheles we have "2L", "2R", "3L", "3R", "X").

Ideally I'd like a simple method or function where I can pass in three arguments: the contig name, and the start and stop positions of the region (1-based, stop-inclusive). Like the Python equivalent of a tabix query.

alimanfoo · 2020-07-08T16:47:10Z

Btw apologies in advance if I get opinionated here or elsewhere. I just know I'm going to end up having to explain this to others, and it will really help to have an API that is just obvious and natural from a geneticist point of view. If ever I express a strong opinion please feel free to push back 😄

alimanfoo · 2020-07-08T16:53:15Z

Using an accessor:
ds.genetics.sel_contig(slice('chr1'))

FWIW general API design opinion, as a user I don't like xarray accessors. It's an odd piece of programming that is hard to explain to new Python users. There should only be plain functions, or classes with methods, IMO. I.e., either sgkit.do_something(ds, args) or ds.do_something(args).

alimanfoo · 2020-07-08T16:59:16Z

I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:

contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))

For a user, having to make a slice to express a single contig is awkward. I know this is trying to work within the capabilities of the native xarray API, but if that is awkward for the user then we should make better functions or methods.

E.g., as a user I'd like to be able to do something like this to select a whole contig:

ds.sel_contig('chr1')

...or this to select a region within a contig:

ds.sel_contig_region('chr1', 1, 1_000_000)

eric-czech · 2020-07-08T17:32:19Z

It doesn't have to be a slice btw, scalars are ok:

contig_index = ds.attrs['contigs'].index('chr1')
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_index, slice(0, 100000)))

Maybe there's a need for a module acting like what ignite is to pytorch or keras to tensorflow, or just more functions in the top level since subclassing is not a good choice. It seems like trying to hide xarray completely from users would be to their detriment, but I can certainly understand what you mean if they don't even necessarily know python.

jeromekelleher · 2020-07-08T17:40:45Z

I like Alistair's thinking here. What about

ds["chr1"] # All the variants in ch1
ds["chr1", 1234] # variant at position 1234 on chr1
ds["chr1", 1234:5678] # Variants from 1234 to 5678 (we can argue about start-stop semantics later!)
ds[1234:5679] # Checks the dataset contains only one contig, and returns variatns from 1234:5678

This should be possible with a custom __getitem__, right? There'd be some fiddling with deciding what types are provided and so on, but it could be made to work.

eric-czech · 2020-07-08T18:00:46Z

I think it would be nice to override Datasets too, but Xarray won't preserve our subclass through some parts of the API since they recreate them often internally (one of the reasons they recommend against sublcassing). Are you guys thinking of an extension mechanism beyond accessors or subclasses? Maybe I'm missing something.

jeromekelleher · 2020-07-08T18:13:00Z

I think it would be nice to override Datasets too, but Xarray won't preserve our subclass through some parts of the API since they recreate them often internally (one of the reasons they recommend against sublcassing). Are you guys thinking of an extension mechanism beyond accessors or subclasses? Maybe I'm missing something.

I'm not really thinking about xarray at all, so I'm missing lots of important technical details! Think of the above as "this is what I'd like as a user". Maybe it's not practical.

eric-czech · 2020-07-08T18:44:57Z

Ah I see, those all sound good to me then. I can't immediately see a nice way to do it without switching to a wrapper class around a Dataset, but perhaps it acts more like a "view" of some kind that isn't central to the other parts of the API (details for later).

alimanfoo · 2020-07-08T20:57:56Z

I'm not really thinking about xarray at all, so I'm missing lots of important technical details! Think of the above as "this is what I'd like as a user". Maybe it's not practical.

DItto, I have no experience of xarray so may be unaware of important constraints. All I know about xarray so far is what you've taught me :-)

FWIW I definitely would agree we don't want to hide xarray. Part of sgkit is introducing users to the key tools in the pydata ecosystem, and so hopefully people will learn something about xarray (and numpy and dask) through using sgkit.

But I don't want to compromise on usability either.

Not quite sure what the best direction to push here is.

eric-czech · 2020-07-09T12:17:17Z

But I don't want to compromise on usability either.

Fair enough! FWIW, I'm +1 for a core API based on Xarray and higher-level functions/classes that hide it. I think any Xarray-free APIs should always use our core Xarray API under the hood, and that making a clear distinction between the two APIs might be useful (e.g. like ignite does for pytorch -- which both share core contributors). That seems like a reasonable direction to push towards.

hammer · 2020-07-09T13:21:33Z

I'm +1 for a core API based on Xarray and higher-level functions/classes that hide it.

Ah, my first opportunity to quote Design principles for a new GWAS Toolkit:

Built with and conforming to the code style and extension APIs of popular libraries from the PyData ecosystem such as NumPy, Pandas, Dask, Xarray, Zarr, and others.

In particular, we should not have to build our own data frame or n-dimensional array APIs. We should inherit those APIs from upstream libraries.

I really want our users to be able to manipulate tabular and array-shaped data with domain-agnostic APIs from pandas and xarray.

I also support a clear separation between the core API and the high-level APIs, as different user communities may want different high-level APIs, as exhibited by the existence of both ignite and skorch.

tomwhite · 2021-03-02T17:03:28Z

I think we should add some helper functions to provide some of the region queries discussed in this issue, since they are non-trivial to do with the Xarray API.

There might also be a case for implementing a tabix-like index on top of Zarr. While Xarray indexing can achieve the same things that tabix can, it's not as efficient, since i) it has to create the index from scratch each time (I believe), ii) every position is stored in memory, and iii) region queries don't take advantage of hierarchical structure.

It might be possible to store tabix-like information (genomic position to record offsets for a subset of records) in the Zarr or Zarray metadata, and use that to implement fast indexing. I'm not sure how easy this would be to integrate with Pandas indexes (they have a pretty complex API), but it's something we could use in region query functions provided in sgkit.

jeromekelleher · 2021-03-02T18:11:21Z

Are you thinking of a full interval search @tomwhite? This is what we'd need in general, I think, but it depends on whether we're thinking about the VCF semantics of indels, etc. That would be a significant job.

If we're just looking at the start positions of variants then hopefully the upstream indexes will do the trick.

tomwhite · 2021-03-03T09:58:09Z

@jeromekelleher, yes - and more generally something that is fast using tabix is slower in sgkit. I haven't tried timing anything, it was just a concern that we may need to flag this, and have a potential way of solving it (e.g. something like a region query function that we can improve over time).

alimanfoo mentioned this issue Jul 8, 2020

Pysnptools reader implementation sgkit-dev/sgkit-plink#1

Merged

hammer added the data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc label Jul 9, 2020

This was referenced Jul 10, 2020

Allele count implementation #36

Merged

GWAS linear regression implementation #16

Merged

eric-czech mentioned this issue Aug 6, 2020

[WIP] Docs describing the Genotype Call XArray #78

Closed

tomwhite mentioned this issue Jan 5, 2021

Genomic coordinate systems #434

Open

tomwhite mentioned this issue Aug 31, 2021

Implement region queries #658

Closed

ivirshup mentioned this issue Nov 4, 2022

Allow start==stop for point data open2c/bioframe#127

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Genome region query API #25

Genome region query API #25

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020 •

edited

Loading

eric-czech commented Jul 8, 2020

jeromekelleher commented Jul 8, 2020 •

edited

Loading

eric-czech commented Jul 8, 2020

jeromekelleher commented Jul 8, 2020

eric-czech commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

eric-czech commented Jul 9, 2020

hammer commented Jul 9, 2020

tomwhite commented Mar 2, 2021

jeromekelleher commented Mar 2, 2021

tomwhite commented Mar 3, 2021

Genome region query API #25

Genome region query API #25

Comments

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

alimanfoo commented Jul 8, 2020 • edited Loading

eric-czech commented Jul 8, 2020

jeromekelleher commented Jul 8, 2020 • edited Loading

eric-czech commented Jul 8, 2020

jeromekelleher commented Jul 8, 2020

eric-czech commented Jul 8, 2020

alimanfoo commented Jul 8, 2020

eric-czech commented Jul 9, 2020

hammer commented Jul 9, 2020

tomwhite commented Mar 2, 2021

jeromekelleher commented Mar 2, 2021

tomwhite commented Mar 3, 2021

alimanfoo commented Jul 8, 2020 •

edited

Loading

jeromekelleher commented Jul 8, 2020 •

edited

Loading