Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genome region query API #25

Open
alimanfoo opened this issue Jul 8, 2020 · 15 comments
Open

Genome region query API #25

alimanfoo opened this issue Jul 8, 2020 · 15 comments
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc

Comments

@alimanfoo
Copy link
Collaborator

Raising this issue to discuss API for selecting data from a given genome region, which could be either a whole contig or a contiguous region within a contig.

Breaking this out from discussion thread here.

@alimanfoo
Copy link
Collaborator Author

A few options that come to mind:

  • Using an accessor:
ds.genetics.sel_contig(slice('chr1'))
  • Accessor specific to functions on attributes:
ds.sel(variant=ds['variant/contig'] == ds.meta.contig_index('chr1'))
  • Use contig index directly:
ds.sel(variant=ds['variant/contig'] == 0) # for chr1

I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:

contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))

Any thoughts/preferences on those?

FWIW I don't know the best API but as a user I'd like to never have to use contig indices. That creates indirection that makes code harder to read and understand. Particularly where you have contig IDs that don't have a natural sequential order (e.g., in Anopheles we have "2L", "2R", "3L", "3R", "X").

Ideally I'd like a simple method or function where I can pass in three arguments: the contig name, and the start and stop positions of the region (1-based, stop-inclusive). Like the Python equivalent of a tabix query.

@alimanfoo
Copy link
Collaborator Author

Btw apologies in advance if I get opinionated here or elsewhere. I just know I'm going to end up having to explain this to others, and it will really help to have an API that is just obvious and natural from a geneticist point of view. If ever I express a strong opinion please feel free to push back 😄

@alimanfoo
Copy link
Collaborator Author

  • Using an accessor:
ds.genetics.sel_contig(slice('chr1'))

FWIW general API design opinion, as a user I don't like xarray accessors. It's an odd piece of programming that is hard to explain to new Python users. There should only be plain functions, or classes with methods, IMO. I.e., either sgkit.do_something(ds, args) or ds.do_something(args).

@alimanfoo
Copy link
Collaborator Author

alimanfoo commented Jul 8, 2020

I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:

contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))

For a user, having to make a slice to express a single contig is awkward. I know this is trying to work within the capabilities of the native xarray API, but if that is awkward for the user then we should make better functions or methods.

E.g., as a user I'd like to be able to do something like this to select a whole contig:

ds.sel_contig('chr1') 

...or this to select a region within a contig:

ds.sel_contig_region('chr1', 1, 1_000_000)

@eric-czech
Copy link
Collaborator

It doesn't have to be a slice btw, scalars are ok:

contig_index = ds.attrs['contigs'].index('chr1')
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_index, slice(0, 100000)))

Maybe there's a need for a module acting like what ignite is to pytorch or keras to tensorflow, or just more functions in the top level since subclassing is not a good choice. It seems like trying to hide xarray completely from users would be to their detriment, but I can certainly understand what you mean if they don't even necessarily know python.

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Jul 8, 2020

I like Alistair's thinking here. What about

ds["chr1"] # All the variants in ch1
ds["chr1", 1234] # variant at position 1234 on chr1
ds["chr1", 1234:5678] # Variants from 1234 to 5678 (we can argue about start-stop semantics later!)
ds[1234:5679] # Checks the dataset contains only one contig, and returns variatns from 1234:5678

This should be possible with a custom __getitem__, right? There'd be some fiddling with deciding what types are provided and so on, but it could be made to work.

@eric-czech
Copy link
Collaborator

I think it would be nice to override Datasets too, but Xarray won't preserve our subclass through some parts of the API since they recreate them often internally (one of the reasons they recommend against sublcassing). Are you guys thinking of an extension mechanism beyond accessors or subclasses? Maybe I'm missing something.

@jeromekelleher
Copy link
Collaborator

I think it would be nice to override Datasets too, but Xarray won't preserve our subclass through some parts of the API since they recreate them often internally (one of the reasons they recommend against sublcassing). Are you guys thinking of an extension mechanism beyond accessors or subclasses? Maybe I'm missing something.

I'm not really thinking about xarray at all, so I'm missing lots of important technical details! Think of the above as "this is what I'd like as a user". Maybe it's not practical.

@eric-czech
Copy link
Collaborator

Ah I see, those all sound good to me then. I can't immediately see a nice way to do it without switching to a wrapper class around a Dataset, but perhaps it acts more like a "view" of some kind that isn't central to the other parts of the API (details for later).

@alimanfoo
Copy link
Collaborator Author

I'm not really thinking about xarray at all, so I'm missing lots of important technical details! Think of the above as "this is what I'd like as a user". Maybe it's not practical.

DItto, I have no experience of xarray so may be unaware of important constraints. All I know about xarray so far is what you've taught me :-)

FWIW I definitely would agree we don't want to hide xarray. Part of sgkit is introducing users to the key tools in the pydata ecosystem, and so hopefully people will learn something about xarray (and numpy and dask) through using sgkit.

But I don't want to compromise on usability either.

Not quite sure what the best direction to push here is.

@eric-czech
Copy link
Collaborator

But I don't want to compromise on usability either.

Fair enough! FWIW, I'm +1 for a core API based on Xarray and higher-level functions/classes that hide it. I think any Xarray-free APIs should always use our core Xarray API under the hood, and that making a clear distinction between the two APIs might be useful (e.g. like ignite does for pytorch -- which both share core contributors). That seems like a reasonable direction to push towards.

@hammer
Copy link
Contributor

hammer commented Jul 9, 2020

I'm +1 for a core API based on Xarray and higher-level functions/classes that hide it.

Ah, my first opportunity to quote Design principles for a new GWAS Toolkit:

Built with and conforming to the code style and extension APIs of popular libraries from the PyData ecosystem such as NumPy, Pandas, Dask, Xarray, Zarr, and others.

In particular, we should not have to build our own data frame or n-dimensional array APIs. We should inherit those APIs from upstream libraries.

I really want our users to be able to manipulate tabular and array-shaped data with domain-agnostic APIs from pandas and xarray.

I also support a clear separation between the core API and the high-level APIs, as different user communities may want different high-level APIs, as exhibited by the existence of both ignite and skorch.

@hammer hammer added the data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc label Jul 9, 2020
@tomwhite
Copy link
Collaborator

tomwhite commented Mar 2, 2021

I think we should add some helper functions to provide some of the region queries discussed in this issue, since they are non-trivial to do with the Xarray API.

There might also be a case for implementing a tabix-like index on top of Zarr. While Xarray indexing can achieve the same things that tabix can, it's not as efficient, since i) it has to create the index from scratch each time (I believe), ii) every position is stored in memory, and iii) region queries don't take advantage of hierarchical structure.

It might be possible to store tabix-like information (genomic position to record offsets for a subset of records) in the Zarr or Zarray metadata, and use that to implement fast indexing. I'm not sure how easy this would be to integrate with Pandas indexes (they have a pretty complex API), but it's something we could use in region query functions provided in sgkit.

@jeromekelleher
Copy link
Collaborator

Are you thinking of a full interval search @tomwhite? This is what we'd need in general, I think, but it depends on whether we're thinking about the VCF semantics of indels, etc. That would be a significant job.

If we're just looking at the start positions of variants then hopefully the upstream indexes will do the trick.

@tomwhite
Copy link
Collaborator

tomwhite commented Mar 3, 2021

@jeromekelleher, yes - and more generally something that is fast using tabix is slower in sgkit. I haven't tried timing anything, it was just a concern that we may need to flag this, and have a potential way of solving it (e.g. something like a region query function that we can improve over time).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc
Projects
None yet
Development

No branches or pull requests

5 participants