-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Genome region query API #25
Comments
FWIW I don't know the best API but as a user I'd like to never have to use contig indices. That creates indirection that makes code harder to read and understand. Particularly where you have contig IDs that don't have a natural sequential order (e.g., in Anopheles we have "2L", "2R", "3L", "3R", "X"). Ideally I'd like a simple method or function where I can pass in three arguments: the contig name, and the start and stop positions of the region (1-based, stop-inclusive). Like the Python equivalent of a tabix query. |
Btw apologies in advance if I get opinionated here or elsewhere. I just know I'm going to end up having to explain this to others, and it will really help to have an API that is just obvious and natural from a geneticist point of view. If ever I express a strong opinion please feel free to push back 😄 |
FWIW general API design opinion, as a user I don't like xarray accessors. It's an odd piece of programming that is hard to explain to new Python users. There should only be plain functions, or classes with methods, IMO. I.e., either |
For a user, having to make a slice to express a single contig is awkward. I know this is trying to work within the capabilities of the native xarray API, but if that is awkward for the user then we should make better functions or methods. E.g., as a user I'd like to be able to do something like this to select a whole contig:
...or this to select a region within a contig:
|
It doesn't have to be a slice btw, scalars are ok: contig_index = ds.attrs['contigs'].index('chr1')
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_index, slice(0, 100000))) Maybe there's a need for a module acting like what ignite is to pytorch or keras to tensorflow, or just more functions in the top level since subclassing is not a good choice. It seems like trying to hide xarray completely from users would be to their detriment, but I can certainly understand what you mean if they don't even necessarily know python. |
I like Alistair's thinking here. What about ds["chr1"] # All the variants in ch1
ds["chr1", 1234] # variant at position 1234 on chr1
ds["chr1", 1234:5678] # Variants from 1234 to 5678 (we can argue about start-stop semantics later!)
ds[1234:5679] # Checks the dataset contains only one contig, and returns variatns from 1234:5678 This should be possible with a custom |
I think it would be nice to override Datasets too, but Xarray won't preserve our subclass through some parts of the API since they recreate them often internally (one of the reasons they recommend against sublcassing). Are you guys thinking of an extension mechanism beyond accessors or subclasses? Maybe I'm missing something. |
I'm not really thinking about xarray at all, so I'm missing lots of important technical details! Think of the above as "this is what I'd like as a user". Maybe it's not practical. |
Ah I see, those all sound good to me then. I can't immediately see a nice way to do it without switching to a wrapper class around a Dataset, but perhaps it acts more like a "view" of some kind that isn't central to the other parts of the API (details for later). |
DItto, I have no experience of xarray so may be unaware of important constraints. All I know about xarray so far is what you've taught me :-) FWIW I definitely would agree we don't want to hide xarray. Part of sgkit is introducing users to the key tools in the pydata ecosystem, and so hopefully people will learn something about xarray (and numpy and dask) through using sgkit. But I don't want to compromise on usability either. Not quite sure what the best direction to push here is. |
Fair enough! FWIW, I'm +1 for a core API based on Xarray and higher-level functions/classes that hide it. I think any Xarray-free APIs should always use our core Xarray API under the hood, and that making a clear distinction between the two APIs might be useful (e.g. like ignite does for pytorch -- which both share core contributors). That seems like a reasonable direction to push towards. |
Ah, my first opportunity to quote Design principles for a new GWAS Toolkit:
I really want our users to be able to manipulate tabular and array-shaped data with domain-agnostic APIs from I also support a clear separation between the core API and the high-level APIs, as different user communities may want different high-level APIs, as exhibited by the existence of both ignite and skorch. |
I think we should add some helper functions to provide some of the region queries discussed in this issue, since they are non-trivial to do with the Xarray API. There might also be a case for implementing a tabix-like index on top of Zarr. While Xarray indexing can achieve the same things that tabix can, it's not as efficient, since i) it has to create the index from scratch each time (I believe), ii) every position is stored in memory, and iii) region queries don't take advantage of hierarchical structure. It might be possible to store tabix-like information (genomic position to record offsets for a subset of records) in the Zarr or Zarray metadata, and use that to implement fast indexing. I'm not sure how easy this would be to integrate with Pandas indexes (they have a pretty complex API), but it's something we could use in region query functions provided in sgkit. |
Are you thinking of a full interval search @tomwhite? This is what we'd need in general, I think, but it depends on whether we're thinking about the VCF semantics of indels, etc. That would be a significant job. If we're just looking at the start positions of variants then hopefully the upstream indexes will do the trick. |
@jeromekelleher, yes - and more generally something that is fast using tabix is slower in sgkit. I haven't tried timing anything, it was just a concern that we may need to flag this, and have a potential way of solving it (e.g. something like a region query function that we can improve over time). |
Raising this issue to discuss API for selecting data from a given genome region, which could be either a whole contig or a contiguous region within a contig.
Breaking this out from discussion thread here.
The text was updated successfully, but these errors were encountered: