GWAS linear regression implementation #16
Conversation
High-level comment - I like the style you are using here of passing in the whole xarray dataset (…).

Re converting genotype calls to dosage, I think we need a separate function from …. Re API, having to go via …?
In using the name "dosage" there, I was trying to straddle a few possibilities and I think most of them won't align well with allele counting. Another reasonable name might have been "features" or "encoding" since we'll likely want to separate how alleles are counted from how variants are encoded in a GWAS. Additive encodings ignoring sex chromosomes could use a shared allele counting function but either way, I think they're going to diverge enough that it's not worth trying to factor out the common parts yet. I am running into the need for allele counting in some other work though. I was thinking that functions to create variables like 'call/allele_count' and 'variant/allele_count' would both be reasonable. Have you sketched anything out for that by chance @alimanfoo? i.e. following up on https://github.com/pystatgen/sgkit/issues/3#issuecomment-654186782.
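For illustration, here is a minimal sketch of what such functions might look like; the variable names `'call/allele_count'` and `'variant/allele_count'` come from the discussion above, but the dimension names and biallelic assumption are mine, not settled conventions:

```python
import numpy as np
import xarray as xr

def count_call_alleles(ds: xr.Dataset) -> xr.DataArray:
    # Per-call alternate allele count (0, 1 or 2 for a diploid biallelic call),
    # assuming 'call/genotype' has dims (variants, samples, ploidy) holding allele indexes
    return ds["call/genotype"].sum(dim="ploidy").rename("call/allele_count")

def count_variant_alleles(ds: xr.Dataset, n_alleles: int = 2) -> xr.DataArray:
    # Per-variant count of each allele across all calls
    gt = ds["call/genotype"]
    counts = [(gt == a).sum(dim=("samples", "ploidy")) for a in range(n_alleles)]
    return (
        xr.concat(counts, dim="alleles")
        .transpose("variants", "alleles")
        .rename("variant/allele_count")
    )

ds = xr.Dataset(
    {"call/genotype": (("variants", "samples", "ploidy"),
                       np.array([[[0, 1], [1, 1]], [[0, 0], [0, 1]]]))}
)
```

Either result can then be attached to the dataset with `merge`, as discussed below.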
I'm somewhat strongly in favor of an API that doesn't make in-place modifications to the dataset given how straightforward the xarray merge/assign utilities are to use. I think it removes a good bit of complexity (especially for methods that add many fields) and should be familiar for Pandas users. Maybe @tomwhite has some thoughts on that one?
Sounds good to me. I'm happy with …
FWIW the assign syntax looked painful to me; there's a lot that would need explaining to a new user. Bear in mind many scikit-allel users were also new to Python, so tricks like unpacking a dict can cause people to stumble. Also, making the user specify the dimension names here seemed unnecessary given that we know what the dimensions will be. Also a load of typing! What about if you could do …?
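For context, the verbose pattern being objected to probably looks something like the sketch below (variable names are mine for illustration): because the variable name contains a `/` it can't be a plain keyword argument, so dict unpacking and explicit dimension names are both required.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"call/genotype": (("variants", "samples", "ploidy"),
                       np.zeros((3, 2, 2), dtype="i1"))}
)
counts = np.zeros((3, 2), dtype="i4")  # stand-in for a computed result

# 'call/allele_count' can't be a kwarg, so we unpack a dict,
# and the (dims, data) tuple form needs the dimension names spelled out
ds = ds.assign(**{"call/allele_count": (("variants", "samples"), counts)})
```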
If pulling up to the root namespace then the function name might need a little disambiguation, e.g., …
I may be wrongly conflating dosage and allele counting; I raised #21 to discuss, but happy to take advice on naming and scope. Re allele counting, the suggestion to have functions that can create 'call/allele_count' and 'variant/allele_count' sounds great. Would this be the same or different functions?
Another option is to always use `merge`:

```python
import xarray as xr

def count_alleles(ds):
    return ds['call/genotype'].sum(dim='ploidy').rename('call/allele_count')

ds = ds.merge(count_alleles(ds))
# now 'call/allele_count' is in ds
```

This would mean that our functions could return …
An additional note on that, it also appears to work if the dataset is indexed and the array/dataset to be appended isn't, i.e.:

```python
def count_alleles(ds):
    return ds['call/genotype'].sum(dim='ploidy').rename('call/allele_count')

cts = count_alleles(ds)  # this has no index
ds.set_index(variants=('variant/contig', 'variant/position')).merge(cts)
# cts is aligned correctly when appended
```

That's great because I thought we might have to worry about copying indexes into results.
(To disambiguate the different types of allele count, I'm going to assume for the moment the function is ….)

It's nice to know about `merge`, but I'm still wondering if there is a more obvious way to do it. As a user, I think I'd want to be able to do something like either:

…or:

The first option above would be very pandas-y, like assigning a new column to a dataframe; we use that kind of syntax often. Also, I'm often doing per-population allele counts, where I think I'd like to be able to do something like this:
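The snippet that followed here is not recoverable, but a hypothetical sketch of per-population counting with xarray's `groupby` might look like this (the `'sample/cohort'` variable name is my assumption):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "call/genotype": (
            ("variants", "samples", "ploidy"),
            np.array([[[0, 1], [1, 1], [0, 0], [0, 1]],
                      [[0, 0], [0, 1], [1, 1], [1, 1]]]),
        ),
        # hypothetical per-sample population assignment
        "sample/cohort": (("samples",), np.array(["a", "a", "b", "b"])),
    }
)

# Per-call alternate allele counts, then summed within each cohort
dosage = ds["call/genotype"].sum(dim="ploidy")
ac_by_pop = dosage.groupby(ds["sample/cohort"]).sum(dim="samples")
```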
Btw we should probably move this discussion to #3, as it's somewhat off the main topic of this PR :-)
FYI assignment is also fine, so if …
Is there anything else you guys would like me to rethink for this (cc: @tomwhite / @alimanfoo / @jeromekelleher)? I cleaned up a few things and tried to adapt to some conventions elsewhere. One wart left is that statsmodels throws a warning which I'm catching in code on import - maybe there's a better way to do that? I'm happy to change the exported name too. My minor gripe with …
@eric-czech in your example it would be:

```python
ds: SgkitDataset = sgkit_bgen.read_bgen(...)
# here the user can do whatever they like with the underlying dataset, including corrupting it,
# which we can detect in `from_dataset`; recreating the dataset should be cheap in comparison
# to the actual computation
ds = SgkitDataset.from_dataset(
    ds.dataset.sel(variant=ds.dataset['variant/contig'].isin(range(1, 23)))
)
# alternatively, more fluent:
ds = ds.with_dataset(lambda dataset: dataset.sel(variant=dataset['variant/contig'].isin(range(1, 23))))
ds = sgkit.ld_prune(ds)
```

So yes - the same thing as your example. This could be streamlined with a fluent API (which I'm not suggesting we should do at this point, but for reference):

```python
(sgkit_bgen
    .read_bgen(...)
    .with_dataset(lambda dataset: dataset.sel(variant=dataset['variant/contig'].isin(range(1, 23))))
    .ld_prune()
)
```

But filtering to autosomes should be relatively common, so we should have a function for that:

```python
ds: SgkitDataset = sgkit_bgen.read_bgen(...)
ds = sgkit.filter_autosomes(ds)
ds = sgkit.ld_prune(ds)
```

or a more complicated example:

```python
def my_custom_reusable_imputation_pipeline(ds: SgkitDataset) -> SgkitDataset:
    # my custom ds manipulation - strictly expects a valid SgkitDataset and is type safe
    ...

def my_custom_ld_pruning(ds: xr.Dataset) -> xr.Dataset:
    # my custom XR manipulation
    ...

ds: SgkitDataset = sgkit_bgen.read_bgen(...)
ds = sgkit.filter_autosomes(ds)
ds = my_custom_reusable_imputation_pipeline(ds)
ds = ds.with_dataset(my_custom_ld_pruning(ds.dataset))
ds = sgkit.pc_relate(ds, pcs)
```

I believe all these examples can be type safe, with code completion and data validation - which would be the benefits of encapsulation. WDYT?
Makes sense to me, though I think "type safe" will need to mean that the constructor validation is run on every call to ….

The majority of methods only need a part of what's in an SgkitDataset too, so I think requiring that the full contract be met would be quite limiting -- or at least that's something I've found to be pretty irksome about the standard encapsulation approach in other scientific libs we've used (especially in R). Can you see some way around this? That would be my biggest protest against a wrapper that just does validation.
Just piping in here - this is a great discussion. I don't have anything to add, but it's great to see the details being thought through here. |
I wonder if value-based checks should be made explicit - in either style of API - since they may not be cheap. User documentation could include them as a way of demonstrating good practice. Pipelines generally have QC upfront, so sprinkling some validation checks in there should sit quite naturally. Type, dimension and variable name checks on the other hand are very cheap, and should be done internally in every method call as precondition checks. Constructors should check all variables in a dataset, other methods should only check the fields they rely on. |
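As a sketch of the cheap structural checks described above (the function name and the idea of per-method field lists are illustrative, not an agreed API):

```python
import numpy as np
import xarray as xr

def check_preconditions(ds: xr.Dataset, variables=(), dims=()) -> None:
    # Cheap name/dimension precondition checks: each method validates
    # only the variables and dimensions it actually relies on
    missing_vars = [v for v in variables if v not in ds.variables]
    missing_dims = [d for d in dims if d not in ds.dims]
    if missing_vars or missing_dims:
        raise ValueError(
            f"Dataset is missing variables {missing_vars} and/or dimensions {missing_dims}"
        )

ds = xr.Dataset(
    {"call/genotype": (("variants", "samples", "ploidy"),
                       np.zeros((2, 2, 2), dtype="i1"))}
)
check_preconditions(ds, variables=["call/genotype"], dims=["ploidy"])  # passes silently
```

A value-based check (e.g. that allele indexes are non-negative) would be a separate, explicit call, since it has to touch the data.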
@eric-czech yep, the validation would need to run on each ctor/boxing, but we could mitigate that in a number of ways - flags to control the depth of the validation, or @tomwhite's suggestion (which I'm +1 to). Regarding partial …
Having more concrete classes to handle different cases makes sense to me, but I don't see how allowing a class to do a partial validation on whatever data was provided would work well. For example, how could you know what to validate here?

```python
type(ds)  # SgkitDataset
# This is OK, the first with_dataset can ignore 'variant/contig' on validation
ds.with_dataset(lambda ds: ds.drop('variant/contig')).with_dataset(sgkit.count_alleles)
# Not OK, this needs to fail in a validation somewhere since contig is not optional
ds.with_dataset(lambda ds: ds.drop('variant/contig')).with_dataset(sgkit.ld_prune)
```

I think you could drop down to Xarray, do all the mutations, and then box the type again with fields to check at the end with something like ….

I'm assuming by "more concrete classes" you mean there could be something like a "MinimalSgkitDataset" type that only enforces some fields, but once you start crossing all of the different inputs with GWAS operations and scikit-allel functions, the number of unique combinations of required fields is going to be pretty large (and hard to name concrete classes for). My concern is that it's easy to spec this out given only what we've attempted so far in sgkit, and that once we start growing the applications, we'll have overly general types because it's too hard to model them more granularly. To be dramatic about it (lol), I don't want to end up with an oppressive type system that we live with just because it's easy to say we should always err on the side of too much validation.

In the end I think it would be great if you wanted to propose a new data structures API, but I would request that it answer two questions: …
So I'm suggesting keeping it relatively simple and type safe at the beginning, and as we grow more use cases we can (re)evaluate whether we need more classes or maybe a change of tack. No solution will be a silver bullet, obviously. Here is how that would look right now in your example:

```python
ds_no_contig = ds.with_dataset(lambda ds: ds.drop('variant/contig'))
# This is OK, count_alleles can ignore 'variant/contig' on validation
sgkit.count_alleles(ds_no_contig)
# Not OK, this needs to fail in a validation somewhere since contig is not optional
sgkit.ld_prune(ds_no_contig)
```

Building upon @tomwhite's comment, we could have: …

A fair question would be how this is better than just doing checks on XRs passed into methods, and I would argue: …

I see it as a middle ground between an oppressive type system and flexibility, with potential for improvement as we build upon more use cases. And again, I don't want to block this PR (or anyone else); if anything, all the work should continue with the current assumptions and, if we want to evaluate this idea, we can make all the necessary changes later. If anyone feels strongly against this idea, we/I can drop it :) Otherwise I can also sketch out a PR with the API above, which would be a simple place to start. What do you think @eric-czech, @tomwhite?
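A minimal sketch of the wrapper being discussed, assuming constructor validation against a fixed required-variable list (the class and method names follow the examples above, but the `REQUIRED_VARIABLES` contract is my illustration, not settled API):

```python
import numpy as np
import xarray as xr

class SgkitDataset:
    REQUIRED_VARIABLES = ("call/genotype",)  # assumed contract for illustration

    def __init__(self, dataset: xr.Dataset):
        # Validation runs on every boxing/re-boxing of the underlying dataset
        missing = [v for v in self.REQUIRED_VARIABLES if v not in dataset.variables]
        if missing:
            raise ValueError(f"Dataset is missing required variables: {missing}")
        self.dataset = dataset

    @classmethod
    def from_dataset(cls, dataset: xr.Dataset) -> "SgkitDataset":
        return cls(dataset)

    def with_dataset(self, fn) -> "SgkitDataset":
        # Apply an arbitrary xarray transform, re-validating the result
        return SgkitDataset(fn(self.dataset))

ds = SgkitDataset.from_dataset(
    xr.Dataset({"call/genotype": (("variants", "samples", "ploidy"),
                                  np.zeros((2, 2, 2), dtype="i1"))})
)
```

With this shape, dropping a required variable fails as soon as the result is re-boxed, rather than deep inside a later method call.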
What kind of flags/methods are you imagining for that one? I like that decoupling the more I think about it -- that could be a nice way to deal with repetitive patterns in analysis without having to push it into class definitions. |
FWIW I feel that a wrapper needs to be useful and compelling enough to pull its weight. I'm +1 on @ravwojdyla sketching out something in a separate PR (that we may never adopt :), while this one is merged with the Xarray API. What still needs doing on this one @eric-czech? |
Nothing I think? Looking back through our conversations I don't think I missed any of the easier/concrete suggestions. I'm sure it will need some refactoring at some point but it shouldn't interfere with much else in the meantime. |
FYI I don't know why the status still shows that there are conflicts. I tried using the web editor to resolve them twice but it failed both times and told me to restart so I did it on a command line instead. I fixed the little conflict that there was and pushed the clean commit to this same branch so I'm not sure why those commits aren't showing up here or why github still thinks there are conflicts. Maybe using the web editor screwed something up? Has anybody else run into this before? |
@eric-czech I think that would likely be case/type-specific validation methods - like, say, ….

Also btw - currently GH says there are no conflicts, but the build has failed.
Ah I see, I guess it was just being slow. |
Hi folks, just to say I raised an issue (#43), thought it worth having a dedicated place for continuing discussion of use of classes and methods and related issues started here. I haven't put much of an issue description in yet, feel free to try and frame the problem concisely :-) |
Hey @ravwojdyla / @tomwhite, could one of you take a peek when you get a chance at my mypy fixes in pystatgen/sgkit@f2c0352 and officially approve if this is looking OK now? |
```python
def test_linear_regression_statistics(ds):
    def validate(dfp: DataFrame, dft: DataFrame) -> None:
        print(dfp)
```
Are these `print` stmts intentional?
I was planning to do that often with these method tests, since the prints are only shown when there's an error, and I don't often find smart diffs from assertion utils (e.g. np.testing) to be as good for debugging as simply seeing the datasets when they're small. I take it as a somewhat acceptable convention since I've seen it in Xarray/Dask tests, and I haven't seen any downsides other than that stdout is a mess if not captured by pytest (I can't think of why that would ever be necessary though).
This adds a method for calculating variant association statistics with a single continuous phenotype. This formulation is equivalent to standard OLS methods (lm in R, statsmodels in Python, etc.) but the lift here is in expressing this for a large number of individual regressions as matrix operations where some of the covariates are constant, and without ever requiring computation of an (n_samples x n_samples) matrix.
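To make the idea concrete, here is an independent sketch (not the PR's code) of how many regressions sharing the same covariates can be vectorized: project the covariates out of the phenotype and out of every variant column (the Frisch-Waugh-Lovell approach), after which each per-variant OLS reduces to a simple regression, with no (n_samples x n_samples) matrix ever formed.

```python
import numpy as np

def gwas_linear_regression(G, X, y):
    """Per-variant OLS of y on [X, g] for each column g of G, vectorized.

    G: (n_samples, n_variants) genotype dosages
    X: (n_samples, n_covariates) covariates, including an intercept column
    y: (n_samples,) continuous phenotype
    Returns (beta, t) for the variant term of each regression.
    """
    Q, _ = np.linalg.qr(X)
    # Residualize y and every variant against the covariates
    y_res = y - Q @ (Q.T @ y)
    G_res = G - Q @ (Q.T @ G)
    gtg = (G_res ** 2).sum(axis=0)
    # Slope of the simple regression of y_res on each residualized variant
    beta = (G_res * y_res[:, None]).sum(axis=0) / gtg
    dof = len(y) - X.shape[1] - 1
    rss = (y_res ** 2).sum() - beta ** 2 * gtg  # residual sum of squares per variant
    t = beta / np.sqrt(rss / dof / gtg)
    return beta, t
```

Each `beta` matches the variant coefficient from a full per-variant least-squares fit on `[X, g]`, but the whole scan costs only a few matrix products over the (n_samples x n_variants) block.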
I think validating this on real data won't be very difficult but I'm going to hold off on writing anything for it until some of the core parts of the code stabilize a bit. This should be a good example to help frame the discussion in https://github.com/pystatgen/sgkit/issues/11 more.
Basic usage would look something like this:
I created a `stats` subpackage and put my `association.py` module in it, but I don't know how we want to expose that -- `sgkit.stats.linear_regression` perhaps? Let me know if anybody has objections to that direction.