-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mean imputation function #892
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @tszfungc!
The overall structure looks fine to me. Hoping @jeromekelleher and @timothymillar can take a look too.
sgkit/stats/preprocessing.py
Outdated
Dataset containing the variable to be imputed. | ||
variable | ||
Input variable name | ||
``f"{variable}"`` and ``f"{variable}_masked"`` must be present in ``ds``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think f-strings work here?
Thanks for looking into this @tszfungc! I think this could be a great approach for imputing The values in |
Thanks for the review @tomwhite @timothymillar. I agree that the allele order doesn't have a particular meaning. The order along |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach basically looks good to me, but I'm not convinced about the general approach of creating new _imputed
variables. I would be simpler/better to just replace the missing data and reset the missingness mask in the returned dataset I think.
dim: Union[Hashable, Sequence[Hashable]] = "samples", | ||
merge: bool = True, | ||
) -> Dataset: | ||
"""Mean impute a masked variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to give a more descriptive follow up sentence here, like say
This replaces missing data for the specified variable with the mean of the non-missing values.
@@ -214,6 +214,15 @@ def _check_field( | |||
) | |||
) | |||
|
|||
call_dosage_imputed, call_dosage_imputed_spec = SgkitVariables.register_variable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to create a whole new bunch of variables here. Wouldn't it be simpler if we returned a copy of the original dataset in which all the missing data for the variable in question was replaced with the mean, and the mask was unset?
This would be more useful for downstream work, wouldn't it? We'd surely want to use the (say) imputed call_dosage
in downstream analyses, and we wouldn't want to need to change variable names in order to do this.
@jeromekelleher the trade-off between returning new variables or replacing existing variables was previously discussed in https://github.com/pystatgen/sgkit/pull/308#issuecomment-705706571. I personally have a slight preference for replacing existing variables but there are some good points raised in that discussion. The primary concern seems to be that replacing existing variables is effectively a mutate operation, which goes against the general pattern of treating arrays as immutable. |
I see, thanks. Hmm, not much choice other than to create a bunch of new variables then. |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
This PR has conflicts, @tszfungc please rebase and push updated version 🙏 |
ref #609
Add mean impute function for
call_dosage
,call_genotype
, andcall_genotype_probability