Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mean imputation function #892

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Conversation

tszfungc
Copy link

ref #609

Add mean impute function for call_dosage, call_genotype, and call_genotype_probability

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @tszfungc!

The overall structure looks fine to me. Hoping @jeromekelleher and @timothymillar can take a look too.

Dataset containing the variable to be imputed.
variable
Input variable name
``f"{variable}"`` and ``f"{variable}_masked"`` must be present in ``ds``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think f-strings work here?

@timothymillar
Copy link
Collaborator

Thanks for looking into this @tszfungc! I think this could be a great approach for imputing call_dosage and call_genotype_probability. However, I don't think it will produce the desired result for call_genotype.

The values in call_genotype are (potentially unsorted) alleles whose order along the ploidy dimension doesn't have any particular meaning. So, as far as I can tell, the mean of those alleles can't really be used for anything.

@tszfungc
Copy link
Author

Thanks for the review @tomwhite @timothymillar. I agree that the allele order doesn't have a particular meaning. The order along ploidy should be ignored by computing the mean along dim=['samples', 'ploidy'], But this is also an unusual use to me.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach basically looks good to me, but I'm not convinced about the general approach of creating new _imputed variables. I would be simpler/better to just replace the missing data and reset the missingness mask in the returned dataset I think.

dim: Union[Hashable, Sequence[Hashable]] = "samples",
merge: bool = True,
) -> Dataset:
"""Mean impute a masked variable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to give a more descriptive follow up sentence here, like say

This replaces missing data for the specified variable with the mean of the non-missing values.

@@ -214,6 +214,15 @@ def _check_field(
)
)

call_dosage_imputed, call_dosage_imputed_spec = SgkitVariables.register_variable(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to create a whole new bunch of variables here. Wouldn't it be simpler if we returned a copy of the original dataset in which all the missing data for the variable in question was replaced with the mean, and the mask was unset?

This would be more useful for downstream work, wouldn't it? We'd surely want to use the (say) imputed call_dosage in downstream analyses, and we wouldn't want to need to change variable names in order to do this.

@timothymillar
Copy link
Collaborator

@jeromekelleher the trade-off between returning new variables or replacing existing variables was previously discussed in https://github.com/pystatgen/sgkit/pull/308#issuecomment-705706571. I personally have a slight preference for replacing existing variables but there are some good points raised in that discussion. The primary concern seems to be that replacing existing variables is effectively a mutate operation, which goes against the general pattern of treating arrays as immutable.

@jeromekelleher
Copy link
Collaborator

I see, thanks. Hmm, not much choice other than to create a bunch of new variables then.

@mergify
Copy link
Contributor

mergify bot commented Mar 29, 2023

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Mar 29, 2023
@mergify mergify bot removed the conflict PR conflict label Sep 5, 2023
@mergify
Copy link
Contributor

mergify bot commented Sep 5, 2023

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 5, 2023
@mergify mergify bot removed the conflict PR conflict label Nov 13, 2023
Copy link
Contributor

mergify bot commented Nov 13, 2023

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Nov 13, 2023
@mergify mergify bot removed the conflict PR conflict label Feb 5, 2024
Copy link
Contributor

mergify bot commented Feb 5, 2024

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Feb 5, 2024
@mergify mergify bot removed the conflict PR conflict label Jun 19, 2024
Copy link
Contributor

mergify bot commented Jun 19, 2024

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Jun 19, 2024
@mergify mergify bot removed the conflict PR conflict label Jan 14, 2025
Copy link
Contributor

mergify bot commented Jan 14, 2025

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conflict PR conflict
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants