-
Notifications
You must be signed in to change notification settings - Fork 28
Update freq fields kept in each dataset prior to merge #740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
5da1f7e to
fd20c8a
Compare
ch-kr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple minor comments. I didn't run the process-gnomad version yet since that seems likely to change
| final_fields = ["freq", "histograms"] | ||
|
|
||
| if dataset == "gnomad": | ||
| logger.info("Dropping 'subsets' from gnomAD freq ht array annotations...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding a comment here to make sure it is tracked here as well: we chatted about this separately on slack and decided that it was easier to use the public genomes release HT as input rather than the private v4 genomes freq HT. this way, we'll maintain the same strata for the v5 gnomAD genomes as the v4 genomes release
|
|
||
| # Convert all int64 annotations in the freq struct to int32s for merging type | ||
| # compatibility. | ||
| ht = ht.annotate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this conversion also occurs in _merge_updated_frequency_fields; maybe it only needs to happen in this function, since this function gets run for both gnomad and aou?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct! than kyou
|
Back to you @ch-kr ! I updated to release and added the ploidy adjustment like we discussed. Test is here: https://console.cloud.google.com/dataproc/jobs/09bec68042a94bcdad6d8a75eddc2fc8/summary?region=us-central1&project=broad-mpg-gnomad&supportedpurview=project for gnomad if you wanted to check it out |
ch-kr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another minor comment and a question
| ) | ||
|
|
||
| # Update globals from updated table. | ||
| updated_globals = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry another small thing I just remembered for the v4 HT is that the age global annotation is incorrect (https://the-tgg.slack.com/archives/C06H5KM9W64/p1761665354397189), so we will want to fix that here
| ) | ||
|
|
||
| # Add adj annotation required by annotate_freq. | ||
| logger.info("Annotating adj...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add a note here that this is follows the same order as previous versions (annotating adj after splitting multi) but we'll need to move this annotation if we don't want to densify for freq calculations. or maybe we should add the use-all-sites-ans arg to this function to toggle behavior as needed?
This brings some logic I would put in the merge back into the processing of the first two datasets to limit the data we are storing. It also standardizes the fields and makes the merge simpler.