Skip to content

Conversation

@mike-w-wilson
Copy link
Contributor

@mike-w-wilson mike-w-wilson commented Dec 18, 2025

This brings some logic I would put in the merge back into the processing of the first two datasets to limit the data we are storing. It also standardizes the fields and makes the merge simpler.

@mike-w-wilson mike-w-wilson self-assigned this Dec 18, 2025
@mike-w-wilson mike-w-wilson requested a review from a team as a code owner December 18, 2025 14:54
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple minor comments. I didn't run the process-gnomad version yet since that seems likely to change

final_fields = ["freq", "histograms"]

if dataset == "gnomad":
logger.info("Dropping 'subsets' from gnomAD freq ht array annotations...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a comment here to make sure it is tracked here as well: we chatted about this separately on slack and decided that it was easier to use the public genomes release HT as input rather than the private v4 genomes freq HT. this way, we'll maintain the same strata for the v5 gnomAD genomes as the v4 genomes release


# Convert all int64 annotations in the freq struct to int32s for merging type
# compatibility.
ht = ht.annotate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this conversion also occurs in _merge_updated_frequency_fields; maybe it only needs to happen in this function, since this function gets run for both gnomad and aou?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! than kyou

@mike-w-wilson
Copy link
Contributor Author

mike-w-wilson commented Dec 18, 2025

Back to you @ch-kr ! I updated to release and added the ploidy adjustment like we discussed. Test is here: https://console.cloud.google.com/dataproc/jobs/09bec68042a94bcdad6d8a75eddc2fc8/summary?region=us-central1&project=broad-mpg-gnomad&supportedpurview=project for gnomad if you wanted to check it out

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another minor comment and a question

)

# Update globals from updated table.
updated_globals = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry another small thing I just remembered for the v4 HT is that the age global annotation is incorrect (https://the-tgg.slack.com/archives/C06H5KM9W64/p1761665354397189), so we will want to fix that here

)

# Add adj annotation required by annotate_freq.
logger.info("Annotating adj...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add a note here that this is follows the same order as previous versions (annotating adj after splitting multi) but we'll need to move this annotation if we don't want to densify for freq calculations. or maybe we should add the use-all-sites-ans arg to this function to toggle behavior as needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants