Update freq fields kept in each dataset prior to merge #740

mike-w-wilson · 2025-12-18T14:54:29Z

This moves some logic that I would otherwise put in the merge back into the processing of the first two datasets, which limits the data we store. It also standardizes the fields and makes the merge simpler.

ch-kr

just a couple minor comments. I didn't run the process-gnomad version yet since that seems likely to change

ch-kr · 2025-12-18T18:52:35Z

gnomad_qc/v5/annotations/generate_frequency.py

+    final_fields = ["freq", "histograms"]
+
+    if dataset == "gnomad":
+        logger.info("Dropping 'subsets' from gnomAD freq ht array annotations...")


adding a comment here so that it is tracked on the PR as well: we chatted about this separately on Slack and decided that it was easier to use the public genomes release HT as input rather than the private v4 genomes freq HT. this way, we'll maintain the same strata for the v5 gnomAD genomes as the v4 genomes release
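
For readers following the 'subsets' drop in the hunk above, here is a minimal plain-Python sketch of the idea: `freq` and `freq_meta` are parallel arrays, so removing strata means filtering both by the same indices. This is not the PR's code. The function name `drop_subset_strata`, the assumed metadata key `"subset"`, and the sample values are hypothetical, and plain lists stand in for the Hail array annotations the real function works on.

```python
from typing import Any, Dict, List, Tuple


def drop_subset_strata(
    freq_meta: List[Dict[str, str]],
    freq: List[Dict[str, Any]],
    drop_key: str = "subset",  # Assumed metadata key; not confirmed by the PR.
) -> Tuple[List[Dict[str, str]], List[Dict[str, Any]]]:
    """Drop every stratum whose metadata contains `drop_key`.

    `freq_meta[i]` describes `freq[i]`, so both lists are filtered with the
    same set of kept indices to stay aligned.
    """
    if len(freq_meta) != len(freq):
        raise ValueError("freq_meta and freq must have the same length")
    keep = [i for i, meta in enumerate(freq_meta) if drop_key not in meta]
    return [freq_meta[i] for i in keep], [freq[i] for i in keep]


# Hypothetical sample data for illustration only.
freq_meta = [
    {"group": "adj"},
    {"group": "raw"},
    {"group": "adj", "subset": "hgdp"},
    {"group": "adj", "gen_anc": "afr"},
]
freq = [
    {"AC": 10, "AN": 100},
    {"AC": 12, "AN": 110},
    {"AC": 1, "AN": 20},
    {"AC": 4, "AN": 40},
]

new_meta, new_freq = drop_subset_strata(freq_meta, freq)
```

Filtering by index rather than by value keeps the two arrays aligned even when two strata happen to have identical counts.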

ch-kr · 2025-12-18T19:20:11Z

gnomad_qc/v5/annotations/generate_frequency.py

+
+    # Convert all int64 annotations in the freq struct to int32s for merging type
+    # compatibility.
+    ht = ht.annotate(


this conversion also occurs in _merge_updated_frequency_fields; maybe it only needs to happen in this function, since this function gets run for both gnomad and aou?

Correct! Thank you
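
As an illustration of the point above (do the narrowing once, in the function both datasets pass through), here is a hedged plain-Python sketch. It is not the PR's implementation: `narrow_freq_counts`, the `COUNT_FIELDS` tuple, and the sample numbers are hypothetical, and because Python integers are untyped the "cast" here is only a range check. In the actual Hail code the cast would presumably use `hl.int32`.

```python
from typing import Any, Dict, List

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

# Assumed integer fields of each freq entry; adjust to the real struct.
COUNT_FIELDS = ("AC", "AN", "homozygote_count")


def narrow_freq_counts(freq: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `freq` whose count fields are verified to fit in int32.

    Raises OverflowError instead of silently truncating an out-of-range value.
    """
    narrowed = []
    for entry in freq:
        out = dict(entry)  # Copy so the caller's data is not mutated.
        for field in COUNT_FIELDS:
            value = out.get(field)
            if value is None:
                continue
            if not INT32_MIN <= value <= INT32_MAX:
                raise OverflowError(f"{field}={value} does not fit in int32")
            out[field] = int(value)
        narrowed.append(out)
    return narrowed


# Hypothetical inputs: both datasets go through the same shared helper, so the
# merge step does not need to repeat the conversion.
datasets = {
    "gnomad": [{"AC": 5, "AN": 152312, "homozygote_count": 0, "AF": 5 / 152312}],
    "aou": [{"AC": 9, "AN": 490000, "homozygote_count": 1, "AF": 9 / 490000}],
}
prepared = {name: narrow_freq_counts(freq) for name, freq in datasets.items()}
```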

mike-w-wilson · 2025-12-18T21:02:04Z

Back to you @ch-kr! I updated to use the release HT and added the ploidy adjustment like we discussed. The gnomAD test run is here, if you want to check it out: https://console.cloud.google.com/dataproc/jobs/09bec68042a94bcdad6d8a75eddc2fc8/summary?region=us-central1&project=broad-mpg-gnomad&supportedpurview=project

ch-kr

another minor comment and a question

ch-kr · 2025-12-18T21:30:42Z

gnomad_qc/v5/annotations/generate_frequency.py

-    )
+
    # Update globals from updated table.
    updated_globals = {}


sorry another small thing I just remembered for the v4 HT is that the age global annotation is incorrect (https://the-tgg.slack.com/archives/C06H5KM9W64/p1761665354397189), so we will want to fix that here

ch-kr · 2025-12-18T21:33:23Z

gnomad_qc/v5/annotations/generate_frequency.py

    )
-
-    # Add adj annotation required by annotate_freq.
+    logger.info("Annotating adj...")


maybe we should add a note here that this follows the same order as previous versions (annotating adj after splitting multi), but we'll need to move this annotation if we don't want to densify for freq calculations. or maybe we should add the use-all-sites-ans arg to this function to toggle behavior as needed?

mike-w-wilson self-assigned this Dec 18, 2025

mike-w-wilson requested a review from a team as a code owner December 18, 2025 14:54

mike-w-wilson added 8 commits December 18, 2025 10:39

Create select_final_dataset_freq_field function for reuse

26ed4cc

Add age distribution to aou globals

7f34b7a

Add env to group membership resource

8f26bca

Pass env for annotation resources

37229ce

List not a set to expand

048f5bc

Add downsamplings to aou globals

6eb6e1b

Add env to get_freq

bc832a6

Add naive coalesce to test gnomad freq

fd20c8a

mike-w-wilson force-pushed the mw/update_freq_fields_kept branch from 5da1f7e to fd20c8a Compare December 18, 2025 15:41

mike-w-wilson requested a review from ch-kr December 18, 2025 15:41

mike-w-wilson assigned ch-kr Dec 18, 2025

mike-w-wilson added 2 commits December 18, 2025 13:57

Replace v4_freq with release HT to avoid filtering v4 freq

4f48cea

Adjust aou sex ploidy for good measure

e5af956

ch-kr reviewed Dec 18, 2025

mike-w-wilson added 4 commits December 18, 2025 14:58

Move split to top so adj happens after

7f83984

Adding logging statements

f8c937a

Update order in prep aou

e167e65

Replace AD, whoops

2300307

ch-kr reviewed Dec 18, 2025
