Relatedness, no Dataproc #870

MattWellie · 2024-08-02T15:00:42Z

Concept PR, related to #869

Takes the Relatedness Stage of the large_cohort pipeline, and tries to remove it from Dataproc

The Relatedness Stage as it sits in main is a little convoluted:

One entrypoint (run) takes an MT and an HT as input, and writes two HTs as output
The PCRelate step runs first (uses one MT, makes one HT), and is skipped if the result HT already exists as a checkpoint
The Sample Flagging step runs next, taking the PCRelate HT and a SampleQC HT as input, and making another HT

This PR splits this into two Stages - RelatednessPCRelate and RelatednessFlag, each running the separate part of that larger process.

For each of these steps there's a separate script, and a Stage.
The script sets up a Hail runtime, runs the code, and writes the output locally.
The Stage wrapper creates a VM, copies the data into the VM, runs the script, and copies the output back.

This could all be one Stage, just as it's implemented at the moment, but breaking it up this way seems logical enough as the two methods create distinct output, and we keep both outputs separately in GCP. It also makes the interface easier, instead of having to copy multiple tables in/out of one job.
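
As a rough illustration of the copy-in, run, copy-out pattern described above, here is a minimal sketch that only builds the shell commands a Stage wrapper might hand to its VM job. Everything in it is an assumption made for illustration: the helper name `build_stage_commands`, the `/io` working directory, the `--mt` and `--out` script flags, the bucket paths, and the use of `gcloud storage cp -r` for the copies. It is not the code in this PR.

```python
import shlex


def basename(path: str) -> str:
    """Return the last path component, ignoring any trailing slash."""
    return path.rstrip('/').split('/')[-1]


def build_stage_commands(script, inputs, output_remote, workdir='/io'):
    """Build the copy-in, run, copy-out commands for one Stage job.

    `inputs` maps a (hypothetical) script flag to the remote table path.
    Returns a list of shell command strings, in execution order.
    """
    commands = []
    script_args = []

    # 1. Copy each remote input table onto the VM's local disk.
    for flag, remote in inputs.items():
        local = f'{workdir}/{basename(remote)}'
        commands.append(
            f'gcloud storage cp -r {shlex.quote(remote)} {shlex.quote(local)}'
        )
        script_args += [flag, shlex.quote(local)]

    # 2. Run the script against the local copies, writing output locally.
    local_out = f'{workdir}/{basename(output_remote)}'
    script_args += ['--out', shlex.quote(local_out)]
    commands.append(' '.join(['python3', shlex.quote(script), *script_args]))

    # 3. Copy the locally written output back to remote storage.
    commands.append(
        f'gcloud storage cp -r {shlex.quote(local_out)} {shlex.quote(output_remote)}'
    )
    return commands


if __name__ == '__main__':
    for cmd in build_stage_commands(
        'relatedness_pcrelate.py',
        {'--mt': 'gs://bucket/dense.mt'},
        'gs://bucket/relatedness.ht',
    ):
        print(cmd)
```

With a single input and a single output per job, as in the two-Stage split, each wrapper reduces to one copy in and one copy out, which is the interface simplification mentioned above.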


michael-harper · 2024-08-05T23:44:30Z

cpg_workflows/stages/large_cohort.py

+            tshirt_mt_sizing(
+                sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
+                cohort_size=len(cohort.get_sequencing_group_ids()),
+            )
+            * 2


Just tried kicking off a test batch which failed because tshirt_mt_sizing returns a string, so 50Gi * 2 becomes 50Gi50Gi 😅

Suggested change

-            tshirt_mt_sizing(
-                sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
-                cohort_size=len(cohort.get_sequencing_group_ids()),
-            )
-            * 2
+            t_shirt_size_value = tshirt_mt_sizing(
+                sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
+                cohort_size=len(cohort.get_sequencing_group_ids()),
+            ).split('Gi')[0]
+            required_storage_value = int(t_shirt_size_value) * 2
+            required_storage = f'{required_storage_value}Gi'
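
The failure above comes from Python's string repetition: multiplying a `str` by an integer repeats it rather than scaling the number inside it. A minimal standalone sketch of the same fix follows, assuming only that the sizing function returns strings shaped like `'50Gi'` (as reported in this comment). The helper name `scale_storage` is hypothetical and is not part of cpg_workflows.

```python
import re


def scale_storage(size: str, factor: int) -> str:
    """Multiply the numeric part of a storage string such as '50Gi'.

    Hypothetical helper: splits the string into an integer value and a
    unit suffix, scales the value, and reattaches the same unit.
    """
    match = re.fullmatch(r'(\d+)([A-Za-z]+)', size.strip())
    if match is None:
        raise ValueError(f'Unrecognised storage string: {size!r}')
    value, unit = match.groups()
    return f'{int(value) * factor}{unit}'


if __name__ == '__main__':
    # String repetition, which is what caused the failed batch.
    print('50Gi' * 2)
    # Numeric scaling, which is what was intended.
    print(scale_storage('50Gi', 2))
```

Parsing the unit instead of splitting on `'Gi'` keeps the helper working if the sizing function ever returns a different suffix, and raises a clear error on anything it cannot parse.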


michael-harper · 2024-08-06T02:29:37Z

Ran some tests using tenk10k data in bioheart-test. The results seem to be the same between DP and no-DP.
Batches: Relatedness PC Relate and Relatedness Flag (noting that metamist registration did not work)

MattWellie · 2024-08-06T04:19:22Z

metamist registration did not work

I assumed a 'mt' analysisType existed, but... maybe it doesn't. That seems to be the foreign key error it's failing on

MattWellie added 3 commits August 2, 2024 14:27

remove Relatedness from DataProc

4bdc06f

LINT

245a9a0

Bump version: 1.25.27 → 1.26.0

ce1b59b

MattWellie requested review from katiedelange, KatalinaBobowik, vivbak and michael-harper August 2, 2024 15:00

MattWellie added 2 commits August 5, 2024 08:20

entrypoint correction

5b504d4

Merge branch 'main' into relatedness_no_dp

e36af9d

MattWellie added the large cohort label (Change is exclusively within the scope of the large cohort pipeline) Aug 5, 2024

michael-harper added 2 commits August 6, 2024 08:39

Checking what storage is requested

b842962

thisrt_mt_sizing returns a string, so you can't multiply by 'x' to increase storage

0714529

michael-harper reviewed Aug 5, 2024

MattWellie and others added 4 commits August 6, 2024 08:37

make scripts a module

51c25ed

Fixing storage request in RelatednessFlag

8e60c7f

Merge branch 'relatedness_no_dp' of https://github.com/populationgenomics/production-pipelines into relatedness_no_dp

e3c0106

Correcting path to sample_qc instead of relatedness ht

58f11ec

MattWellie added 2 commits August 6, 2024 12:19

Merge branch 'main' into relatedness_no_dp

dfdf99a

version methods for all Stages

aa0911f

MattWellie mentioned this pull request Aug 7, 2024

When is a DataProc not a DataProc? #869

Closed