Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relatedness, no Dataproc #870

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Relatedness, no Dataproc #870

wants to merge 13 commits into from

Conversation

MattWellie
Copy link
Contributor

@MattWellie MattWellie commented Aug 2, 2024

Concept PR, related to #869

Takes the Relatedness Stage of the large_cohort pipeline, and tries to remove it from Dataproc

The Relatedness Stage as it sits in main is a little convoluted

  • one entrypoint (run) takes a MT and a HT as input, and makes 2 HTs as output
  • The PCRelate stage runs first (uses one MT, makes one HT), and is optionally skipped if the result HT already exists as a checkpoint (here)
  • The Sample Flagging stage runs next, taking the PCRelate HT and a SampleQC HT as input, making another HT

This PR splits this into two Stages - RelatednessPCRelate and RelatednessFlag, each running the separate part of that larger process.

For each of these steps there's a separate script, and a Stage.
The script sets up a Hail runtime, runs the code, and writes the output locally
The Stage wrapper creates a VM, copys the data into the VM, runs the script, and copys the output back.

This could all be one Stage, just as it's implemented at the moment, but breaking it up this way seems logical enough as the two methods create distinct output, and we keep both outputs separately in GCP. It also makes the interface easier, instead of having to copy multiple tables in/out of one job.

@MattWellie MattWellie added the large cohort Change is exclusively within the scope of the large cohort pipeline label Aug 5, 2024
Comment on lines 155 to 159
tshirt_mt_sizing(
sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
cohort_size=len(cohort.get_sequencing_group_ids()),
)
* 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried kicking off a test batch which failed because tshirt_mt_sizing returns a string, so 50Gi * 2 becomes 50Gi50Gi 😅

Suggested change
tshirt_mt_sizing(
sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
cohort_size=len(cohort.get_sequencing_group_ids()),
)
* 2
t_shirt_size_value = tshirt_mt_sizing(
sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
cohort_size=len(cohort.get_sequencing_group_ids()),
).split('Gi')[0]
required_storage_value = int(t_shirt_size_value) * 2
required_storage = f'{required_storage_value}Gi'

@michael-harper
Copy link
Contributor

Ran some tests using tenk10k data in bioheart-test. The results seem to be the same between DP and no-DP
Relatedness PC Relate batch and Relatedness Flag (noting that metamist registration did not work)
Screenshot 2024-08-06 at 12 27 55 PM

@MattWellie
Copy link
Contributor Author

metamist registration did not work

I assumed a 'mt' analysisType existed, but... maybe it doesn't. That seems to be the foreign key error it's failing on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
large cohort Change is exclusively within the scope of the large cohort pipeline
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants