Relatedness, no Dataproc #870
base: main
Conversation
cpg_workflows/stages/large_cohort.py
Outdated
```python
tshirt_mt_sizing(
    sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
    cohort_size=len(cohort.get_sequencing_group_ids()),
)
* 2
```
Just tried kicking off a test batch, which failed because `tshirt_mt_sizing` returns a string, so `'50Gi' * 2` becomes `'50Gi50Gi'` 😅
Suggested change:

```diff
-tshirt_mt_sizing(
-    sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
-    cohort_size=len(cohort.get_sequencing_group_ids()),
-)
-* 2
+t_shirt_size_value = tshirt_mt_sizing(
+    sequencing_type=config_retrieve(['workflow', 'sequencing_type']),
+    cohort_size=len(cohort.get_sequencing_group_ids()),
+).split('Gi')[0]
+required_storage_value = int(t_shirt_size_value) * 2
+required_storage = f'{required_storage_value}Gi'
```
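The failure mode is plain Python string repetition: multiplying a `str` by an `int` repeats it. A minimal sketch of the bug and the suggested fix (the `tshirt_mt_sizing` stub below is a stand-in that just returns `'50Gi'`, not the real helper):

```python
def tshirt_mt_sizing(sequencing_type: str, cohort_size: int) -> str:
    """Stand-in for the real helper, which returns a size string like '50Gi'."""
    return '50Gi'


size = tshirt_mt_sizing('genome', 100)

# Multiplying the string repeats it rather than doubling the quantity
print(size * 2)  # 50Gi50Gi

# The suggested fix: parse out the number, double it, re-append the unit
doubled = int(size.split('Gi')[0]) * 2
print(f'{doubled}Gi')  # 100Gi
```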
Ran some tests using
I assumed an `mt` analysisType existed, but... maybe it doesn't. That seems to be the foreign-key error it's failing on.
Concept PR, related to #869
Takes the `Relatedness` Stage of the `large_cohort` pipeline, and tries to remove it from Dataproc.

The `Relatedness` Stage as it sits in `main` is a little convoluted. This PR splits it into two Stages - `RelatednessPCRelate` and `RelatednessFlag` - each running a separate part of that larger process.

For each of these steps there's a separate script, and a Stage.
- The script sets up a Hail runtime, runs the code, and writes the output locally.
- The Stage wrapper creates a VM, copies the data into the VM, runs the script, and copies the output back.
This could all be one Stage, just as it's implemented at the moment, but breaking it up this way seems logical enough: the two methods create distinct outputs, and we keep both outputs separately in GCP. It also makes the interface easier, since we avoid copying multiple tables in and out of one job.
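As a rough sketch of the shape described above (all class names, paths, and signatures here are illustrative assumptions, not the real `cpg_workflows` API):

```python
from dataclasses import dataclass


@dataclass
class StageOutput:
    """Illustrative: a path to a table written back out to GCP."""
    path: str


class RelatednessPCRelate:
    """First stage: runs the PCRelate part and writes the kinship table."""

    def run(self, dense_mt_path: str) -> StageOutput:
        # In the real Stage wrapper: create a VM, copy the input in, run the
        # script (which sets up a Hail runtime and writes output locally),
        # then copy the result back out to GCP.
        return StageOutput(path='gs://my-bucket/relatedness/relatedness.ht')


class RelatednessFlag:
    """Second stage: flags related samples using the PCRelate output."""

    def run(self, relatedness: StageOutput) -> StageOutput:
        # Same VM-wrapper pattern, with only the one upstream table to copy
        # in and the one flagged-samples table to copy out.
        return StageOutput(path='gs://my-bucket/relatedness/relateds_to_drop.ht')


# Each Stage produces a single, distinct output kept separately in GCP
kinship = RelatednessPCRelate().run('gs://my-bucket/dense.mt')
flags = RelatednessFlag().run(kinship)
```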