Description
It appears that this function does not scale well when run on a cluster.
Notes from my most recent attempt:
- The code I ran is here: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/4f862e31b8093d25fdaa8da7f841b9be8583cda4/scripts/gwas.py#L268
- The same code works on a dataset with fewer variants (chr XY, which has ~8k variants, compared to ~141k for chr 21)
- The operation utilizes only one or two workers in a cluster of 20 n1-highmem-8 instances:
  (Screenshot: CPU utilization across worker VMs)
- Drilling in on one of the workers that is running all the tasks, the only task I see that is not obviously parallelizable is "solve-triangular".
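
For context on why "solve-triangular" stands out as not obviously parallelizable: a triangular solve is a substitution loop in which each unknown depends on the ones already solved. Below is a minimal pure-Python sketch of forward substitution for a lower-triangular system. It is not the Dask implementation; `forward_substitution` and the example values are hypothetical and chosen only for illustration.

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L given as a list of rows."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # x[i] needs every previously solved x[j] with j < i, so the rows
        # must be processed in order; this is the sequential dependency.
        partial = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - partial) / L[i][i]
    return x


# Illustrative 3x3 system (made-up values, not from the GWAS data).
L = [
    [2.0, 0.0, 0.0],
    [1.0, 3.0, 0.0],
    [4.0, 5.0, 6.0],
]
b = [2.0, 7.0, 32.0]
x = forward_substitution(L, b)
```

A blocked solve has the same shape, with each block row waiting on the block rows before it, so some serialization is expected there. That alone would not explain why the remaining tasks also land on only one or two workers.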
The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".
Several of the workers show log messages like this:
distributed.worker - INFO - Dependent not found: ('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler
Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.
I will rerun this on a smaller dataset that did not fail, so that I can capture a performance report and a task graph screenshot (the task graph can't be captured for this dataset because the UI won't render so many nodes).