
Identify lack of scalability in gwas_linear_regression #390

Open
@eric-czech


It appears that gwas_linear_regression does not scale well when run on a cluster.

Notes from my most recent attempt:

CPU utilization across worker VMs
[Screenshot, 2020-11-17 1:10:08 PM]

Status Page
[Screenshot, 2020-11-17 12:35:39 PM]

  • Drilling into one of the workers that is running all the tasks, I see that the only task it is running that is not obviously parallelizable is "solve-triangular":

[Screenshot, 2020-11-17 1:36:57 PM]
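
For context on why a triangular solve is not obviously parallelizable, here is a minimal pure-Python sketch of back substitution. It is not the Dask implementation, and the function name and sample matrix are made up for illustration. The point is only that each unknown depends on the ones solved before it, so the loop cannot simply be split across workers. I am assuming a blockwise solve has a similar dependency chain between blocks.

```python
def back_substitute(upper, rhs):
    """Solve upper @ x = rhs for x, where upper is a square upper-triangular matrix."""
    n = len(rhs)
    x = [0.0] * n
    # Walk from the last row to the first: row i needs x[i+1:] to be known already,
    # which is the sequential dependency that limits parallelism.
    for i in range(n - 1, -1, -1):
        known = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rhs[i] - known) / upper[i][i]
    return x


# Illustrative 3x3 system (values chosen by hand, not taken from the GWAS data).
upper = [
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 2.0],
    [0.0, 0.0, 4.0],
]
rhs = [9.0, 13.0, 8.0]
solution = back_substitute(upper, rhs)
print(solution)
```

With several right-hand sides, the columns are independent of each other, so some parallelism across columns remains even though each individual solve is sequential.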

Full Task List


The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".

Several of the workers show log messages like this:

distributed.worker - INFO - Dependent not found: ('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler

Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.
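
To check whether the missing keys cluster around particular tasks or chunks, the worker logs could be tallied with something like the sketch below. It uses only the standard library. The regular expression is an assumption based on the single log line quoted above, the function name is hypothetical, and it only handles tuple keys like the one shown.

```python
import ast
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this issue; other versions of
# distributed may format the message differently.
PATTERN = re.compile(r"Dependent not found: (\(.*\)) (\d+)\s*\.\s*Asking scheduler")


def missing_dependents(log_lines):
    """Count how often each task key is reported as a missing dependent."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            # The key is printed as a Python tuple literal, so it can be parsed safely.
            key = ast.literal_eval(match.group(1))
            counts[key] += 1
    return counts


sample = [
    "distributed.worker - INFO - Dependent not found: "
    "('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler",
    "an unrelated line that should be ignored",  # placeholder, not a real log message
]
counts = missing_dependents(sample)
for key, n in counts.items():
    print(key, n)
```

If the counts concentrate on keys that lived on a single worker, that would be consistent with one node becoming unreachable.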

I will run this again on a smaller dataset, one that did not fail, to get a performance report and a task graph screenshot (the task graph doesn't work on this dataset because the UI won't render so many nodes).
