
Performance regressions after queuing PR #295

Open · opened by @gjoseph92

dask/distributed#6614 causes a significant performance improvement to test_vorticity:
[benchmark plot: test_vorticity runtime, before vs. after the queuing PR]

And a significant performance regression to test_dataframe_align:
[benchmark plot: test_dataframe_align runtime, before vs. after the queuing PR]

I think the key change is fixing dask/distributed#6597.

test_vorticity comes from dask/distributed#6571. This workload is the one where we discovered the co-assignment bug that's now fixed. So it's encouraging (and not surprising) that improving co-assignment significantly reduces transfers, and improves performance.

test_dataframe_align is a bit surprising. You'd think that assigning matching partitions of the two input dataframes to the same worker would reduce downstream transfers, which it indeed does, yet the workload still gets slower overall.
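For context, here's a rough sketch of the kind of workload this is, reconstructed from the task prefixes visible on the dashboard (make-timeseries, repartition-split/merge, sub, dataframe-count/sum); the sizes, frequencies, and dtypes below are placeholders, not the benchmark's actual parameters:

```python
import dask.datasets

# Two timeseries dataframes with mismatched partitioning, so the binary op
# below has to align them first (repartition-split / repartition-merge tasks).
df = dask.datasets.timeseries(
    start="2020-01-01", end="2020-02-01",
    freq="600ms", partition_freq="12h",
    dtypes={"x": float, "y": float},
)
df2 = dask.datasets.timeseries(
    start="2020-01-01", end="2020-02-01",
    freq="600ms", partition_freq="24h",
    dtypes={"x": float, "y": float},
)

# Elementwise subtraction of the aligned partitions ("sub" tasks), then a mean,
# which shows up as the dataframe-count / dataframe-sum reductions.
result = (df2 - df).mean()
result.compute()
```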

Worth noting: in my original benchmarks, test_dataframe_align was probably the most affected by root task overproduction out of everything I tested.

I ran it manually a few times both before and after the queuing PR. I also tried turning work stealing off, but it didn't make a difference.
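For anyone reproducing this: I assume turning work stealing off came down to the standard config key (or the matching environment variable), set before the scheduler starts. A minimal sketch:

```python
import dask

# Disable work stealing; this needs to be set before the scheduler is created.
dask.config.set({"distributed.scheduler.work-stealing": False})
```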

Before

[dashboard GIF: before the queuing PR]

After

[dashboard GIF: after the queuing PR]

If you stare at these GIFs for a while, you can notice that:

  1. In "after", progress bars 3, 4, and 5 take longer to start, then accelerate faster. In "before", it feels like there's more of a linear lag between the top two bars and the next three.
  2. The memory usage peaks higher "after" (we already knew this from the benchmark)

Here's a comparison between the two dashboards at the same point through the graph (785 make-timeseries tasks complete):

[side-by-side dashboard screenshots, before vs. after, at 785 make-timeseries tasks complete]

This is also a critical point, because it's when we start spilling to disk in the "after" case. The "before" case never spills to disk.

You can see that:

  1. There were no transfers in the "after" case. Before, we had lots of transfers. So improved co-assignment is working correctly.
  2. There are way more repartition-merge tasks in memory (top progress bar) in the "after" case. Nearly every executed task is in memory. Compare that to the before case, where half have already been released from memory.
  3. ...because way fewer sub, dataframe-count, and dataframe-sum tasks have executed. These are the data-consuming tasks.
    1. A bunch of subs have just been scheduled. A moment before, none of those were in processing. So sadly, they arrive just a moment too late to prevent the workers from spilling to disk.

The simplest hypothesis is that the repartition-merges are completing way faster now, since they don't have to transfer data. Maybe that increase in speed gives them the chance to run further ahead of the scheduler before it can submit the sub tasks? This pushes the workers over the spill threshold, so then everything slows down.
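For reference on what "spill threshold" means here: the relevant knobs are distributed's worker memory fractions. A quick way to check what's in effect (assuming the benchmark runs with the stock values, which is an assumption on my part):

```python
import dask
import distributed  # noqa: F401  (importing registers distributed's config defaults)

# Worker memory-management fractions of memory_limit
# (stock defaults are 0.6 / 0.7 / 0.8 / 0.95).
for name in ("target", "spill", "pause", "terminate"):
    print(name, dask.config.get(f"distributed.worker.memory.{name}"))
```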

I know I tend to blame things on root task overproduction, so I'm trying hard to come up with some other explanation here. But it does feel a bit like, because there's no data transfer holding the tasks back anymore, they are able to complete faster than the scheduler is able to schedule data-consuming tasks.

What's confusing is that repartition-merge isn't a root task—and we see relatively few make-timeseries or repartition-split tasks hanging around in memory. So why does scheduling the subs lag so much further behind now?
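One way to pin that down (a sketch poking at scheduler internals, which are not a stable API, so attribute names may differ between versions): periodically snapshot how many tasks of each prefix are in each scheduler-side state, and watch whether sub sits unscheduled while repartition-merge results accumulate in memory.

```python
from collections import Counter

def tasks_by_prefix_and_state(dask_scheduler):
    """Tally scheduler-side task states per task prefix (internal API)."""
    counts = Counter()
    for ts in dask_scheduler.tasks.values():
        counts[ts.prefix.name, ts.state] += 1
    return dict(counts)

# Hypothetical usage: poll this while the workload runs, e.g.
#   client.run_on_scheduler(tasks_by_prefix_and_state)
# and compare how far ahead "repartition-merge" (in memory) is of "sub" (processing).
```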
