Description
dask/distributed#6614 causes a significant performance improvement to `test_vorticity`:

And a significant performance regression to `test_dataframe_align`:
I think the key change is fixing dask/distributed#6597.
`test_vorticity` comes from dask/distributed#6571. This workload is the one where we discovered the co-assignment bug that's now fixed. So it's encouraging (and not surprising) that improving co-assignment significantly reduces transfers and improves performance.
`test_dataframe_align` is a bit surprising. You'd think that assigning matching partitions of the two input dataframes to the same worker would reduce downstream transfers, which it indeed does.
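For context, the shape of the workload is: align and subtract two timeseries dataframes with mismatched partitioning, then reduce. Something along these lines (a minimal sketch; the sizes, column count, and partitioning here are illustrative, not the benchmark's actual parameters):

```python
import dask
import dask.datasets

# All-float columns so the element-wise subtraction below is well-defined.
dtypes = {f"x{i}": float for i in range(10)}

# Two timeseries with different partitioning, so aligning them forces a
# repartition (the `repartition-split` / `repartition-merge` tasks).
df = dask.datasets.timeseries(
    start="2021-01-01", end="2021-01-31", freq="600ms", partition_freq="12h", dtypes=dtypes
)
df2 = dask.datasets.timeseries(
    start="2021-01-01", end="2021-01-31", freq="600ms", partition_freq="24h", dtypes=dtypes
)

# Aligned element-wise subtraction (the `sub` tasks) followed by a mean,
# which is what produces the `dataframe-count` / `dataframe-sum` reductions.
result = (df2 - df).mean()
result.compute()
```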
Worth noting: in my original benchmarks, `test_dataframe_align` was probably the most affected by root task overproduction out of everything I tested:
I ran it manually a few times both before and after the queuing PR. I also tried turning work stealing off, but it didn't make a difference.
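(Turning stealing off was just the standard scheduler config flag, roughly like this; a minimal sketch with an illustrative `LocalCluster` rather than the actual benchmark cluster:)

```python
import dask
from distributed import Client, LocalCluster

# Work stealing is a scheduler-level setting, so it needs to be in place
# before the scheduler starts.
dask.config.set({"distributed.scheduler.work-stealing": False})

cluster = LocalCluster(n_workers=4)  # worker count is illustrative
client = Client(cluster)
```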
Before
After
If you stare at these GIFs for a while, you can notice that:
- In "after", progress bars 3, 4, and 5 take longer to start, then accelerate faster. In "before", it feels like there's more of a linear lag between the top two bars and the next three.
- The memory usage peaks higher in "after" (we already knew this from the benchmark).
Here's a comparison between the two dashboards at the same point through the graph (785 `make-timeseries` tasks complete):
This is also a critical point, because it's when we start spilling to disk in the "after" case. The "before" case never spills to disk.
You can see that:
- There were no transfers in the "after" case. Before, we had lots of transfers. So improved co-assignment is working correctly.
- There are way more `repartition-merge` tasks in memory (top progress bar) in the "after" case. Nearly every executed task is in memory. Compare that to the before case, where half have already been released from memory.
- ...because there are way fewer `sub`, `dataframe-count`, and `dataframe-sum` executed. These are the data-consuming tasks.
- A bunch of `sub`s have just been scheduled. A moment before, none of those were in processing. So sadly, they arrive just a moment too late to prevent the workers from spilling to disk.
The simplest hypothesis is that the `repartition-merge`s are completing way faster now, since they don't have to transfer data. Maybe that increase in speed gives them the chance to run further ahead of the scheduler before it can submit the `sub` tasks? This pushes the workers over the spill threshold, so then everything slows down.
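(To be concrete about "spill threshold": I mean the standard worker memory fractions. A quick sketch, with what I believe are the default values:)

```python
import dask

# Worker memory thresholds, as fractions of the worker's memory limit.
# The numbers below are the usual defaults, quoted from memory.
dask.config.set(
    {
        "distributed.worker.memory.target": 0.60,  # start spilling LRU data to disk
        "distributed.worker.memory.spill": 0.70,   # spill based on process memory usage
        "distributed.worker.memory.pause": 0.80,   # pause executing new tasks
    }
)
```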
I know I tend to blame things on root task overproduction, so I'm trying hard to come up with some other explanation here. But it does feel a bit like, because there's no data transfer holding the tasks back anymore, they are able to complete faster than the scheduler is able to schedule data-consuming tasks.
What's confusing is just that `repartition-merge` isn't a root task, and we see relatively few `make-timeseries` or `repartition-split`s hanging around in memory. So why is it that scheduling `sub`s lags behind more?