Better instrumentation for Worker.gather_dep #7217

@fjetter

Task queuing has been proven to significantly improve performance by reducing root task overproduction.

In recent benchmarks and tests I noticed that one major source of root task overproduction is not necessarily that reducers are assigned to the workers too slowly, but that the workers are unable to run these tasks because they need to fetch dependencies first. If the average root task runtime is much shorter than the time it takes to fetch dependencies, a worker can run many data producers before it has the chance to run a reducer.

Right now, we're almost blind to this situation but could be exposing much better metrics on the dashboard (or Prometheus).

Specifically, I'm interested in:

  • How much time do tasks spend in the ready queue before they are worked on? Can we calculate averages on TaskGroups/Prefix? TaskGroups per Worker?
  • How much time do tasks spend in any given state, e.g. fetch? In general, how long are wait times in our queues?
  • How long do gather_dep requests typically last, broken down per TaskGroup/Prefix?
  • How much of the time of a gather_dep request are we spending on
    • Connection establishment (e.g. connection pool empty, remote event loop is blocked, handshake takes a while)
    • Data (de-)serialization
    • Spill-to-disk
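
To make the first two questions concrete, here is a minimal sketch of how time-in-state could be accumulated per prefix. Everything in it is hypothetical: `StateDurations`, `transition` and `mean` are illustrative names, not existing Worker APIs, and the prefix rule (text before the first "-") is a simplification rather than the real key-splitting logic.

```python
from collections import defaultdict


class StateDurations:
    """Accumulate seconds spent per (prefix, state). Hypothetical helper."""

    def __init__(self):
        self._current = {}                # key -> (state, time entered)
        self.totals = defaultdict(float)  # (prefix, state) -> total seconds
        self.counts = defaultdict(int)    # (prefix, state) -> completed visits

    @staticmethod
    def prefix_of(key):
        # Simplifying assumption: the prefix is the key text before the first "-".
        return key.split("-", 1)[0]

    def transition(self, key, new_state, now):
        """Record that `key` entered `new_state` at time `now` (seconds)."""
        previous = self._current.get(key)
        if previous is not None:
            state, entered_at = previous
            bucket = (self.prefix_of(key), state)
            self.totals[bucket] += now - entered_at
            self.counts[bucket] += 1
        self._current[key] = (new_state, now)

    def mean(self, prefix, state):
        """Average seconds per completed visit; 0.0 if nothing was recorded."""
        bucket = (prefix, state)
        if not self.counts[bucket]:
            return 0.0
        return self.totals[bucket] / self.counts[bucket]


# Two tasks of the same prefix moving through fetch -> ready -> executing.
tracker = StateDurations()
tracker.transition("sum-1", "fetch", now=0.0)
tracker.transition("sum-2", "fetch", now=1.0)
tracker.transition("sum-2", "ready", now=5.0)
tracker.transition("sum-1", "ready", now=8.0)
tracker.transition("sum-1", "executing", now=10.0)

print(tracker.mean("sum", "fetch"))  # average over both tasks
print(tracker.mean("sum", "ready"))  # only sum-1 has left "ready" so far
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real implementation would presumably take it from whatever clock the state machine already uses.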

Ideally, I would love to get data for a Task X with dependencies deps that tells me

X spent 10s in ready queue
-> 8s spent fetching data
  -> 1s connection
  -> 2s spill-to-disk
  -> 2s (de-)serialization
  -> 2s network transfer
  -> 1s idle / event loop busy
-> 2s spent waiting for open slot on the ThreadPool
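
As a rough sketch of the data structure behind such a report, the breakdown above could be modelled as a tree of timed spans. `Span`, `unaccounted` and `render` are made-up names for illustration only, and the numbers are the example values from above, not measurements.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed phase, optionally broken down into sub-phases (hypothetical)."""

    label: str
    seconds: float
    children: list = field(default_factory=list)

    def unaccounted(self):
        """Seconds of this span not explained by its children."""
        if not self.children:
            return 0.0
        return self.seconds - sum(child.seconds for child in self.children)

    def render(self, depth=0):
        """Return the breakdown as indented lines, one per span."""
        indent = "  " * (depth - 1) + "-> " if depth else ""
        lines = [f"{indent}{self.label}: {self.seconds:g}s"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


report = Span("X in ready queue", 10, [
    Span("fetching data", 8, [
        Span("connection", 1),
        Span("spill-to-disk", 2),
        Span("(de-)serialization", 2),
        Span("network transfer", 2),
        Span("idle / event loop busy", 1),
    ]),
    Span("waiting for open slot on the ThreadPool", 2),
])

print("\n".join(report.render()))
```

A non-zero `unaccounted()` on any node would point at time we are not yet instrumenting.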

Some of this information is already available; other information we still need to collect. I don't think we have anything that can break it down this way and/or group it by TaskGroups or individual tasks.
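
For the parts we still need to collect, one possible approach is a small context manager wrapped around each phase of a request. This is only a sketch under assumptions: `PhaseTimer` and the phase names are hypothetical, and where exactly such timers would sit inside gather_dep is left open.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                 # injectable so tests can fake time
        self.seconds = defaultdict(float)  # phase name -> total seconds

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.seconds[name] += self.clock() - start


# Demonstration with a fake clock so the result does not depend on real time.
ticks = iter([0.0, 1.0, 1.0, 3.0])
timer = PhaseTimer(clock=lambda: next(ticks))

with timer.phase("connection"):
    pass  # e.g. connecting to the remote worker
with timer.phase("deserialization"):
    pass  # e.g. decoding the received payload

print(dict(timer.seconds))
```

Recording in a `finally` block means failed requests still contribute their elapsed time, which matters if slow connection attempts often end in errors.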

I think this kind of visibility would help us significantly with making decisions about optimizations, e.g. should we prioritize STA? Should we focus on getting a sendfile implementation up and running? Do connection attempts take way too long because event loops are blocked?
