Total CPU % on /workers tab makes little sense #8490

Open · crusaderky opened this issue Feb 3, 2024 · 2 comments
Labels: dashboard, diagnostics, good first issue

Comments

crusaderky (Collaborator) commented Feb 3, 2024

From https://dask.discourse.group/t/dask-worker-using-600-of-the-cpu/2489/2

The CPU % on each individual worker scales from 0 to nthreads*100; e.g. on a worker with 8 threads it can go from 0% to 800%. This is consistent with several other CPU monitors in the wild, so it makes sense.

The CPU% on the Total row, however, is calculated as:

    elif name == "cpu":
        total_data = (
            sum(ws.metrics["cpu"] for ws in self.scheduler.workers.values())
            / 100
            / len(self.scheduler.workers.values())
        )

So, for example, on a cluster with 2 workers and 8 threads per worker, if one worker is flat-out busy while the other is idle, the Total will read 400%, which makes very little sense:

name        nthreads  cpu
Total (2)   16        400%
tcp://...   8         800%
tcp://...   8         0%
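
Plugging the example's numbers into that expression makes the mismatch explicit:

    # Values from the table above: 2 workers, 8 threads each,
    # one flat out (800%) and one idle (0%).
    cpu_metrics = [800.0, 0.0]
    total_data = sum(cpu_metrics) / 100 / len(cpu_metrics)  # 4.0, rendered as 400%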

I think we should change the CPU% on each worker to go from 0 to 100%, and the Total row to do the same (total CPU usage across the cluster divided by the total number of threads).
In the above example, that would become:

name        nthreads  cpu
Total (2)   16        50%
tcp://...   8         100%
tcp://...   8         0%
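
A minimal sketch of the adjusted Total branch, assuming ws.nthreads holds each worker's thread count (the per-worker rows would need the matching division by nthreads to read 0-100% as well):

    elif name == "cpu":
        # Normalize by the cluster-wide thread count instead of the worker
        # count, so the Total reads 0-100% regardless of cluster size.
        workers = self.scheduler.workers.values()
        total_data = (
            sum(ws.metrics["cpu"] for ws in workers)
            / 100
            / sum(ws.nthreads for ws in workers)
        )

With the example above: 800 / 100 / 16 = 0.5, rendered as 50%.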
crusaderky added the dashboard, diagnostics, and good first issue labels on Feb 3, 2024
hendrikmakait (Member) commented:
I'm a bit torn here:

On the one hand, summing the CPU usage up feels useful for a single worker. For example, being stuck at 100% might indicate that we're not able to effectively use our multi-core CPUs. Being stuck at 12.5% (on an 8-core machine) feels less useful, particularly since we never tell you the number of cores on the machine.

On the other hand, summing the CPU usage up makes the total meaningless very quickly, in particular on an adaptive cluster. (Cool, CPU usage is >9000%... what does that even mean?)

There are a few alternatives that come to mind:

  • Sum everything up, but provide an upper bound of 400% (800%); presentation is TBD.
  • Collect the CPU statistics per CPU (psutil.cpu_percent(percpu=True)) and compute some truly meaningful statistics, e.g., min, max, mean, median, 20/80 pct, etc.; see the sketch after this list.
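
For the second bullet, a rough sketch of the per-CPU collection with plain psutil, using statistics.quantiles for the 20/80 percentiles (the summary dict here is illustrative, not the dashboard's actual plumbing):

    import statistics
    import psutil

    # One utilization figure per logical core, measured over a short window
    # (a first call with interval=None would return meaningless zeros).
    per_cpu = psutil.cpu_percent(interval=0.1, percpu=True)

    # quantiles(n=5) yields cut points at the 20/40/60/80th percentiles;
    # it assumes at least two logical cores.
    cuts = statistics.quantiles(per_cpu, n=5)
    summary = {
        "min": min(per_cpu),
        "p20": cuts[0],
        "median": statistics.median(per_cpu),
        "mean": statistics.mean(per_cpu),
        "p80": cuts[3],
        "max": max(per_cpu),
    }

Unlike a raw sum, a min/median/max style summary stays meaningful at any cluster size.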

crusaderky (Collaborator, Author) commented: