Dashboard: Fine Performance Metrics #7725
Conversation
Force-pushed from 6f13c00 to 83045dc
Force-pushed from 83045dc to a3965b0
Unit Test Results: see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 26 files ±0, 26 suites ±0, 16h 1m 40s ⏱️ +7m 36s. For more details on these failures, see this check. Results for commit a4e1def. ± Comparison against base commit f4f4718. ♻️ This comment has been updated with latest results.
Force-pushed from 0d25a90 to b93fe0b
Ready for a roasting. :)
@crusaderky maybe you'd have some time to review. 🙏
very small comments
```python
idx = self.data["operation"].index(operation)
while idx >= len(self.data["operation_val"]):
    self.data["operation_val"].append(float("NaN"))
```
Might be better to model these two keys as a NamedTuple to protect against unequal lengths?
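A minimal sketch of that idea, purely for illustration: the type and field names below are hypothetical, not taken from the PR.

```python
from typing import NamedTuple


class OperationMetric(NamedTuple):
    # One record per operation keeps the label and its value together,
    # so the two can never drift to unequal lengths.
    operation: str
    value: float


# hypothetical usage replacing the two parallel lists
metrics: list[OperationMetric] = [OperationMetric("deserialize", 0.0)]
metrics.append(OperationMetric("compress", 1.5))
```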
```python
def init_root(self):
    def handle_toggle(_toggle):
        self.toggle.label = (
            "Bytes (Toggle for Values)"
```
These strings look like something that could get out of sync very easily, maybe use variables?
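For example, a minimal sketch of that suggestion; the constant names and the second label string are assumptions, not the PR's actual code.

```python
LABEL_BYTES = "Bytes (Toggle for Values)"
LABEL_VALUES = "Values (Toggle for Bytes)"  # assumed counterpart label


class UnitToggle:
    def __init__(self) -> None:
        self.label = LABEL_BYTES

    def handle_toggle(self, _toggle=None) -> None:
        # Flip between the two labels; with shared constants the initial
        # label and the click handler cannot drift out of sync.
        self.label = LABEL_VALUES if self.label == LABEL_BYTES else LABEL_BYTES
```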
First of all, I must say I'm very excited about this new view.
Must have
Bokeh 3.1 specific issues
The page is broken on Bokeh 3.1 (everything is fine on Bokeh 2):
- The plot looks off and I can't see the bar chart for execute.
- The plot doesn't seem to update itself in realtime. I have to reload the page every time.
- Selecting a prefix makes the plot go completely blank.
- Every few seconds, the plot goes blank. On my stderr I get:
bokeh.core.serialization.DeserializationError: can't resolve reference 'p278815'
2023-04-26 16:20:39,524 - bokeh.server.protocol_handler - ERROR - error handling message
message: Message 'PATCH-DOC' content: {'events': [{'kind': 'ModelChanged', 'model': {'id': 'p582495'}, 'attr': 'inner_height', 'new': 409}]}
error: DeserializationError("can't resolve reference 'p582495'")
There's also a stack trace, but it's completely internal to bokeh.
Other must have
- Please change unit selection to a dynamically populated dropdown
- Changing the unit only impacts the execute-by-prefix chart. It should impact all.
- I think it would make more sense for get_data to be a pie chart
- Can't find the plot in dask-labextension
- The tooltip for execute by prefix needs polishing:
- Please add unit tests. Typically (see other plots) these unit tests run the rendering in a couple of relevant use cases (e.g. no tasks / with tasks) and simply verify that they don't raise.
- The subtitles are incoherent/confusing. Please change to:
- Task execution, by prefix
- Task execution, by activity
- Send data, by activity (confusingly, get_data is what is called through RPC to get data for yourself; but when it runs locally it's sending the data to a peer)
Bonus
These critiques don't block this PR from merging and can be delivered at a later date.
- Please add a plot for gather_dep (title: "Receive data, by activity"). It's not quite redundant with get_data, as for example it can highlight deserialization-specific issues.
- The bytes plot, as it is, is not very useful. The reason is that memory-read is going to eclipse everything else. The purpose of that metric is to compare it to disk-read and calculate cache hit ratios; nothing else. It would become instantly more useful if there was a filter by activity so that it could be excluded (just like the filter by prefix).
- The stacked bar charts could use a vertical axis
- The pie charts could use an indication of percentage in the tooltip
I'd be happy to merge this PR with only the "Other must have" section done, plus a check that hides it when bokeh 3 is installed, and then have a follow-up issue to introduce bokeh 3 support. I don't know the effort involved in making it work for bokeh 3; the overhead may not be worth it.
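For illustration, such a version guard could look roughly like the sketch below; the helper name and how it would be wired into page registration are assumptions, not the PR's code.

```python
from packaging.version import parse

import bokeh

BOKEH_VERSION = parse(bokeh.__version__)


def fine_metrics_supported() -> bool:
    # Hypothetical guard: only expose the new dashboard page on bokeh 2.x
    # until the bokeh 3 rendering issues are resolved.
    return BOKEH_VERSION.major < 3


print("register fine metrics page:", fine_metrics_supported())
```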
Thanks @crusaderky for the thorough review!
Force-pushed from 5e7947c to 884262d
Maybe I misunderstand, but you're talking about the same toggle button, yes? If it was dynamic: those plots are explicitly collecting "bytes" or "seconds"; using things like
I don't know what this is, and grepping through the codebase I'm not sure how other plots are connected to this.
Force-pushed from d3caebd to f5ee251
It should react to any arbitrary units it finds. For example, somebody may want to track time that doesn't add up to the wall time:

```python
@context_meter.meter("some_stuff", unit="gpu-seconds", func=my_video_card.get_elapsed)
def runs_on_gpu():
    ...


client.submit(runs_on_gpu, key="foo").result()
```

The above will appear in

Another example: track the number of iterations needed to converge:

```python
def converging():
    n = 0
    while not converged():
        n += 1
        ...
    context_meter.digest_metric("blah", n, "iterations")


client.submit(converging, key="c-123").result()
```

The above will appear in

I would suggest you keep it simple: use ad-hoc formatting for
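One possible reading of that suggestion, sketched below: the helper and the fallback format are my own illustration, not code from the PR, though format_bytes and format_time do exist in dask.utils.

```python
from dask.utils import format_bytes, format_time


def format_unit(value: float, unit: str) -> str:
    # Special-case the units we know how to pretty-print; fall back to a
    # plain "value unit" string for arbitrary user-defined units such as
    # "gpu-seconds" or "iterations".
    if unit == "bytes":
        return format_bytes(int(value))
    if unit == "seconds":
        return format_time(value)
    return f"{value:g} {unit}"


print(format_unit(3.5e9, "bytes"))     # e.g. "3.26 GiB"
print(format_unit(0.0421, "seconds"))  # e.g. "42.10 ms"
print(format_unit(17, "iterations"))   # "17 iterations"
```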
Force-pushed from 604f902 to e994cfc
Okay, @crusaderky, another look when you have time: there is now a dynamically populated dropdown for unit selection. I don't know about the dask-labextension; it seems to work fine for me. Maybe you need to re-install the extension or something? Not that familiar with JupyterLab. 🤷♂️
@crusaderky you are asking for functionality that is not even used at the moment. I consider this entire feature still experimental and I explicitly stated already that this thing does not need to be perfect. This PR has been open for more than a month now. I do not want to see any more features being added to this but want this to be wrapped up asap. If anything is missing, please create a follow-up ticket for it.
See code review commit. Mostly renamed stuff to align it to metrics.py. Sizes are off: the top-left panel is the only one that resizes dynamically; the other two are fixed.
(screen recording: Peek.2023-05-11.00-33.mp4)
Force-pushed from 4ae732d to 290e5d3
Force-pushed from 290e5d3 to a4e1def
I'm still observing an artifact with the plots. If I am observing the dashboard while the computation is going on, the metrics are added to the chart as I'd expect. If I refresh the page, I get accurate results for a short time, but the same process of "drop a couple of tasks" happens again. Considering how long this has been open, I'm inclined to merge regardless. There is value in this dashboard and I'd like to get this out asap and prefer polishing it later.

```python
from dask.datasets import timeseries
from distributed import Client

client = Client()
ddf = timeseries("2000", "2020", partition_freq="1w")
ddf.groupby("id").x.mean().compute()
```
Haven't reviewed the core logic. Given that this is an optional and non-critical component, I'd be fine with merging. @hendrikmakait, you intended to review as well; I'll leave it up to you to merge.
If during review anything major comes up, please move it to a follow-up ticket.
```python
futures = c.map(slowinc, range(10), delay=0.001)
await wait(futures)
await asyncio.sleep(1)  # wait for metrics to arrive
```
nit (not worthy of holding off the merge): I'd like us to be very conservative with this kind of sleeping since it artificially slows down our test suite.
Ideally, there would be a hook we could listen to in order to know when we can check our assertions. This hook often doesn't exist, but we can still implement the test more forgivingly, e.g.:
```python
start = time()
while True:
    try:
        assert not cl.task_exec_data
        cl.update()
        assert cl.task_exec_data
        assert cl.task_exec_data["functions"] == ["slowinc"]
        break
    except AssertionError:
        if time() - start > 1:
            raise
        await asyncio.sleep(0.01)
```
Overall LGTM. I haven't reviewed the core logic, but focused on the requested changes. Another small issue I observed is that the charts don't scale if they have no data in them.
I'm fine with merging this and getting it out there. We can iron out the remaining issues in follow-up PRs. This dashboard is already very useful as-is.
🎉
To recap, these are the follow-ups:
I think at least the first one should be delivered before we can properly advertise this feature to the wider public.
Closes #7679
Passes pre-commit run --all-files
I'm not very satisfied with the implementation itself, open to suggestions. Maybe after we hammer out a data schema / dataclasses for the metrics the parsing will become a bit more ergonomic.
CURRENT STATE:
(I'll update these GIF(s) / comments as the PR progresses)
There is a function selector box to view/keep specific functions; otherwise, functions are removed from the chart after a set amount of time (currently 10s).
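For illustration, that eviction behaviour could look roughly like the sketch below; the names and the data layout are assumptions for the example, not the PR's implementation.

```python
import time

STALE_AFTER = 10  # seconds; functions idle longer than this drop off the chart


def prune_stale_functions(last_seen: dict[str, float], keep: set[str]) -> None:
    # Drop any function that hasn't reported metrics recently, unless the
    # user pinned it via the function selector box.
    now = time.time()
    for func in list(last_seen):
        if func in keep:
            continue
        if now - last_seen[func] > STALE_AFTER:
            del last_seen[func]
```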
So please let me have your input. Here is an example of other data which is to be added to this page later:
Example data
Current state: