Conversation

@fjetter (Member) commented Apr 4, 2025

This is a long-overdue follow-up to the Task class changes, aiming to reduce transmission overhead and memory utilization on the scheduler during graph submission.

Materializing the dependencies as a separate dict is entirely redundant, since every Task already knows its own dependencies. For backwards compatibility, _materialize_graph generated a materialized version of the dependencies mapping to keep the changes minimal at the time.
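For illustration, a minimal sketch of why the separate dict is redundant, using the dask Task spec (the exact API of `dask._task_spec` is assumed here and may differ between versions):

```python
from dask._task_spec import Task, TaskRef

# Each Task already records the keys it depends on, so a separately
# materialized {key: set_of_dependencies} mapping only duplicates it.
dsk = {
    "a": Task("a", lambda: 1),
    "b": Task("b", lambda x: x + 1, TaskRef("a")),
}

# Read dependencies off the task spec on demand instead:
for key, task in dsk.items():
    print(key, task.dependencies)
# a frozenset()
# b frozenset({'a'})
```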

The caveat was that Scheduler._remove_done_tasks_from_dsk was actually mutating that view, which turns out to be incorrect behavior.

The dependencies of a task are immutable, and every attempt to treat them otherwise is very likely to corrupt the state, or at the very least to alter the graph itself. This argument alone should suffice to show that mutating dependencies is a bad idea.
After closer review of the code, I believe that the entire logic around _remove_done_tasks_from_dsk is actually redundant and can be removed. This method scanned the dsk graph for tasks that were already in memory or had already erred and removed them from dsk accordingly (see the sketch below). The reasoning is sound, since we do not want to recompute those tasks. However, the transition engine already takes care of this and produces appropriate recommendations that end up doing nothing.
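For illustration only, a simplified sketch of the kind of filtering this method performed; it is not the actual implementation, `task_states` stands in for the scheduler's `self.tasks` mapping of key to TaskState, and the function name is hypothetical:

```python
def remove_done_tasks(dsk: dict, task_states: dict) -> dict:
    """Drop keys whose computation already finished ("memory") or
    already failed ("erred"), so they are not fed back into the
    transition engine, which would produce no-op recommendations
    for them anyway.
    """
    done = {
        key
        for key in dsk
        if (ts := task_states.get(key)) is not None
        and ts.state in ("memory", "erred")
    }
    for key in done:
        del dsk[key]
    return dsk
```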
Ultimately, this is a performance tradeoff. The logic in _remove_done_tasks_from_dsk lets us avoid pushing already-computed keys through the transition engine, which is arguably slower than the removal logic itself. Essentially, this makes repeated persists faster at the cost of slowing everything else down.
In particular, _remove_done_tasks_from_dsk contains a call to reverse_dict, which walks the entire graph and all its edges and constructs a dict of dependents. Contrary to dependencies, dependents are ephemeral and have to be recomputed every time. This is genuinely expensive, and not doing it helps.
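For reference, a minimal sketch of the inversion that reverse_dict performs (modeled on dask.core.reverse_dict; the real implementation may differ in detail), which is why it touches every vertex and edge of the graph:

```python
from collections import defaultdict

def reverse_dict_sketch(dependencies):
    """Invert {key: set of dependencies} into {key: set of dependents}.

    O(V + E): every key and every edge is visited, which is exactly
    the per-update_graph cost this PR avoids.
    """
    dependents = defaultdict(set)
    for key, deps in dependencies.items():
        dependents[key]  # ensure every key appears, even leaf nodes
        for dep in deps:
            dependents[dep].add(key)
    return dict(dependents)

# {"b" depends on "a"} inverts to {"a" has dependent "b"}
assert reverse_dict_sketch({"a": set(), "b": {"a"}}) == {"a": {"b"}, "b": set()}
```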

I haven't done any measurements, but I strongly suspect this is a win-win (even repeated persists are likely faster).

There may be a couple of wrinkles in CI that I'll have to check on but I'm very confident about the change itself.

@github-actions bot (Contributor) commented Apr 4, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files (-6)   21 suites (-6)   ⏱️ 8h 27m 46s (-2h 38m 8s)
4 089 tests (-18):  3 978 ✅ (-16)   111 💤 (±0)   0 ❌ (-2)
39 347 runs (-12 144):  37 449 ✅ (-11 756)   1 898 💤 (-386)   0 ❌ (-2)

Results for commit e2284bb. ± Comparison against base commit 01ea1eb.

This pull request removes 26 tests and adds 8. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGINT]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGTERM]
distributed.diagnostics.tests.test_nvml ‑ test_1_visible_devices
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[0,1]
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[1,0]
distributed.diagnostics.tests.test_nvml ‑ test_enable_disable_nvml
distributed.diagnostics.tests.test_nvml ‑ test_gpu_metrics
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_range_query
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_recent
distributed.diagnostics.tests.test_nvml ‑ test_has_cuda_context
…
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-True]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-True]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[False]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[future]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[simple]
distributed.tests.test_scheduler ‑ test_dont_recompute_if_erred_transition_log
This pull request skips 12 tests.
distributed.dashboard.tests.test_scheduler_bokeh ‑ test_counters
distributed.dashboard.tests.test_worker_bokeh ‑ test_counters
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_large_messages[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_memoryviews[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_sample[snappy]
distributed.tests.test_core ‑ test_tick_logging
…

♻️ This comment has been updated with latest results.

@fjetter fjetter marked this pull request as ready for review April 8, 2025 13:26
@fjetter fjetter requested a review from jacobtomlinson as a code owner April 8, 2025 13:26
@fjetter fjetter requested a review from Copilot April 9, 2025 07:10
Copilot AI left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

distributed/scheduler.py:5016

  • [nitpick] Consider using an explicit check for key existence (e.g. returning None) instead of using an empty tuple as a default, to improve code clarity and intent.
`if tspec := dsk.get(k, ()):`
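For illustration, the two patterns being contrasted; `process` is a hypothetical stand-in for the surrounding logic:

```python
# Empty-tuple default: falsy, so a missing key and an empty spec
# take the same branch.
if tspec := dsk.get(k, ()):
    process(tspec)

# Explicit existence check: missing key and empty spec stay distinct.
if (tspec := dsk.get(k)) is not None:
    process(tspec)
```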

@fjetter (Member, Author) commented Apr 16, 2025

There's a genuine error in test_gh2187

@fjetter (Member, Author) commented May 6, 2025

I can reproduce the flakiness of test_gh2187 on main as well; this change seems to make it much more likely to occur.

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch 3 times, most recently from 35a9e8f to 1634935 Compare May 7, 2025 09:24
@fjetter (Member, Author) commented May 7, 2025

This now builds on #9068

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch from 1634935 to 8faeae6 Compare May 8, 2025 14:08
@fjetter fjetter merged commit a7b7e00 into dask:main May 9, 2025
25 of 33 checks passed
@fjetter fjetter deleted the update_graph_remove_dependencies branch May 9, 2025 14:15