
[core][cgraph] Fix eager release if destruction out of order #49781

Closed. Wanted to merge 7 commits from the fix-skip-deserialize branch.

Conversation

@dayshah (Contributor) commented Jan 12, 2025

Why are these changes needed?

The current implementation of the CompiledDagRef destructor releases the buffer for everything up to the execution index of the destructed dagref. This means that calling get on a previous dagref will fail. Now we ensure that execution for previous dagrefs completes and their results are cached before we release.

Because Python doesn't guarantee destruction order, this has a possible pitfall: we may be forced to finish executing other previous dagrefs that are themselves about to be destructed, just because the dagref with the higher execution_index had del called on it first. Opened issue #49782 for this.
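The fix described above can be sketched with a toy model. This is a hypothetical, simplified illustration (DagDriver and its methods are invented names, not Ray's actual classes): the destructor-triggered release caches any earlier unconsumed results instead of dropping them, so get on a previous ref still succeeds.

```python
# Hypothetical toy model of the eager-release fix; not Ray's real API.

class DagDriver:
    def __init__(self):
        self.buffer = {}     # execution_index -> result still in the channel buffer
        self.cache = {}      # execution_index -> result cached on the driver
        self.next_index = 0

    def execute(self, value):
        idx = self.next_index
        self.next_index += 1
        self.buffer[idx] = value
        return idx

    def release(self, idx):
        # The fix: before releasing up to idx, cache earlier results that
        # have not been consumed yet, so previous refs can still get() them.
        for i in list(self.buffer):
            if i < idx:
                self.cache[i] = self.buffer.pop(i)
        self.buffer.pop(idx, None)

    def get(self, idx):
        if idx in self.cache:
            return self.cache.pop(idx)
        return self.buffer.pop(idx)


driver = DagDriver()
a = driver.execute("result-0")
b = driver.execute("result-1")

# The ref with the higher index is destructed first (Python does not
# guarantee destruction order), triggering an eager release.
driver.release(b)

# Without the caching step, this get() would fail with a missing buffer entry.
assert driver.get(a) == "result-0"
```

Before the fix, release(b) would have dropped index 0 from the buffer as well, and the final get would raise.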

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: dayshah <dhyey2019@gmail.com>
Comment on lines +115 to +120
self._dag._execute_until(
    self._execution_index, self._channel_index, timeout
)
return_vals = self._dag._get_execution_results(
    self._execution_index, self._channel_index
)
Contributor:

why do we prefer using an extra method call?

Contributor (Author):

Because we want to reuse _execute_until for the release-buffer path, and we don't want to call get there, since get would also remove the result from the buffer.
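The split the author describes can be sketched as follows. This is a hypothetical illustration (FakeDag is invented; only _execute_until and _get_execution_results mirror names from the diff above): get() is two steps, and the release path can reuse the first step without consuming any result.

```python
# Hypothetical sketch of splitting get() into execute-then-consume steps.

class FakeDag:
    def __init__(self):
        self.results = {}        # execution_index -> result
        self.executed_until = -1

    def _execute_until(self, idx, channel_idx, timeout=None):
        # Drive execution forward; does NOT consume anything.
        while self.executed_until < idx:
            self.executed_until += 1
            self.results[self.executed_until] = f"out-{self.executed_until}"

    def _get_execution_results(self, idx, channel_idx):
        # Consuming step: removes the result from the buffer.
        return self.results.pop(idx)


class Ref:
    def __init__(self, dag, execution_index, channel_index=0):
        self._dag = dag
        self._execution_index = execution_index
        self._channel_index = channel_index

    def get(self, timeout=None):
        self._dag._execute_until(
            self._execution_index, self._channel_index, timeout
        )
        return self._dag._get_execution_results(
            self._execution_index, self._channel_index
        )


dag = FakeDag()
ref = Ref(dag, 2)
assert ref.get() == "out-2"

# The release path can call _execute_until alone, leaving results in place.
dag._execute_until(5, 0)
assert 5 in dag.results
```

If get() were a single method, the release path would have no way to wait for execution without also discarding results another ref may still need.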

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from ruisearch42 January 13, 2025 18:14
@dayshah dayshah requested a review from ruisearch42 January 13, 2025 21:02
Signed-off-by: dayshah <dhyey2019@gmail.com>
@kevin85421 (Member) left a comment:

Discussed offline:

Calling _execute_until to consume all DAG references with a smaller execution_index than the one being destructed may cause memory usage to increase in the driver process.

Instead, we will try to remember the execution_index that should be released and release it only when all DAG references with smaller execution_index values have been consumed.
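The deferred-release idea discussed above can be sketched like this. This is a hypothetical illustration (DeferredReleaser and its method names are invented): destructed indices are only remembered, and the release frontier advances only once every smaller execution_index has been either consumed or destructed.

```python
# Hypothetical sketch of deferred release: remember indices to release,
# and only release once all smaller execution_index values are accounted for.

class DeferredReleaser:
    def __init__(self):
        self.pending_release = set()  # indices whose refs were destructed
        self.consumed = set()         # indices whose results were retrieved
        self.released_up_to = -1      # everything <= this has been released

    def mark_consumed(self, idx):
        self.consumed.add(idx)
        self._try_release()

    def mark_destructed(self, idx):
        self.pending_release.add(idx)
        self._try_release()

    def _try_release(self):
        # Advance the frontier only while each next index is accounted for,
        # so no ref is ever forced to execute early and no result is lost.
        nxt = self.released_up_to + 1
        while nxt in self.consumed or nxt in self.pending_release:
            self.released_up_to = nxt
            self.consumed.discard(nxt)
            self.pending_release.discard(nxt)
            nxt += 1


r = DeferredReleaser()
r.mark_destructed(1)   # ref 1 destructed first: cannot release yet
assert r.released_up_to == -1
r.mark_consumed(0)     # once index 0 is consumed, both 0 and 1 release
assert r.released_up_to == 1
```

Unlike the _execute_until approach, this never forces earlier executions to run just because a later ref was destructed, avoiding the driver-side memory growth mentioned above.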

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah closed this Jan 16, 2025
@dayshah dayshah deleted the fix-skip-deserialize branch January 16, 2025 22:01
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet

3 participants