Single-partition Dask executor for cuDF-Polars #17262

rjzamora · 2024-11-07T01:44:37Z

Description

The goal here is to lay the initial foundation for Dask-based evaluation of IR graphs in cudf-polars. The first pass supports only single-partition workloads. Single-partition support alone could be achieved with much simpler changes to cudf-polars, but the design here is intended as the base on which multi-partition support will be built.
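For readers unfamiliar with the approach: a Dask-style task graph can be written as a plain dict that maps keys to either data or `(callable, *args)` tuples. The sketch below is not code from this PR. The `evaluate` helper and the three-node graph are hypothetical stand-ins for a single-partition IR graph, and the example runs without Dask or cudf-polars installed.

```python
def evaluate(graph, key):
    """Recursively evaluate `key` in a Dask-style task graph (a dict)."""
    task = graph[key]
    # A task is a tuple whose first element is callable; anything else is data.
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name another key are dependencies.
        resolved = [
            evaluate(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return task


def double(rows):
    """Toy stand-in for a per-partition operation."""
    return [r * 2 for r in rows]


# One key per node: with a single partition, each IR node maps to one task.
graph = {
    "scan": [1, 2, 3],
    "select": (double, "scan"),
    "sum": (sum, "select"),
}

result = evaluate(graph, "sum")
print(result)  # 12
```

With only one partition there is exactly one task per node, which is why the description notes that this case alone would not need the extra machinery.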

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

rjzamora · 2024-11-13T15:56:44Z

python/cudf_polars/tests/experimental/test_parallel.py

I think it's a good idea to keep parallel tests here.

With that said, I wonder if it makes sense to somehow run the entire test suite with executor="dask" when dask is installed? (not sure how this would work, but all tests should technically work with a single partition)

Yes, the tests apparently do work. I just copied a couple for some initial testing because I didn't want to duplicate everything. We have a few options if we want to test everything:

1. Explicitly parametrize all tests with executor: [None, "dask"] (None and "cudf" both mean the "default" executor);

2. Add some sort of fixture to automatically parametrize tests with both executors;

3. Add a pytest argument to control the behavior of 2 so that Dask tests are enabled only explicitly, at least for now, and turned on by default later;

4. Others?

Add some sort of fixture to automatically parametrize tests with both executors;

We probably don't need to test everything for this specific PR. However, I think it may make sense to go in this direction pretty soon. We will probably want to make sure that single-partition execution continues working for the entire test suite as multi-partition support is added.
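One way the automatic parametrization discussed above could look is pytest's `pytest_generate_tests` hook, which can parametrize any test that requests a given argument. This is only a sketch of the idea, not what the PR implements: the argument name `executor` and the `EXECUTORS` list are assumptions taken from the values mentioned in this thread.

```python
# conftest.py (hypothetical sketch)

# Values taken from the discussion; None means the default executor.
EXECUTORS = [None, "dask"]


def pytest_generate_tests(metafunc):
    """Run every test that accepts an `executor` argument once per executor."""
    if "executor" in metafunc.fixturenames:
        metafunc.parametrize("executor", EXECUTORS)
```

Tests that do not declare an `executor` argument are left untouched, so the hook can be introduced gradually.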

@wence- @rjzamora I have made the changes we discussed earlier today in 2b74f28. It adds a new --executor pytest command-line argument that defaults to "cudf" (the default executor) and lets us rerun the test suite with --executor dask-experimental (as also discussed, I renamed "dask" to "dask-experimental" in c8ca09e). The caveat is that, to be as unintrusive as possible in the API, I had to add an Executor variable to cudf_polars.testing.asserts, which we set on pytest entry in the pytest_configure function in conftest.py. The advantage of this approach is that users are not forced to always pass the executor to assert_gpu_result_equal via its API (which avoids mistakes such as forgetting to pass it). The obvious downside is having to modify the cudf_polars.testing.asserts.Executor module variable, which always feels a bit hacky.

I'm happy to change this to whatever way you feel may suit best, or if you can think of a better solution please let me know too.
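The mechanism described in this comment can be sketched roughly as below. It is an approximation under stated assumptions, not the code from 2b74f28: the `asserts` namespace is a hypothetical stand-in for the `cudf_polars.testing.asserts` module so that the sketch runs without cudf-polars installed, and the `choices` and `help` values are illustrative. `pytest_addoption`, `parser.addoption`, `pytest_configure`, and `config.getoption` are standard pytest hooks and methods.

```python
# conftest.py (hypothetical sketch)
import types

# Stand-in for the cudf_polars.testing.asserts module (assumption).
asserts = types.SimpleNamespace(Executor=None)


def pytest_addoption(parser):
    """Register the --executor command-line option."""
    parser.addoption(
        "--executor",
        action="store",
        default="cudf",
        choices=("cudf", "dask-experimental"),
        help="Executor used when running the GPU tests.",
    )


def pytest_configure(config):
    """Copy the chosen executor into the module-level variable at startup."""
    asserts.Executor = config.getoption("--executor")
```

Under this scheme the suite would be rerun as `pytest --executor dask-experimental`, and test helpers read the module-level variable instead of taking the executor as an argument.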

rjzamora · 2024-11-14T20:09:15Z

python/cudf_polars/cudf_polars/testing/asserts.py

@@ -81,7 +89,7 @@ def assert_gpu_result_equal(
    )

    expect = lazydf.collect(**final_polars_collect_kwargs)
-    engine = GPUEngine(raise_on_fail=True)
+    engine = GPUEngine(raise_on_fail=True, executor=Executor)


Should this be something like executor=executor or Executor?
Right now, it seems like the executor is always ignored.

This was a leftover from a previous change; I intended to remove the executor kwarg. I've done that now in 22678a5, but we may still want to change this depending on how the discussion in #17262 (comment) goes.
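The fallback suggested in the review, executor=executor or Executor, can be sketched in isolation as follows. This is an illustration only: `resolve_executor` is a hypothetical helper, not a function in cudf_polars, and the module-level `Executor` here merely stands in for the variable described in this thread.

```python
# Stand-in for the module-level default that conftest.py would set (assumption).
Executor = "dask-experimental"


def resolve_executor(executor=None):
    """Prefer an explicitly passed executor; otherwise use the module default."""
    return executor or Executor


print(resolve_executor())        # falls back to the module default
print(resolve_executor("cudf"))  # explicit argument wins
```

One caveat of using `or`: any falsy argument (None or an empty string) falls through to the module default.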

rjzamora · 2024-11-14T22:31:30Z

python/cudf_polars/cudf_polars/dsl/ir.py

-        self._non_child_args = (name, self.options)
+        self._non_child_args = (schema, name, self.options)
+
+    def get_hashable(self) -> Hashable:


Suggested change

-    def get_hashable(self) -> Hashable:
+    def get_hashable(self) -> Hashable:  # pragma: no cover; Needed by experimental

Pretty sure this is lowering test coverage.

I added basic tests for all executors that run independently of the --executor pytest argument, so coverage stays at 100% regardless of which executor is selected.

See 9b78d8f.

rjzamora added 2 commits November 6, 2024 14:58

cleanup

a590076

rename to parallel

7f1bec7

rjzamora added the 5 - DO NOT MERGE, improvement, non-breaking, and cudf.polars labels Nov 7, 2024

rjzamora self-assigned this Nov 7, 2024

Merge branch 'branch-24.12' into cudf-polars-dask-simple

023e085

github-actions bot added the Python label Nov 7, 2024

rjzamora commented Nov 7, 2024

python/cudf_polars/cudf_polars/experimental/single.py (outdated)

rjzamora commented Nov 7, 2024

python/cudf_polars/cudf_polars/experimental/single.py (outdated)

rjzamora added the 2 - In Progress label and removed the 5 - DO NOT MERGE label Nov 7, 2024

Merge branch 'branch-24.12' into cudf-polars-dask-simple

e7a2fce

wence- reviewed Nov 7, 2024

rjzamora added 8 commits November 7, 2024 13:28

simplify solution

69a3374

Merge branch 'cudf-polars-dask-simple' of github.com:rjzamora/cudf into cudf-polars-dask-simple

6aa3694

Merge branch 'branch-24.12' into cudf-polars-dask-simple

ea22a9a

deeper dive

915a779

improve simple agg reduction

bd9d783

cleanup fundamental bugs

7363d91

move PartitionInfo

58ee5f4

add Literal

ecc51ef

wence- reviewed Nov 11, 2024

python/cudf_polars/cudf_polars/experimental/parallel.py (outdated)

rjzamora added 5 commits November 12, 2024 08:28

Merge branch 'branch-24.12' into cudf-polars-dask-simple

75eae0c

add lower_ir_graph

fb2d6bf

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

c17564c

strip out most exploratory logic

6e66998

Merge branch 'branch-24.12' into cudf-polars-dask-simple

c41723d

rjzamora changed the title from "[DNM][WIP] Single-partition Dask executor for cuDF-Polars" to "Single-partition Dask executor for cuDF-Polars" Nov 12, 2024

Improve count code

8aed94f

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

rjzamora commented Nov 13, 2024

pentschev and others added 3 commits November 13, 2024 08:07

Pass executor to GPUEngine in assert_gpu_result_equal

aadaf10

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

c3a6907

Merge branch 'branch-24.12' into cudf-polars-dask-simple

4f67819

rjzamora commented Nov 14, 2024

python/cudf_polars/cudf_polars/experimental/parallel.py

pentschev and others added 4 commits November 14, 2024 08:26

Clarify intent renaming executor to "dask-experimental"

c8ca09e

move PartitionInfo out of ir module

3fd51bb

Merge remote-tracking branch 'rjzamora/cudf-polars-dask-simple' into cudf-polars-dask-simple

bf182e4

skip coverage on sanity-check errors

453e274

rjzamora commented Nov 14, 2024

python/cudf_polars/cudf_polars/experimental/parallel.py

pentschev added 5 commits November 14, 2024 11:32

Add --executor to pytest

2b74f28

Merge remote-tracking branch 'rjzamora/cudf-polars-dask-simple' into cudf-polars-dask-simple

6d3cd55

Enable dask-experimental tests in CI, remove duplicates

2398a2e

Fix wrong protocol name in deserialization test

9aa479a

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

64ea98e

rjzamora commented Nov 14, 2024

ci/run_cudf_polars_pytests.sh

rjzamora commented Nov 14, 2024

Remove executor kwarg from assert_gpu_result_equal

22678a5

rjzamora commented Nov 14, 2024

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

41441ca

rjzamora commented Nov 15, 2024

python/cudf_polars/cudf_polars/experimental/parallel.py

pentschev and others added 7 commits November 15, 2024 07:48

Reintroduce executor kwarg in assert_gpu_result_equal

efadb78

Add basic tests for all executors to ensure 100% coverage

9b78d8f

Merge remote-tracking branch 'rjzamora/cudf-polars-dask-simple' into cudf-polars-dask-simple

c54c217

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

70da7a9

Fix executor in assert_gpu_result_equal

3aeb1e4

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

485a161

Merge remote-tracking branch 'upstream/branch-24.12' into cudf-polars-dask-simple

eb41100

pentschev mentioned this pull request Nov 19, 2024

Prevent PyDataFrame serialization #17364

Draft

3 tasks
