Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single-partition Dask executor for cuDF-Polars #17262

Draft
wants to merge 46 commits into
base: branch-24.12
Choose a base branch
from

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Nov 7, 2024

Description

The goal here is to lay down the initial foundation for dask-based evaluation of IR graphs in cudf-polars. The first pass will only support single-partition workloads. This functionality could be achieved with much less-complicated changes to cudf-polars. However, we do want to build multi-partition support on top of this.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 5 - DO NOT MERGE Hold off on merging; see PR for details improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Nov 7, 2024
@rjzamora rjzamora self-assigned this Nov 7, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 7, 2024
@rjzamora rjzamora added 2 - In Progress Currently a work in progress and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Nov 7, 2024
python/cudf_polars/cudf_polars/callback.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/ir.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/ir.py Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/single.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/single.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/single.py Outdated Show resolved Hide resolved
@rjzamora rjzamora changed the title [DNM][WIP] Single-partition Dask executor for cuDF-Polars Single-partition Dask executor for cuDF-Polars Nov 12, 2024
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to keep parallel tests here.

With that said, I wonder if it makes sense to somehow run the entire test suite with executor="dask" when dask is installed? (not sure how this would work, but all tests should technically work with a single partition)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, apparently tests do work, I just copied a couple for some initial testing, I didn't want to duplicate everything. We do have a few options if we want to test everything:

  1. Explicitly parametrize all tests with executor: [None, "dask"] (None and "cudf" both mean the "default" executor);
  2. Add some sort of fixture to automatically parametrize tests with both executors;
  3. Add a pytest argument to control the behavior of 2 so that we can only enable Dask tests explicitly, at least for now and later turn it on by default;
  4. Others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some sort of fixture to automatically parametrize tests with both executors;

We probably don't need to test everything for this specific PR. However, I think it may make sense to go in this direction pretty soon. We will probably want to make sure that single-partition execution continues working for the
entire test suite as multi-partition support is added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- @rjzamora I have made the changes we discussed earlier today in 2b74f28 . It adds a new --executor pytest command-line argument that has the default value of "cudf" (default executor) but allows us to run with --executor dask-experimental (I've also renamed from "dask" to "dask-experimental" in c8ca09e, as discussed as well) to rerun the test suite with that executor. The caveat is that to be the least intrusive as possible in the API I had to add an Executor variable to cudf_polars.testing.asserts, which allows us to modify it upon pytest entry in the pytest_configure function in conftest.py. The advantage of this approach is we don't need to force the user to always specify the executor to assert_gpu_result_equal via its API (and thus prevent things like forgetting to pass it), but the obvious downside is the need to modify the cudf_polars.testing.asserts.Executor module variable which always feels as a bit of a hacky solution.

I'm happy to change this to whatever way you feel may suit best, or if you can think of a better solution please let me know too.

@@ -81,7 +89,7 @@ def assert_gpu_result_equal(
)

expect = lazydf.collect(**final_polars_collect_kwargs)
engine = GPUEngine(raise_on_fail=True)
engine = GPUEngine(raise_on_fail=True, executor=Executor)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be something like executor=executor or Executor?
Right now, it seems like the executor is always ignored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a leftover from a previous change, I intended to remove the executor kwarg. I've done that now in 22678a5, but we may want to change this still depending on how the discussion in #17262 (comment) goes.

self._non_child_args = (name, self.options)
self._non_child_args = (schema, name, self.options)

def get_hashable(self) -> Hashable:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_hashable(self) -> Hashable:
def get_hashable(self) -> Hashable: # pragma: no cover; Needed by experimental

Pretty sure this is lowering test coverage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced basic testing for all executors independent of --executor pytest argument to ensure 100% coverage always.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 9b78d8f .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants