Single-partition Dask executor for cuDF-Polars #17262
base: branch-24.12
Conversation
…to cudf-polars-dask-simple
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
I think it's a good idea to keep parallel tests here.
With that said, I wonder if it makes sense to somehow run the entire test suite with `executor="dask"` when dask is installed? (Not sure how this would work, but all tests should technically work with a single partition.)
Yes, apparently tests do work; I just copied a couple for some initial testing because I didn't want to duplicate everything. We do have a few options if we want to test everything:
- Explicitly parametrize all tests with `executor: [None, "dask"]` (`None` and `"cudf"` both mean the "default" executor);
- Add some sort of fixture to automatically parametrize tests with both executors;
- Add a pytest argument to control the behavior of option 2, so that we only enable Dask tests explicitly for now and turn them on by default later (see the sketch after this list);
- Others?
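For concreteness, a minimal sketch of how options 2 and 3 could be combined in `conftest.py`; the flag name and wiring here are hypothetical, not necessarily what this PR would implement:

```python
# Hypothetical conftest.py sketch for options 2 and 3 above; the flag name
# and defaults are illustrative only.


def pytest_addoption(parser):
    # Option 3: Dask-executor runs are opt-in for now.
    parser.addoption(
        "--run-dask-executor",
        action="store_true",
        default=False,
        help="Also parametrize tests with the Dask executor.",
    )


def pytest_generate_tests(metafunc):
    # Option 2: any test requesting an `executor` argument is automatically
    # parametrized over the available executors.
    if "executor" in metafunc.fixturenames:
        executors = [None]  # None and "cudf" both mean the default executor
        if metafunc.config.getoption("--run-dask-executor"):
            executors.append("dask")
        metafunc.parametrize("executor", executors)
```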
> Add some sort of fixture to automatically parametrize tests with both executors;

We probably don't need to test everything for this specific PR. However, I think it may make sense to go in this direction pretty soon. We will probably want to make sure that single-partition execution continues working for the entire test suite as multi-partition support is added.
@wence- @rjzamora I have made the changes we discussed earlier today in 2b74f28. It adds a new `--executor` pytest command-line argument that defaults to `"cudf"` (the default executor) but lets us rerun the test suite with `--executor dask-experimental` (I've also renamed `"dask"` to `"dask-experimental"` in c8ca09e, as discussed). The caveat is that, to be as unintrusive as possible in the API, I had to add an `Executor` variable to `cudf_polars.testing.asserts`, which we modify on pytest entry in the `pytest_configure` function in `conftest.py`. The advantage of this approach is that we don't need to force the user to always pass the executor to `assert_gpu_result_equal` via its API (and thus prevent things like forgetting to pass it), but the obvious downside is the need to modify the `cudf_polars.testing.asserts.Executor` module variable, which always feels like a bit of a hack.

I'm happy to change this to whichever way you feel suits best, or if you can think of a better solution please let me know too.
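Roughly, that wiring has the following shape; this is a sketch of the approach described above, not the exact code in 2b74f28:

```python
# conftest.py -- sketch of the --executor plumbing described above; the
# actual commit may differ in details.
from cudf_polars.testing import asserts


def pytest_addoption(parser):
    parser.addoption(
        "--executor",
        action="store",
        default="cudf",
        help="Executor to use for assert_gpu_result_equal "
        "('cudf' or 'dask-experimental').",
    )


def pytest_configure(config):
    # Stash the chosen executor in a module-level variable so that
    # assert_gpu_result_equal picks it up without an extra keyword argument.
    asserts.Executor = config.getoption("--executor")
```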
…cudf-polars-dask-simple
```diff
@@ -81,7 +89,7 @@ def assert_gpu_result_equal(
     )

     expect = lazydf.collect(**final_polars_collect_kwargs)
-    engine = GPUEngine(raise_on_fail=True)
+    engine = GPUEngine(raise_on_fail=True, executor=Executor)
```
Should this be something like `executor=executor or Executor`? Right now, it seems like the `executor` is always ignored.
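A minimal illustration of that suggestion (the helper below is hypothetical, not the actual `assert_gpu_result_equal` signature):

```python
import polars as pl

# Module-level default, normally set from the --executor pytest flag.
Executor = "cudf"


def _build_engine(executor: str | None = None) -> pl.GPUEngine:
    # An explicitly passed executor wins; otherwise fall back to the
    # module-level default, i.e. `executor or Executor`.
    return pl.GPUEngine(raise_on_fail=True, executor=executor or Executor)
```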
This was a leftover from a previous change; I intended to remove the `executor` kwarg. I've done that now in 22678a5, but we may still want to change this depending on how the discussion in #17262 (comment) goes.
```diff
-        self._non_child_args = (name, self.options)
+        self._non_child_args = (schema, name, self.options)

     def get_hashable(self) -> Hashable:
```
```suggestion
    def get_hashable(self) -> Hashable:  # pragma: no cover; Needed by experimental
```
Pretty sure this is lowering test coverage.
I introduced basic testing for all executors, independent of the `--executor` pytest argument, to always ensure 100% coverage.
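Something along these lines; the test below is a hypothetical sketch of such an always-run smoke test, not the exact code added:

```python
# Hypothetical smoke test that exercises every executor regardless of the
# --executor pytest flag; the test name and query are illustrative only.
import polars as pl
import pytest
from polars.testing import assert_frame_equal


@pytest.mark.parametrize("executor", ["cudf", "dask-experimental"])
def test_executor_basics(executor):
    q = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).select(
        (pl.col("a") + pl.col("b")).alias("c")
    )
    engine = pl.GPUEngine(raise_on_fail=True, executor=executor)
    assert_frame_equal(q.collect(engine=engine), q.collect())
```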
See 9b78d8f.
…cudf-polars-dask-simple
Description

The goal here is to lay down the initial foundation for Dask-based evaluation of `IR` graphs in cudf-polars. The first pass will only support single-partition workloads. This functionality could be achieved with much less complicated changes to cudf-polars; however, we do want to build multi-partition support on top of this.

Checklist
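For illustration, a usage sketch of the new executor; the query and data below are made up, and only the `executor="dask-experimental"` engine option comes from this PR:

```python
import polars as pl

# Hypothetical example query; any LazyFrame would do.
q = (
    pl.LazyFrame({"key": [1, 1, 2], "value": [10, 20, 30]})
    .group_by("key")
    .agg(pl.col("value").sum())
)

# Route GPU execution through the experimental single-partition Dask executor.
engine = pl.GPUEngine(raise_on_fail=True, executor="dask-experimental")
result = q.collect(engine=engine)
print(result)
```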