Support cudf-polars str.zfill
#19081
Conversation
I assume that this PR will be updated to leverage #19090 once that is merged?
```python
col, col_width = [
    child.evaluate(df, context=context) for child in self.children
]
all_gt_0 = plc.binaryop.binary_operation(
```
should we do this introspection?
cc @mroeschke , do we usually introspect data to error at this point? If we don't, then we risk maybe producing the wrong result in some edge cases, like a negative fill value in the zfill column overload. The scalar case is easy enough to introspect and throw, but this is a whole scan.
Is this a case where we wouldn't be able to match the result (not necessarily an error) that Polars produces?
If so, yes, I think there's some precedent for introspecting the data to raise an error. We would want to do this in `_validate_input` though so it can raise during translation. Plus, we can do this introspection on the CPU by performing a similar operation on the Polars objects.
The case this intends to match is the one where one of the fill elements in the column overload is negative; Polars throws at runtime in this case. I suppose we want to match that behavior, but it's a performance hit on every call to the column overload :(
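To make the trade-off concrete, here is a pure-Python sketch of the kind of up-front width validation being discussed. The function name and error wording are illustrative, not the actual cudf-polars code; a real column overload would have to scan device memory, which is the performance cost mentioned above.

```python
def validate_zfill_widths(widths):
    """Raise if any fill width would fail conversion to u64 (i.e. is negative).

    Sketch only: illustrates validating on the CPU before any work runs,
    so the error surfaces up front rather than producing a wrong result.
    """
    bad = [w for w in widths if w is not None and w < 0]
    if bad:
        raise ValueError(
            f"conversion to `u64` failed for {len(bad)} out of "
            f"{len(widths)} values: {bad}"
        )
    return widths
```

The scalar case is a single comparison; the column case is the full scan being weighed here.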
Sorry @mroeschke, just getting back to this now. I've fixed up the PR, which should now be passing tests, and responded to the open question above.
```python
if width.value is not None and width.value < 0:
    dtypestr = dtype_str_repr(width.dtype.polars)
    raise InvalidOperationError(
        f"conversion from `{dtypestr}` to `u64` "
        f"failed in column 'literal' for 1 out of "
        f"1 values: [{width.value}]"
    ) from None
```
Could we put this in `_validate_input`? Raising there will enable fallback during translation.

Looks like
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Should this be a polars issue? Who is right?
Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters |
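For comparison, Python's built-in `str.zfill` behaves as follows; whether Polars matches each of these cases (particularly the non-ASCII one) is exactly what's being debated here. A quick stdlib check:

```python
# Python's str.zfill inserts zeros *after* a leading '+'/'-' sign:
assert "-1".zfill(5) == "-0001"
assert "+1".zfill(5) == "+0001"

# Other strings are simply left-padded with '0':
assert "abc".zfill(5) == "00abc"

# Width is measured in characters, not bytes, so a multi-byte
# UTF-8 character still counts as one:
assert "é".zfill(3) == "00é"
```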
It may be this piece of the docs:

Ah. I wonder if we need to do something beyond xfailing here? It seems like in the wild users would be mostly
None of the above seem particularly amazing. I'll think on what to do in the absence of any obvious ideas I missed.

Ah OK, so I suppose the "proper" solution could use

Thanks, this sounds like a winner 😄

It looks like polars wants to pad strings like
After playing around with this a bit, I can't think of a way out of this without throwing for the cases where the APIs diverge. AFAICT they'll only definitely agree for a certain subset of characters in the source data. @davidwendt, what's the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?
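Pending a libcudf answer, here is a CPU-side sketch of the check being asked for (illustrative only; a real implementation would use libcudf's character-type utilities on device, and the "safe" set is an assumption of this sketch):

```python
import string

# Assumption: the characters for which the diverging zfill semantics
# agree are ASCII alphanumerics and spaces.
_SAFE = frozenset(string.ascii_letters + string.digits + " ")

def only_safe_chars(strings):
    """True if every non-null string is empty or uses only 'safe' characters.

    Empty strings pass trivially (set("") is a subset of anything),
    which covers the empty-string gap mentioned above.
    """
    return all(s is None or set(s) <= _SAFE for s in strings)
```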
```python
        ["abc", "def"],
    ],
)
def test_string_zfill(fill, input_strings):
```
Can you run these tests with polars 1.29? You might need to version-guard these tests, i.e. `if POLARS_VERSION_LT_130`.

Maybe

I believe this API would cover everything except empty strings.
Added some validation and conditional xfailing based on the polars version here. We now guard for non-ASCII characters in the simplest way I can think of. It's not ideal, but I think we should move forward and reassess if perf concerns ever surface.
Current:

```python
5
if not POLARS_VERSION_LT_130
else pytest.param(5, marks=pytest.mark.xfail(reason="fixed in Polars 1.30")),
```

Suggested change:

```python
pytest.param(5, marks=pytest.mark.xfail(POLARS_VERSION_LT_130, reason="fixed in Polars 1.30")),
```
(For a follow up PR if interested)
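For reference, `pytest.mark.xfail` accepts a boolean condition as its first argument, so the version guard can live inside a single `pytest.param` rather than a Python conditional expression. A minimal sketch, with a stand-in flag in place of the real version constant:

```python
import pytest

POLARS_VERSION_LT_130 = True  # stand-in for the real version flag


@pytest.mark.parametrize(
    "fill",
    [
        # The xfail mark only takes effect when the condition is True,
        # so the same param works for both old and new Polars.
        pytest.param(
            5,
            marks=pytest.mark.xfail(
                POLARS_VERSION_LT_130, reason="fixed in Polars 1.30"
            ),
        ),
    ],
)
def test_fill(fill):
    assert fill == 5
```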
/merge |
Closes #19035
Closes #16480
I believe this needs pola-rs/polars#22985 to pass the one remaining failing test and the column overload described here #19035 (comment)