Skip to content

Conversation

@brandon-b-miller
Copy link
Contributor

@brandon-b-miller brandon-b-miller commented Jun 4, 2025

Closes #19035
Closes #16480

I believe this needs pola-rs/polars#22985 to pass the one remaining failing test and the column overload described here #19035 (comment)

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner June 4, 2025 01:57
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jun 4, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 4, 2025
@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change labels Jun 4, 2025
@vyasr
Copy link
Contributor

vyasr commented Jun 12, 2025

I assume that this PR will updated to leverage #19090 once that is merged?

col, col_width = [
child.evaluate(df, context=context) for child in self.children
]
all_gt_0 = plc.binaryop.binary_operation(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do this introspection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mroeschke , do we usually introspect data to error at this point? If we don't, then we risk maybe producing the wrong result in some edge cases, like a negative fill value in the zfill column overload. The scalar case is easy enough to introspect and throw, but this is a whole scan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a case where we wouldn't be able to match the result (not necessarily an error) that Polars produces?

If so, yes, I think there's some precedent on introspecting the data to raise an error. We would want to do this in _validate_input though so it can raise during translation. Plus we can do this introspection on the CPU by doing a similar option on Polars objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case this intends to match is the one where one of the fill elements in the column overload is negative, polars throws at runtime in this case. I suppose we want to match that behavior, but it's a performance hit to every call to the column overload :(

@brandon-b-miller
Copy link
Contributor Author

Sorry @mroeschke , just getting back to this now. I've fixed up the PR which should now be passing tests, and responded to the open question above.

Comment on lines 322 to 328
if width.value is not None and width.value < 0:
dtypestr = dtype_str_repr(width.dtype.polars)
raise InvalidOperationError(
f"conversion from `{dtypestr}` to `u64` "
f"failed in column 'literal' for 1 out of "
f"1 values: [{width.value}]"
) from None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put this in _validate_input? Raising there will enable fallback during translation.

@mroeschke
Copy link
Contributor

Looks like py-polars/tests/unit/operations/namespaces/string/test_pad.py::test_str_zfill_unicode_not_respected may need to be added to the expected failures list in cudf/python/cudf_polars/cudf_polars/testing/plugin.py. Apparently Polars doesn't add zeros for unicode characters.

@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Jul 15, 2025

Polars doesn't add zeros for unicode characters.

Should this be a polars issue? Who is right?

@mroeschke
Copy link
Contributor

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

@brandon-b-miller
Copy link
Contributor Author

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

It may be this piece of the docs:

This method is intended for padding numeric strings. If your data contains non-ASCII characters, use [pad_start()](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.pad_start.html#polars.Series.str.pad_start) instead.

Ah. I wonder if we need to do something beyond xfailing here? It seems like in the wild users would be mostly zfilling numeric data but it's still dangerous to have a behavior divergence with the GPU backend. There aren't a lot of great options though. Off the top of my head we could

  • extend the libcudf zfill api to ignore unicode optionally
  • scan the data unnecessarily in 99% of cases to make us safe from the one edge case
  • Document the issue somewhere

None of the above seem particularly amazing. I'll think on what to do in lieu of any obvious ideas I missed.

@mroeschke
Copy link
Contributor

I wonder if we need to do something beyond xfailing here?

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

@brandon-b-miller
Copy link
Contributor Author

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

Thanks- this sounds like a winner 😄

@brandon-b-miller
Copy link
Contributor Author

It looks like polars wants to pad strings like abc as well as numeric, so we can't use is_fixed_point. But I think I might see a way forward with all_characters_of_type, will try it out.

@brandon-b-miller
Copy link
Contributor Author

after playing around with this a bit I can't think of a way out of this without throwing for the cases where the APIs diverge. AFAICT they'll only definitely agree for a certain subset of characters in the source data.

@davidwendt whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

["abc", "def"],
],
)
def test_string_zfill(fill, input_strings):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run these tests with polars 1.29? You might need to version guard these tests ie if POLARS_VERSION_LT_130

@Matt711
Copy link
Contributor

Matt711 commented Jul 22, 2025

hats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

Maybe cudf::strings::contains_re and then cudf::reduce

@davidwendt
Copy link
Contributor

... whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

I believe this API would cover everything except empty strings
https://docs.rapids.ai/api/cudf/stable/pylibcudf/api_docs/strings/char_types/#pylibcudf.strings.char_types.all_characters_of_type

@brandon-b-miller brandon-b-miller changed the base branch from branch-25.08 to branch-25.10 July 24, 2025 13:32
@brandon-b-miller
Copy link
Contributor Author

Added some validation and conditional xfailing based on the polars version here. We now guard for non-ascii characters in the simplest way I can think of. It's not ideal but I think we should move forward and reassess if perf concerns ever surface.

Comment on lines +545 to +547
5
if not POLARS_VERSION_LT_130
else pytest.param(5, marks=pytest.mark.xfail(reason="fixed in Polars 1.30")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
5
if not POLARS_VERSION_LT_130
else pytest.param(5, marks=pytest.mark.xfail(reason="fixed in Polars 1.30")),
pytest.param(5, marks=pytest.mark.xfail(POLARS_VERSION_LT_130, reason="fixed in Polars 1.30")),

(For a follow up PR if interested)

@mroeschke
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit afc0d5d into rapidsai:branch-25.10 Aug 5, 2025
90 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Aug 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cudf-polars Issues specific to cudf-polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FEA] Support polars.Expr.str.zfill in cudf-polars [Story] Full coverage of stringfunction methods in cudf polars

5 participants