Migrate strings `contains` operations to pylibcudf #15880
    return plc.interop.from_arrow(pa_target_col)


@pytest.fixture(params=["A"], scope="module")
I'll need to add some more actual tests here.
cpdef Column contains_re(
    Column input,
    RegexProgram prog
):
docstring needed
We might also want to start the Sphinx docs for the strings submodule of pylibcudf.
As far as I can tell, there are no .rst files for strings in the api_docs/pylibcudf folder.
Added a few sphinx docs here
Looks good with two small questions.
        raise ValueError("Do not instantiate RegexProgram directly, use create")

    @staticmethod
    def create(object pattern, object flags):
- def create(object pattern, object flags):
+ def create(str pattern, RegexFlags flags):
?
My experimentation leads me to believe it must be done this way: 758755c
Yes, you always have to use the `cimport`ed name when typing Cython functions, not the `import`ed name. `regex_flags` != `RegexFlags` because the latter is imported rather than cimported. This is deliberate! We import the aliased name because that is the name we want to expose to users of pylibcudf in pure Python, using Python naming conventions for classes (CapsCase) since enums in Python are classes, while `regex_flags` is the snake_case representation of the Cython/C enum.
if isinstance(flags, (int, RegexFlags)):
    c_flags = flags
with nogil:
    c_prog = regex_program.create(c_pattern, c_flags)
question: Whose job is it to ensure that the provided input is a "valid" regex?
The libcudf method should validate the pattern at the C++ level:
`cudf::logic_error` If pattern is invalid or contains unsupported features
Can we add some simple tests of this?
Yeah, I think that's reasonable. Will add some.
Couple of tests at b15588a
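A test for the validation behaviour discussed in this thread could take roughly the following shape. This is only a sketch: the real check needs a GPU and pylibcudf, so Python's `re` module stands in for the regex engine here, and `compile_program` and `raises_on` are hypothetical helpers, not part of the pylibcudf API. The exception type pylibcudf surfaces for a `cudf::logic_error` may also differ from `re.error`.

```python
import re


def compile_program(pattern: str, flags: int = 0):
    """Hypothetical stand-in for a regex-program factory.

    Validation is delegated to the underlying engine (here Python's `re`),
    mirroring the idea that the wrapper itself does not inspect the pattern.
    """
    return re.compile(pattern, flags)


def raises_on(pattern: str) -> bool:
    """Return True if the engine rejects the pattern, False if it compiles."""
    try:
        compile_program(pattern)
    except re.error:
        return True
    return False


# A well-formed pattern compiles; a malformed one is rejected by the engine.
valid_ok = not raises_on("a+b")
invalid_rejected = raises_on("(unclosed")
```

The point of the design is that the Python layer stays thin: it forwards the pattern and lets the engine decide what counts as valid, so the tests only need to confirm that the error propagates.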
Commit 758755c seems to have broken tests. Not sure why yet, but it causes cuDF Python tests to fail when passing a value that is outside the enum range but is the result of a bitwise OR of valid values. I think we can't type this signature this strongly because a value really can be an object, like a Python integer. Typing the signature as
EDIT: Nvm, I misunderstood. I guess maybe just type it as an int then?
cef5d27 types the flags as an int, but requires a cast. I honestly can't tell if it feels better this way or not.
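The failure mode described above can be reproduced in pure Python. The sketch below is an illustration only: the `RegexFlags` class and its values are assumptions made for the example (not copied from this PR), and `to_c_flags` is a hypothetical helper standing in for the int-typed signature plus cast.

```python
import enum


class RegexFlags(enum.IntEnum):
    # Illustrative members and values, assumed for this example.
    DEFAULT = 0
    MULTILINE = 8
    DOTALL = 16


def to_c_flags(flags) -> int:
    """Hypothetical helper: accept an enum member or a plain int.

    IntEnum members are ints, so a single isinstance check covers both.
    """
    if not isinstance(flags, int):
        raise TypeError("flags must be an int or a RegexFlags member")
    return int(flags)


# OR-ing two IntEnum members yields a plain int, not an enum member.
combined = RegexFlags.MULTILINE | RegexFlags.DOTALL

# The combined value is not itself a valid member of the enum.
try:
    RegexFlags(combined)
    combined_is_member = False if False else True
except ValueError:
    combined_is_member = False
```

Because the combined value falls outside the enum's members, a signature typed strictly as the enum rejects it, while an int-typed signature accepts both single members and OR-ed combinations.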
Cool, this LGTM then. One final note: I don't think we necessarily need to rework the docs in this PR, but just wanted to give you a heads up that I'm probably going to rework the docs in #15839 (where I'm doing replace).
@wence- any final thoughts here?
Thanks Brandon
Thanks @lithomas1, I merged the latest and plugged the docs into the structure you built out in the other PR. Should be good to go now.
Ah, seems there's a small doc error somewhere:
Will look into what's missing. EDIT: I think this was an extra file hanging around.
/merge
This PR adds cudf-polars code for evaluating the `StringFunction.Contains` expression node. Depends on #15880
Authors:
- https://github.com/brandon-b-miller
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- Lawrence Mitchell (https://github.com/wence-)
URL: #15918
This PR creates pylibcudf strings `contains` APIs and migrates the cuDF cython to leverage them. Part of #15162.