
[BUG] Series.str.findall returns DataFrame instead of Series of lists. #10226

Closed

Description

assert_eq(ps.str.findall("Monkey")[1][0], gs.str.findall("Monkey")[0][1])

I noticed that the two sides of this test assertion are indexed differently: [1][0] for pandas and [0][1] for cuDF. We should assert that the entire results are equal, not just the one value that matches the regex. While looking into that, I found that pandas returns a Series whose elements are lists of strings, while cuDF returns a DataFrame, with multiple columns if multiple matches are found. We should adopt pandas' convention.

Tested on branch-22.04, commit 4e8cb4f.

>>> test_data = ["Lion", "Monkey", "Rabbit", "Don\nkey"]
>>> ps = pd.Series(test_data)
>>> gs = cudf.Series(test_data)
>>> ps.str.findall("Monkey")
0          []
1    [Monkey]
2          []
3          []
dtype: object
>>> gs.str.findall("Monkey")
        0
0    <NA>
1  Monkey
2    <NA>
3    <NA>
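
For reference, here is a minimal sketch of the shape pandas produces, using only the standard library. It relies on the fact that pandas documents Series.str.findall as equivalent to applying re.findall to each element. The helper name findall_per_row is hypothetical and is not part of pandas or cuDF.

```python
import re

test_data = ["Lion", "Monkey", "Rabbit", "Don\nkey"]


def findall_per_row(values, pattern):
    """Hypothetical helper: return one list of matches per input string.

    Rows with no match yield an empty list rather than a null, which is
    the convention pandas follows for Series.str.findall.
    """
    return [re.findall(pattern, value) for value in values]


expected = findall_per_row(test_data, "Monkey")
print(expected)  # [[], ['Monkey'], [], []]
```

A test that compares the whole result against a list like this would catch both the shape mismatch and the indexing mismatch, rather than checking only the single matching value.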

Originally posted by @bdice in #10208 (comment)


Labels

Python (Affects Python cuDF API.), bug (Something isn't working)
