Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.str.find fix for pd.ArrowDtype(pa.string()) #56792

Merged
merged 16 commits into from
Feb 2, 2024

Conversation

rohanjain101
Copy link
Contributor

@rohanjain101 rohanjain101 commented Jan 9, 2024

@rohanjain101 rohanjain101 changed the title Find all fix for pd.ArrowDtype(pa.string()) Series.str.find fix for pd.ArrowDtype(pa.string()) Jan 9, 2024
@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Jan 9, 2024
@rohanjain101 rohanjain101 marked this pull request as draft January 9, 2024 18:40
@rohanjain101 rohanjain101 marked this pull request as ready for review January 9, 2024 20:27
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test for a Series with different length strings to ensure the custom implementation is robust to this case?

@rohanjain101
Copy link
Contributor Author

Extended e2e test to cover strings of different lengths.

@@ -2368,20 +2368,27 @@ def _str_fullmatch(
return self._str_match(pat, case, flags, na)

def _str_find(self, sub: str, start: int = 0, end: int | None = None) -> Self:
if start != 0 and end is not None:
if (start == 0 or start is None) and end is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you've looked at using a slice object here already? It feels like that could help simplify a lot of the branching being done here for things being None / 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a slice object would help here, since I believe for this method, we need a positive start index, to calculate the offset of the substring from the original string. This PR essentially converts start into equivalent, but positive index, then it can be added to the resulting index returned by pc.find_substring. For example, if string is "abc" and start is -1, we calculate start_offset as 2, which is the equivalent positive index. We can use this offset to find the index of the substring in the original string, by adding it the position of the substring result.

@rohanjain101
Copy link
Contributor Author

@WillAyd @mroeschke

Any thoughts on this PR? I agree conditional code is not good, but I don't have a better solution with current arrow compute surface area. Should I close this PR if this solution is not desirable?

@mroeschke
Copy link
Member

Sorry this PR slipped through. I am OK with the changes in this PR could you merge in main once more?

@rohanjain101
Copy link
Contributor Author

Merged.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also lgtm. I don't think ci is related? @mroeschke merge whenever

@mroeschke mroeschke added this to the 3.0 milestone Feb 2, 2024
@mroeschke mroeschke merged commit 4f0870e into pandas-dev:main Feb 2, 2024
46 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @rohanjain101

@rohanjain101 rohanjain101 deleted the find_all branch February 2, 2024 23:51
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* fix find

* gh reference

* add test for Nones

* fix min version compat

* restore test

* improve test cases

* fix empty string

* inline

* improve tests

* fix

* Revert "fix"

This reverts commit 7fa21eb.

* fix

* merge

* inline

---------

Co-authored-by: Rohan Jain <rohanjain@microsoft.com>
@jorisvandenbossche jorisvandenbossche modified the milestones: 3.0, 2.3 Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.str.find with pd.ArrowDtype(pa.string()) produces unexpected results
4 participants