Fix stripping strings containing Unicode characters #707

Merged
merged 2 commits into from
May 24, 2021

Conversation

Narsil
Collaborator

@Narsil Narsil commented May 19, 2021

Fixes #706

  • Includes a failing test and the fix for it.

  • This function could probably be optimized: we now scan the string three times,
    plus one full pass over its chars (not done here).

  • Fixes related to rustc 1.52 are in another PR.
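
For readers unfamiliar with this class of bug, here is a minimal sketch of Unicode-aware stripping in Rust. It is not the code merged in this PR: `strip_sketch` and its `left`/`right` flags are hypothetical names chosen for illustration, and the assumption that the trouble comes from mixing byte lengths with char counts describes the general pitfall, not the actual diff. The sketch counts everything in chars, and it makes the same three passes over the string mentioned above before a final pass to collect the result.

```rust
/// Illustrative sketch only; not the implementation merged in this PR.
/// Returns `s` with leading and/or trailing whitespace removed,
/// counting in chars rather than bytes.
fn strip_sketch(s: &str, left: bool, right: bool) -> String {
    // Pass 1: total number of chars (differs from `s.len()`, which is bytes).
    let total = s.chars().count();

    // Pass 2: number of leading whitespace chars.
    let leading = if left {
        s.chars().take_while(|c| c.is_whitespace()).count()
    } else {
        0
    };

    // Pass 3: number of trailing whitespace chars, scanning from the end.
    let trailing = if right {
        s.chars().rev().take_while(|c| c.is_whitespace()).count()
    } else {
        0
    };

    // Clamp so that an all-whitespace input yields an empty range
    // instead of an underflow when both sides are stripped.
    let end = (total - trailing).max(leading);

    // Final pass: keep only the chars inside [leading, end).
    s.chars().skip(leading).take(end - leading).collect()
}
```

With this sketch, `strip_sketch("你好  ", false, true)` yields `"你好"`. Note that `"你好".len()` is 6 (bytes) while `"你好".chars().count()` is 2, which is why byte and char offsets must not be mixed.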

@Narsil Narsil requested a review from n1t0 May 19, 2021 14:35
@n1t0 n1t0 changed the title Strip seems to have been broken for a while on unicode strings. Fix stripping strings containing Unicode characters May 24, 2021
Contributor

@n1t0 n1t0 left a comment

Thanks @Narsil for fixing this!

@n1t0 n1t0 merged commit c046da7 into master May 24, 2021
@n1t0 n1t0 deleted the fix_strip branch May 24, 2021 20:50
Successfully merging this pull request may close these issues.

tokenizers.normalizers.Strip can not remove all whitespace characters on the right side of chinese character