@alexdowad
Contributor

Similar to the fast, specialized mb_strcut implementation for UTF-8 in 1f0cf13, this new implementation of mb_strcut for UTF-16 strings just examines a few bytes before each cut point.

Even for short strings, the new implementation is around 2x faster. For strings around 10,000 bytes in length, it comes out about 100-500x faster in my microbenchmarks.

The new implementation behaves identically to the old one on valid UTF-16 strings; a fuzzer was used to help verify this.
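
To illustrate the idea of examining only a few bytes before each cut point, here is a minimal Python sketch. It is not the C code from this PR. It assumes big-endian, valid UTF-16 input, and the helper names `adjust_cut_utf16be` and `cut_utf16be` are hypothetical. The way `start` and `length` are combined is also an assumption and may not match `mb_strcut` exactly.

```python
def adjust_cut_utf16be(data: bytes, pos: int) -> int:
    """Move a byte offset back so it does not split a UTF-16BE character.

    Assumes `data` is valid UTF-16BE. Only the two bytes just before the
    cut point are inspected, so the cost does not depend on len(data).
    """
    pos = max(0, min(pos, len(data)))
    pos -= pos % 2  # a code unit is 2 bytes; round down to a unit boundary
    if pos >= 2:
        unit = (data[pos - 2] << 8) | data[pos - 1]
        # A high surrogate (0xD800-0xDBFF) right before the cut means the
        # cut would separate it from its low surrogate: step back one unit.
        if 0xD800 <= unit <= 0xDBFF:
            pos -= 2
    return pos


def cut_utf16be(data: bytes, start: int, length: int) -> bytes:
    """Hypothetical byte-oriented cut built on the adjustment above."""
    begin = adjust_cut_utf16be(data, start)
    end = adjust_cut_utf16be(data, begin + max(0, length))
    return data[begin:end]


if __name__ == "__main__":
    s = "a\U0001F600b".encode("utf-16-be")  # 2 + 4 + 2 bytes
    print(cut_utf16be(s, 0, 4).decode("utf-16-be"))  # cut falls inside the pair
    print(cut_utf16be(s, 2, 4).decode("utf-16-be"))  # the whole surrogate pair
```

Each cut point needs at most one code unit to be read, which is consistent with the speedup growing with string length.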

@Girgias @cmb69 @youkidearitai @kamil-tekiela @iluuu1994

Member

@Girgias Girgias left a comment

Just some questions to make sure I understand everything, but the implementation looks sound to me.

@alexdowad
Contributor Author

Landed on master.

@alexdowad alexdowad closed this Oct 28, 2023
@alexdowad alexdowad deleted the cututf16 branch October 28, 2023 17:16
3 participants