Optimize regular expressions used for splitting by ~20% (#234)
By combining the contractions into a single non-capturing group prefixed
by `'`, we can speed up matches by roughly 20%.
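As a rough illustration of the contraction change, the sketch below compares two fragments with Python's standard `re` module. The fragment strings and the helper name `contractions` are assumptions made for this example, not text copied from the diff; the point is only that factoring out the leading `'` leaves the set of matches unchanged.

```python
import re

# Assumed "before" fragment: every alternative repeats the apostrophe.
OLD_CONTRACTIONS = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
# Assumed "after" fragment: the apostrophe is matched once, then one suffix.
NEW_CONTRACTIONS = re.compile(r"'(?i:[sdmt]|ll|ve|re)")


def contractions(pattern, text):
    """Return (start offset, matched text) for every contraction found."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


SAMPLES = [
    "I'm sure they'll say it's fine",
    "WE'VE SEEN YOU'RE RIGHT",
    "don't 'x' he'd",
    "''ll '",
]

for sample in SAMPLES:
    # Both fragments should report identical matches at identical offsets.
    assert contractions(OLD_CONTRACTIONS, sample) == contractions(NEW_CONTRACTIONS, sample)
```

With the apostrophe factored out, the engine can reject a position after a single character comparison instead of trying each alternative in turn.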
By using possessive quantifiers in the word and punctuation groups of
the `cl100k_base` pattern, we avoid some backtracking.
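A minimal sketch of the possessive-quantifier idea follows, under several assumptions: Python's `re` only accepts `?+` and `++` from version 3.11, it has no `\p{L}`, so `[^\W\d_]` stands in for "letter", and the word fragment is simplified rather than taken from the actual pattern. The names `GREEDY_WORD`, `POSSESSIVE_WORD` and `word_matches` are hypothetical.

```python
import re
import sys

# Assumption: [^\W\d_] approximates \p{L}, which stdlib `re` does not support.
LETTER = r"[^\W\d_]"

# Simplified word fragment: an optional non-word, non-newline prefix, then letters.
GREEDY_WORD = re.compile(rf"[^\r\n\w]?{LETTER}+")

# Possessive quantifiers need Python 3.11+, so compile that variant only there.
SUPPORTS_POSSESSIVE = sys.version_info >= (3, 11)
POSSESSIVE_WORD = (
    re.compile(rf"[^\r\n\w]?+{LETTER}++") if SUPPORTS_POSSESSIVE else None
)


def word_matches(pattern, text):
    """Return every word-like match of `pattern` in `text`."""
    return pattern.findall(text)


if SUPPORTS_POSSESSIVE:
    for sample in ("Hello, wörld!", " leading space", "a1b2 c_d", "tabs\tand\nnewlines"):
        # Nothing after the quantifiers needs characters handed back,
        # so forbidding backtracking does not change the result.
        assert word_matches(GREEDY_WORD, sample) == word_matches(POSSESSIVE_WORD, sample)
```

The possessive form is safe here because the prefix class and the letter class are disjoint: if the letters fail to match after a prefix, retrying without the prefix would fail on the same character.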
The trailing whitespace groups can also be simplified to match a single
newline explicitly, since the preceding whitespace quantifier would
already match any other newlines.
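The whitespace simplification can be checked in isolation. The two fragments below are assumed forms of the newline group (with and without the `+` on the newline class), and `newline_runs` is a hypothetical helper. Because the greedy `\s*` only backs off as far as the last newline in a whitespace run, both fragments end their match at the same place.

```python
import re

# Assumed "before" fragment: whitespace, then one or more newline characters.
OLD_NEWLINES = re.compile(r"\s*[\r\n]+")
# Assumed "after" fragment: whitespace, then exactly one newline character.
NEW_NEWLINES = re.compile(r"\s*[\r\n]")


def newline_runs(pattern, text):
    """Return each whitespace run that ends in a newline, as matched by `pattern`."""
    return [m.group() for m in pattern.finditer(text)]


for sample in ("a \n\n  b", "x\r\n\r\ny", "no newline here", "\n", " \t\n \n"):
    # Earlier newlines are consumed by \s*, so a single explicit one suffices.
    assert newline_runs(OLD_NEWLINES, sample) == newline_runs(NEW_NEWLINES, sample)
```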
Overall, the regex matches exactly the same sequences of characters as
before for any case and for Unicode sequences.
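One way to gain confidence in an equivalence claim like this is a randomized differential check. The sketch below is not the test used for this change; it is an assumed harness that compares two stdlib-compatible fragments (contractions plus the whitespace tail) on seeded random ASCII strings. `OLD`, `NEW`, `ALPHABET` and `first_divergence` are hypothetical names.

```python
import random
import re

# Assumed fragments restricted to constructs that stdlib `re` supports.
OLD = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|\s*[\r\n]+|\s+(?!\S)|\s+")
NEW = re.compile(r"'(?i:[sdmt]|ll|ve|re)|\s*[\r\n]|\s+(?!\S)|\s+")

# Small ASCII alphabet chosen to hit contractions and whitespace often.
ALPHABET = "'sStTrReEvVmMlLdD x\t\r\n"


def spans(pattern, text):
    """Return the (start, end) span of every match, in order."""
    return [m.span() for m in pattern.finditer(text)]


def first_divergence(old, new, trials=2000, max_len=12, seed=0):
    """Return the first random string on which the patterns differ, else None."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    for _ in range(trials):
        length = rng.randint(0, max_len)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        if spans(old, text) != spans(new, text):
            return text
    return None


assert first_divergence(OLD, NEW) is None
```

A passing run is evidence rather than proof; it covers only the sampled strings and the ASCII alphabet chosen here.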
Co-authored-by: Lőrinc <lorinc.pap@gmail.com>