wpm : portable unicode tolower #6305

Merged
merged 7 commits into from
Mar 26, 2024

cebtenzzre
Collaborator

@cebtenzzre cebtenzzre commented Mar 25, 2024

This is a portable implementation of Unicode tolower for BERT embedding models that use the WPM tokenizer.

We need this because we can't assume that the en_US.UTF-8 locale is available; see #5740 (comment).
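
To illustrate the idea of a locale-independent tolower, here is a minimal sketch that lowercases a single code point from a static table instead of calling the locale-dependent `towlower`. It is not the code from this PR: the names `k_lowercase_map` and `codepoint_tolower` are hypothetical, and the table is a tiny hand-picked subset, whereas a real table would be generated from the Unicode Character Database.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>

// Hypothetical, deliberately tiny mapping table, sorted by the first element
// (uppercase code point -> lowercase code point). Illustration only.
static const std::pair<uint32_t, uint32_t> k_lowercase_map[] = {
    {0x00C0, 0x00E0}, // LATIN CAPITAL LETTER A WITH GRAVE
    {0x00C9, 0x00E9}, // LATIN CAPITAL LETTER E WITH ACUTE
    {0x0391, 0x03B1}, // GREEK CAPITAL LETTER ALPHA
    {0x0416, 0x0436}, // CYRILLIC CAPITAL LETTER ZHE
};

// Lowercase one code point without consulting the process locale.
// Code points that are not in the table are returned unchanged.
static uint32_t codepoint_tolower(uint32_t cp) {
    // Fast path for ASCII letters.
    if (cp >= 'A' && cp <= 'Z') {
        return cp + ('a' - 'A');
    }
    // Binary search over the sorted table.
    const auto * end = std::end(k_lowercase_map);
    const auto * it  = std::lower_bound(std::begin(k_lowercase_map), end, cp,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != end && it->first == cp) {
        return it->second;
    }
    return cp;
}
```

Because the result depends only on the table, it is the same on every platform, regardless of which locales are installed.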

Wikitext tokenizer diff with this change (same as before):

--- good_tokens.txt	2024-03-25 16:26:39.506423621 -0400
+++ lcpp_tokens.txt	2024-03-25 16:26:39.970426376 -0400
@@ -200554,7 +200554,6 @@
 1337: ष
 29870: ##ल
 29869: ##र
-29879: ##ो
 29863: ##न
 1317: ग
 1000: "
@@ -200633,7 +200632,11 @@
 29836: ##و
 29817: ##ت
 25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
 23856: kota
 16183: sal
 6784: ##ud

@cebtenzzre cebtenzzre marked this pull request as ready for review March 25, 2024 20:27
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Mar 25, 2024
excludes unicodedata.cpp split

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre merged commit 32c8486 into master Mar 26, 2024
57 of 58 checks passed
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.