Detokenizer fixes #8039

Merged
merged 30 commits into from
Jul 5, 2024
Commits
eea8dfa
Add llama_detokenize()
Jun 20, 2024
d779bab
Using llama_tokenize() in tests
Jun 20, 2024
40a6660
Using llama_tokenize() in tests
Jun 20, 2024
16a7503
Fix tokenizer tests
Jun 20, 2024
03dbcc8
minor: confusing hexadecimal codepoint
Jun 20, 2024
071bf42
Clean old known problematic codepoints
Jun 20, 2024
064b35e
Update bruteforce random tests
Jun 20, 2024
503b753
Fix add_space_prefix, set false by default
Jun 20, 2024
0cc6593
Remove previous space
Jun 20, 2024
6d233bc
Remove previous space
Jun 20, 2024
b452e82
Add tokenizer flag: clean_up_tokenization_spaces
Jun 21, 2024
9af762c
tests: unexpected vocab type as test fail instead of error
Jun 23, 2024
0cf2989
tests: gracefully exit threads
Jun 23, 2024
38d54b3
tets: skip unicode surrogaes and undefined
Jun 23, 2024
44c8648
Fix detokenizer():
Jun 23, 2024
9eb0fca
Do not remove space when decoding special tokens
Jun 24, 2024
12e2c31
style: remove trailing whitespace
Jun 24, 2024
95a0df5
Bugfix: custom regexs splits undefined unicode codepoints
Jun 24, 2024
4a28063
Update brute force test:
Jun 24, 2024
9854a9c
Symetric params for llama_tokenize() and llama_detokenize()
Jun 25, 2024
107923c
Better leading space removal
Jun 25, 2024
68220fe
Update bruteforce test
Jun 25, 2024
98fc182
style : remove spaces
Jul 4, 2024
8072089
Merge commit 'f8c4c073' into detokenizer
Jul 4, 2024
8f5e1e0
'viking' detokenizer clean spaces
Jul 4, 2024
2f15019
Better leading space removal
Jul 4, 2024
11ac641
Update bruteforce test: header files location
Jul 4, 2024
4db8c0d
Update bruteforce test: add more models
Jul 4, 2024
906476f
style: spaces
Jul 4, 2024
0137683
style: spaces
jaime-m-p Jul 5, 2024
style : remove spaces
jaime-m-p committed Jul 4, 2024
commit 98fc182312700be88f67db24156209b087d3e9d1
2 changes: 1 addition & 1 deletion llama.cpp
@@ -5098,7 +5098,7 @@ static void llm_load_vocab(
             }
         }
 
-        std::sort( vocab.cache_special_tokens.begin(), vocab.cache_special_tokens.end(),
+        std::sort(vocab.cache_special_tokens.begin(), vocab.cache_special_tokens.end(),
             [&] (const llama_vocab::id a, const llama_vocab::id b) {
                 return vocab.id_to_token[a].text.size() > vocab.id_to_token[b].text.size();
             }
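For readers unfamiliar with the surrounding code, the comparator in this hunk orders the cached special-token ids by descending text length, presumably so that longer special tokens are considered before shorter ones. The following is a minimal standalone sketch of that ordering. The names `token_id`, `token_data`, and `sort_ids_by_text_length_desc` are hypothetical stand-ins for illustration and are not the real llama.cpp definitions.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the vocab types; NOT the real llama.cpp definitions.
using token_id = int32_t;

struct token_data {
    std::string text;
};

// Return the given ids reordered so that ids whose token text is longer come first.
// The comparator mirrors the one in the hunk: compare text sizes with '>'.
// Note: std::sort is not stable, so ids with equal text length may end up in any order.
std::vector<token_id> sort_ids_by_text_length_desc(
        const std::vector<token_data> & id_to_token,
        std::vector<token_id> ids) {
    std::sort(ids.begin(), ids.end(),
        [&] (const token_id a, const token_id b) {
            return id_to_token[a].text.size() > id_to_token[b].text.size();
        }
    );
    return ids;
}
```

Called with a table whose entries are "<s>", "<|end_of_text|>", "</s>", and "<|eot|>" at ids 0 to 3, the function returns the ids ordered from the longest text to the shortest. If a deterministic order among equal-length tokens matters, `std::stable_sort` would be the usual substitute; the hunk itself uses `std::sort`.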