forked from LostRuins/koboldcpp
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Detokenizer fixes ggerganov#8039 (28)
* style: spaces
* Update bruteforce test: add more models
* Update bruteforce test: header files location
* Better leading space removal
* 'viking' detokenizer clean spaces
* style : remove spaces
* Update bruteforce test
* Better leading space removal
* Symmetric params for llama_tokenize() and llama_detokenize()
* Update brute force test:
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'.
  - More edge cases.
  - Better detokenization results check.
* Bugfix: custom regexes split undefined unicode codepoints
* style: remove trailing whitespace
* Do not remove space when decoding special tokens
* Fix detokenizer():
  - UNKNOWN and CONTROL are 'special pieces'.
  - Remove space after UNKNOWN and CONTROL.
  - Refactor llama_token_to_piece().
* tests: skip unicode surrogates and undefined
* tests: gracefully exit threads
  Using exit() is throwing random exceptions
* tests: unexpected vocab type as test fail instead of error
  Useful when automating tests:
  - If you don't know in advance the vocab type.
  - Differentiate other loading errors.
* Add tokenizer flag: clean_up_tokenization_spaces
* Remove previous space
* Remove previous space
* Fix add_space_prefix, set false by default
* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Clean old known problematic codepoints
* minor: confusing hexadecimal codepoint
* Fix tokenizer tests
* Using llama_tokenize() in tests
* Using llama_tokenize() in tests
* Add llama_detokenize()
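The commit message mentions replacing errors with '\uFFFD' when detokenizing to 'utf-8'. A minimal Python sketch of that idea follows. It is an illustration only, not the code from this commit: the helper name `decode_piece_bytes` is hypothetical, and the sketch assumes the detokenizer hands back raw bytes that may end in the middle of a multi-byte character.

```python
def decode_piece_bytes(raw: bytes) -> str:
    """Decode raw detokenizer bytes as UTF-8.

    Hypothetical helper for illustration: invalid or incomplete byte
    sequences become U+FFFD instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


# "é" is encoded as the two bytes 0xC3 0xA9. A token boundary can fall
# between them, leaving a dangling lead byte at the end of the buffer.
complete = decode_piece_bytes(b"caf\xc3\xa9")
truncated = decode_piece_bytes(b"caf\xc3")

print(repr(complete))   # the full character is decoded normally
print(repr(truncated))  # the dangling byte shows up as '\ufffd'
```

Decoding with `errors="replace"` keeps a round-trip test running on such inputs, so the comparison between original and detokenized text can report a mismatch rather than abort on an exception.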
Showing 11 changed files with 1,358 additions and 143 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.