Detokenizer fixes #8039

Merged

merged 30 commits, Jul 5, 2024
Conversation

jaime-m-p
Collaborator

@jaime-m-p jaime-m-p commented Jun 20, 2024

This PR tries to solve the most common detokenization problems:

  • spaces after special tokens
  • spaces before and after apostrophes

Related issues: #8023, #7938.
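The failure mode can be sketched with a toy example (plain Python; the piece format and special-token convention are only illustrative, not llama.cpp's actual detokenizer):

```python
# Toy roundtrip illustration: detokenize(tokenize(text)) should give back
# `text`. A detokenizer that unconditionally emits a space after special
# tokens (one of the bugs this PR targets) breaks the roundtrip.

def naive_detokenize(pieces):
    out = ""
    for p in pieces:
        if p.startswith("<|") and p.endswith("|>"):
            out += p + " "                   # spurious space after special token
        else:
            out += p.replace("\u2581", " ")  # '▁' marks a word boundary (SPM style)
    return out

def fixed_detokenize(pieces):
    out = ""
    for p in pieces:
        if p.startswith("<|") and p.endswith("|>"):
            out += p                         # emit the special token verbatim
        else:
            out += p.replace("\u2581", " ")
    return out

pieces = ["<|user|>", "\u2581Hi", "<|end|>"]
print(repr(naive_detokenize(pieces)))  # '<|user|>  Hi<|end|> '  (extra spaces)
print(repr(fixed_detokenize(pieces)))  # '<|user|> Hi<|end|>'
```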


@jaime-m-p jaime-m-p added the bugfix fixes an issue or bug label Jun 20, 2024
@jaime-m-p jaime-m-p marked this pull request as draft June 20, 2024 16:58
@jaime-m-p
Collaborator Author

jaime-m-p commented Jun 20, 2024

Initial detokenizer state:

VOCABS \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - ERROR
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf ERROR OK -
ggml-vocab-deepseek-llm.gguf ERROR OK -
ggml-vocab-falcon.gguf OK OK -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - ERROR
ggml-vocab-mpt.gguf ERROR OK -
ggml-vocab-olmo.gguf ERROR OK -
ggml-vocab-phi-3.gguf OK - ERROR
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK OK -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-starcoder.gguf OK OK -

@github-actions github-actions bot added the testing Everything test related label Jun 20, 2024
@jaime-m-p
Collaborator Author

jaime-m-p commented Jun 20, 2024

Real initial state:

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - ERROR
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - ERROR
ggml-vocab-mpt.gguf OK OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - ERROR
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK ERROR -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-stablelm2.gguf OK OK -
ggml-vocab-starcoder.gguf OK ERROR -

Add detokenizer checks
New generator: ascii_lr_strip
New generator: apostrophe
Add more vocabs files
@github-actions github-actions bot added the python python script changes label Jun 20, 2024
@jaime-m-p
Collaborator Author

Brute-force encoding and decoding tests (each cell shows the number of errors; * means >= 10 errors):

VOCABS \ TESTS: added_lr_strip apostrophe ascii_lr_strip unicodes vocab_words
bert-bge * * * * *
deepseek-coder
deepseek-llm
falcon * *
gpt-2
jina-v2-code
jina-v2-de
jina-v2-en * * * * *
jina-v2-es
llama-bpe
llama-spm * * * * *
mpt *
olmo *
phi-2
phi-3 * * * * *
poro-chat
qwen2 * *
refact
smaug-bpe
stablelm2
starcoder

@jaime-m-p
Collaborator Author

jaime-m-p commented Jun 20, 2024

Improvements:

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - OK
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - OK
ggml-vocab-mpt.gguf OK OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - OK
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK ERROR -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-stablelm.gguf OK OK -
ggml-vocab-starcoder.gguf OK ERROR -

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jun 21, 2024
@cmp-nct
Contributor

cmp-nct commented Jun 23, 2024

I gave it a test run using phi3-mini-instruct.
First I ran a string through the tokenize_test.py that the model came with.
Both of these strings tokenize identically (the \n is not tokenized, which is surprising):
Test:<|user|>This is a test<|end|>\n<|assistant|>
Test:<|user|>This is a test<|end|><|assistant|>

ID: 1, Content: '<s>'
ID: 4321, Content: '▁Test'
ID: 29901, Content: ':'
ID: 32010, Content: '<|user|>'
ID: 910, Content: '▁This'
ID: 338, Content: '▁is'
ID: 263, Content: '▁a'
ID: 1243, Content: '▁test'
ID: 32007, Content: '<|end|>'
ID: 32001, Content: '<|assistant|>'

Now I ran the same through llama.cpp tokenization:

Tokens for 'Test:<|user|>This is a test<|end|>\n<|assistant|>':
  4321 -> ' Test'
 29901 -> ':'
 32010 -> '<|user|>'
   910 -> ' This'
   338 -> ' is'
   263 -> ' a'
  1243 -> ' test'
 32007 -> '<|end|>'
 29871 -> ' '
    13 -> '
'
 32001 -> '<|assistant|>'

Update:
It looks like in Python, \n characters are removed if they come after a special token (which is ironic, given the finetune templates).
The add-space-prefix logic probably works correctly in llama.cpp IF the newlines are trimmed first.
I don't know the logic behind these special handlings.

I mimicked the python tokenizer by adding that into llama.cpp:

// remove potential newlines after special tokens
if (vocab.tokenizer_add_space_prefix && is_prev_special) {
    while (raw_text.length() > 0 && raw_text[0] == '\n') {
        raw_text = raw_text.substr(1);
    }
    if (raw_text.length() == 0) { continue; }
}

right before

// prefix with space if previous is special
if (vocab.tokenizer_add_space_prefix && is_prev_special) {
    raw_text = " " + raw_text;
}

Update2
I made another change, this time to prevent mutating single spaces into double spaces.
Python doesn't add a space if a space is already present.

if (vocab.tokenizer_add_space_prefix && is_prev_special && raw_text.length() > 0 && !isspace(raw_text[0])) {

That would result in identical tokenization:

Tokens for 'Test:<|user|>This is a test<|end|><|assistant|>':
  4321 -> ' Test'
 29901 -> ':'
 32010 -> '<|user|>'
   910 -> ' This'
   338 -> ' is'
   263 -> ' a'
  1243 -> ' test'
 32007 -> '<|end|>'
 32001 -> '<|assistant|>'
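The two fixes above can be summarized in a Python sketch, under the assumption that they run on the raw text segment following a special token (function and parameter names are illustrative, not llama.cpp identifiers):

```python
# (1) Strip leading newlines after a special token, mimicking the Python
# tokenizer; (2) prepend the space prefix only if the text does not
# already start with whitespace, so single spaces don't become double.

def prepare_after_special(raw_text, add_space_prefix=True, is_prev_special=True):
    if add_space_prefix and is_prev_special:
        raw_text = raw_text.lstrip("\n")   # fix 1: drop newlines after special token
        if not raw_text:
            return ""                      # nothing left to tokenize
        if not raw_text[0].isspace():
            raw_text = " " + raw_text      # fix 2: avoid double spaces
    return raw_text

print(repr(prepare_after_special("\nThis is a test")))  # ' This is a test'
print(repr(prepare_after_special(" already spaced")))   # ' already spaced'
```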

jaime-m-p added 4 commits June 23, 2024 20:49
Useful when automating tests:
 - If you don't know in advance the vocab type.
 - Differentiate other loading errors.
Using exit() is throwing random exceptions
UNKNOWN and CONTROL are 'special pieces'.
Remove space after UNKNOWN and CONTROL.
Refactor llama_token_to_piece().
@jaime-m-p
Collaborator Author

The models baichuan, falcon and mpt have tokenization errors, so detokenization fails too.

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf OK OK -
ggml-vocab-baichuan.gguf ERROR - OK
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf OK OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - OK
ggml-vocab-mpt.gguf ERROR OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - OK
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK OK -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-starcoder.gguf OK OK -

@jaime-m-p
Collaborator Author

@cmp-nct

It looks like in python \n are removed if they come after a special token

Not all special tokens; see the lstrip and rstrip attributes in the tokenizer.json file:

    {
      "id": 32007,
      "content": "<|end|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": TRUE,
      "normalized": false,
      "special": true
    },

You can see the lstrip and rstrip implementation in #7749.
But since these attributes are not yet stored in the GGUF file, I had to hardcode model names (phi-3 and jina-*).

llama.cpp/llama.cpp, lines 5202 to 5217 @ e112b61:

// set attributes by model/tokenizer name
if (_contains_any(tokenizer_pre, {"jina-v2-de", "jina-v2-es", "jina-v2-code"})) {
    _set_token_attr("<mask>", LLAMA_TOKEN_ATTR_LSTRIP, true);
} else if (_contains_any(model_name, {"phi-3", "phi3"})) {
    for (auto id : vocab.cache_special_tokens) {
        _set_tokenid_attr(id, LLAMA_TOKEN_ATTR_RSTRIP, true);
    }
    for (auto token : {"</s>"}) {
        _set_token_attr(token, LLAMA_TOKEN_ATTR_RSTRIP, true);
    }
    for (auto token : {"<unk>", "<s>", "<|endoftext|>"}) {
        _set_token_attr(token, LLAMA_TOKEN_ATTR_RSTRIP, false);
    }
}
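What rstrip/lstrip mean during tokenization can be sketched with a simplified toy re-implementation (this is an illustration, not the HF tokenizers or llama.cpp code): whitespace adjacent to a special token is absorbed into the token's match and discarded, so it never reaches the text tokenizer.

```python
import re

def split_on_special(text, token, rstrip=False, lstrip=False):
    pat = re.escape(token)
    if lstrip:
        pat = r"\s*" + pat   # absorb whitespace *before* the token
    if rstrip:
        pat = pat + r"\s*"   # absorb whitespace *after* the token
    parts = []
    for piece in re.split("(" + pat + ")", text):
        if not piece:
            continue
        # a piece matching the pattern maps to the bare special token;
        # the absorbed whitespace is dropped
        parts.append(token if re.fullmatch(pat, piece) else piece)
    return parts

# With rstrip=True (as phi-3's <|end|> declares), the trailing '\n' vanishes:
print(split_on_special("a<|end|>\nb", "<|end|>", rstrip=True))  # ['a', '<|end|>', 'b']
print(split_on_special("a<|end|>\nb", "<|end|>"))               # ['a', '<|end|>', '\nb']
```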

If llama.cpp is inserting a \n after <|end|>, then either there is a bug, you are using a previous version, or the model name does not match.
Can you please check?

I tried your example and got a different result!
Test:<|user|>This is a test<|end|>\n<|assistant|> // input
Same tokenization: [1, 4321, 29901, 32010, 910, 338, 263, 1243, 32007, 32001].
<s> Test:<|user|> This is a test<|end|><|assistant|> // AutoTokenizer output (skip_special_tokens=False)
<s>Test:<|user|> This is a test<|end|><|assistant|> // Llama.cpp output (special=True)
Test: This is a test // AutoTokenizer output (skip_special_tokens=True)
Test: This is a test // Llama.cpp output (special=False)

@cmp-nct
Contributor

cmp-nct commented Jun 24, 2024

@jaime-m-p
Oh, I didn't know about that behavior. Explains a lot!

  1. The \n was manually added by me, based on the official chat template from Microsoft.
    But as you pointed out: <|end|> has "rstrip": true, so the newline is removed.

  2. model name special behavior
    That's great to know. It also explains why it's not working for me!
    I am testing this on a recent xtuner phi3-mini model (llava type) which comes with generic.name == "LLaMA v2"

I'll repeat the tests after fixing those issues and reverting my changes. Given your results that's promising.

It's a little troublesome that such errors can sneak into a model so easily; they are very hard to notice, and even harder to fix without blindly recreating the model from the originals.
Likely the majority of GGUF files on Hugging Face are flawed in some way.

@Harsha-Nori

Harsha-Nori commented Jun 24, 2024

Hi @jaime-m-p and @cmp-nct, really grateful you both are looking into this! I'm traveling without reliable access to a computer at the moment, but wanted to ask if these fixes now keep stability on retokenization with Phi-3 (i.e. the roundtrip of text -> tokens -> text -> tokens results in the same tokens). The constant whitespace insertion on each cycle was causing serious kv-cache reuse issues on our side and I'm really hopeful that this update resolves it!

jaime-m-p added 2 commits June 24, 2024 20:47
Detokenize special tokens.
Replace errors with '\uFFFD' when detokenizing to 'utf-8'.
More edge cases.
Better detokenization results check.
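The '\uFFFD' mentioned in the commit above is the Unicode replacement character; substituting it for invalid byte sequences is standard UTF-8 error handling, e.g. what Python's errors='replace' does:

```python
# Invalid UTF-8 byte sequences decode to U+FFFD instead of raising,
# so a detokenizer can always produce a valid string.

good = "ascii ok".encode("utf-8")
bad = b"\xc3\x28"   # 0xC3 starts a 2-byte sequence; 0x28 ('(') is not a valid continuation

decoded = bad.decode("utf-8", errors="replace")
print(good.decode("utf-8", errors="replace"))  # ascii ok
print(decoded == "\ufffd(")                    # True: the invalid byte became U+FFFD
```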
@jaime-m-p
Collaborator Author

jaime-m-p commented Jun 24, 2024

Overall current tokenize and detokenize state.

WPM models (bert-bge, jina-v2-en) are still broken, probably due to the missing unicode NFD normalization.

BPE models qwen2, olmo and mpt are probably failing due to the missing unicode NFC normalization.
See "normalizer": { "type": "NFC" } in their tokenizer.json files.
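For reference, the NFC/NFD distinction can be seen with Python's stdlib (an illustration of the normalization issue, not part of this PR): a tokenizer trained on NFC text will mis-tokenize NFD input, and vice versa.

```python
import unicodedata

# NFD splits 'é' into 'e' + a combining accent; NFC recombines it.
s = "caf\u00e9"                          # 'café', precomposed (NFC form)
nfd = unicodedata.normalize("NFD", s)    # 'cafe' + U+0301 combining acute

print(len(s), len(nfd))                            # 4 5
print(unicodedata.normalize("NFC", nfd) == s)      # True
```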

All BPE and SPM models seem to detokenize properly.

Each cell shows the number of tokenization and detokenization errors (capped at 10). An empty cell means 0 errors.
NOTE: There are more failing tests not included in this table.

VOCABS \ TESTS: added_lr_strip apostrophe ascii_lr_strip unicodes vocab_words
baichuan 10 - 10 10 - 10 10 - 10 10 - 10 10 - 10
bert-bge 0 - 10 0 - 10 0 - 10 10 - 10 0 - 10
deepseek-coder
deepseek-llm
falcon 10 - 0 10 - 0
gpt-2
jina-v2-code
jina-v2-de
jina-v2-en 0 - 10 0 - 10 0 - 10 10 - 10 0 - 10
jina-v2-es
llama-bpe
llama-spm
mpt 10 - 0 10 - 0 10 - 0 10 - 0
olmo 10 - 0
phi-2
phi-3
poro-chat
qwen2 10 - 0 10 - 0
refact
smaug-bpe
stablelm2
starcoder

@jaime-m-p
Collaborator Author

@Harsha-Nori

Hi @jaime-m-p and @cmp-nct, really grateful you both are looking into this! I'm traveling without reliable access to a computer at the moment, but wanted to ask if these fixes now keep stability on retokenization with Phi-3 (i.e. the roundtrip of text -> tokens -> text -> tokens results in the same tokens). The constant whitespace insertion on each cycle was causing serious kv-cache reuse issues on our side and I'm really hopeful that this update resolves it!

AutoTokenizer is not completing this roundtrip either for some models.

llama-bpe
                 ' \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # input text
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz  z!z "z $z %z &z (z )z *z +z,z -'  # AutoTokenizer
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz  z!z "z $z %z &z (z )z *z +z,z -'  # Llama.cpp
phi-3
    ' \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # input text
'<s>  \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # AutoTokenizer
'<s>  \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # Llama.cpp

llama-bpe removes spaces before some punctuation characters, so re-tokenization differs.
phi-3 seems more consistent; maybe removing the BOS token and the first space is enough.

Probably a few models can achieve this, but information can be lost in tokenization (normalization, lstrip, rstrip, etc.).
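The stability property being asked about can be expressed as a small check (a sketch; tokenize/detokenize stand in for the real implementations): even if the first roundtrip changes the text (BOS token, leading space), repeated cycles should reach a fixed point instead of accumulating whitespace forever.

```python
def reaches_fixed_point(text, tokenize, detokenize, max_cycles=3):
    prev = tokenize(text)
    for _ in range(max_cycles):
        cur = tokenize(detokenize(prev))
        if cur == prev:
            return True   # stable: re-tokenization reproduces the same tokens
        prev = cur
    return False          # still drifting, e.g. growing spaces each cycle

# Toy demo: a whitespace-collapsing pair is stable after one cycle...
print(reaches_fixed_point("a  b", str.split, " ".join))             # True
# ...while a detokenizer that prepends a space each cycle never settles:
print(reaches_fixed_point("ab", list, lambda t: " " + "".join(t)))  # False
```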

@Harsha-Nori

Hmm, great point. I think what I'm really hoping for is eventual stability on the second or third tokenize/detokenize cycles -- before your PR, Phi-3 had the problem of constantly changing the token_id at index 1 (due to growing spaces), which really caused issues.

I think this set of changes is good enough to solve most of our problems :).

@jaime-m-p
Collaborator Author

No more detokenization improvements for now.

All remaining detokenization problems are related to NFD/NFC normalization (MPT, OLMO, QWEN2 and all WPM models).

BAICHUAN has a different kind of problem, maybe due to unexpected byte tokens in the vocab.

Also tested the recently added GEMMA, VIKING and JAIS models.

Each cell shows the number of tokenization and detokenization errors (capped at 10). An empty cell means 0 errors.
NOTE: There are more failing tests not included in this table.

VOCABS \ TESTS: added_lr_strip apostrophe ascii_lr_strip unicodes vocab_words
baichuan 10 - 10 10 - 0 10 - 0 10 - 0 10 - 0
bert-bge 0 - 10 10 - 10 0 - 10 10 - 10 0 - 10
deepseek-coder
deepseek-llm
falcon 10 - 0 10 - 0
gemma 10 - 0 10 - 0
gemma-2 10 - 0 10 - 0
gpt-2
jais
jina-v2-code
jina-v2-de
jina-v2-en 0 - 10 10 - 10 0 - 10 10 - 10 0 - 10
jina-v2-es
llama-bpe
llama-spm
mpt 10 - 0 10 - 0 10 - 0 10 - 0
olmo 10 - 0
phi-2
phi-3
poro-chat
qwen2 10 - 0 10 - 0
refact
smaug-bpe
stablelm2
starcoder
viking 10 - 0 10 - 0

@jaime-m-p jaime-m-p marked this pull request as ready for review July 4, 2024 22:15
@jaime-m-p
Collaborator Author

I will resume WPM and NFD/NFC normalizations later. I have been collecting more tokenization and vocab problems.

NOTE: I know nothing about Swift; I hope this 6d233bc is correct.

@jaime-m-p jaime-m-p requested a review from ggerganov July 4, 2024 22:39
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 5, 2024
style: spaces

Update bruteforce test: add more models

Update bruteforce test: header files location

Better leading space removal

'viking' detokenizer clean spaces

style : remove spaces

Update bruteforce test

Better leading space removal

Symmetric params for llama_tokenize() and llama_detokenize()

Update brute force test:

Detokenize special tokens.
Replace errors with '\uFFFD' when detokenizing to 'utf-8'.
More edge cases.
Better detokenization results check.

Bugfix: custom regexes split undefined unicode codepoints

style: remove trailing whitespace

Do not remove space when decoding special tokens

Fix detokenizer():

UNKNOWN and CONTROL are 'special pieces'.
Remove space after UNKNOWN and CONTROL.
Refactor llama_token_to_piece().

tests: skip unicode surrogates and undefined

tests: gracefully exit threads

Using exit() is throwing random exceptions

tests: unexpected vocab type as test fail instead of error

Useful when automating tests:
 - If you don't know in advance the vocab type.
 - Differentiate other loading errors.

Add tokenizer flag: clean_up_tokenization_spaces

Remove previous space

Remove previous space

Fix add_space_prefix, set false by default

Update bruteforce random tests

Add detokenizer checks
New generator: ascii_lr_strip
New generator: apostrophe
Add more vocabs files

Clean old known problematic codepoints

minor: confusing hexadecimal codepoint

Fix tokenizer tests

Using llama_tokenize() in tests

Using llama_tokenize() in tests

Add llama_detokenize()
@ggerganov ggerganov left a comment (Owner)

Nice work

tests/test-tokenizer-1-bpe.cpp (review thread: outdated, resolved)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@jaime-m-p jaime-m-p merged commit 213701b into ggerganov:master Jul 5, 2024
50 of 53 checks passed
@jaime-m-p
Collaborator Author

Oh, sorry. I forgot to change the commit title to add the module prefix.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 6, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 8, 2024
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests:
    - If you don't know in advance the vocab type
    - Differentiate other loading errors
  - Skip unicode surrogates and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined unicode codepoints
* 'viking' detokenizer clean spaces
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 11, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests:
    - If you don't know in advance the vocab type
    - Differenciate other loading errors
  - Skip unicode surrogaes and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexs splits undefined unicode codepoints
* 'viking' detokenizer clean spaces
Labels
bugfix (fixes an issue or bug) · examples · python (python script changes) · Review Complexity : Medium (generally requires more time to grok but manageable by beginner to medium expertise level) · testing (everything test related)

5 participants