Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infill Incorrect Tokenization #3503

Closed
kherud opened this issue Oct 6, 2023 · 15 comments · Fixed by #3508
Closed

Infill Incorrect Tokenization #3503

kherud opened this issue Oct 6, 2023 · 15 comments · Fixed by #3508
Labels
bug Something isn't working high priority Very important issue

Comments

@kherud
Copy link
Contributor

kherud commented Oct 6, 2023

I am comparing the tokenization of the codellama repository with the infill example of this repository.

The first example prompt from the codellama repository consists of the strings:

  • Prefix: 'def remove_non_ascii(s: str) -> str:\n """ '
  • Suffix: '\n return result\n'

Comparing the tokenization of both implementations results in:

  • CodeLlama: 1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009
  • Llama.cpp: 32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009

There are two differences:

  • The first two tokens are swapped (those are prefix_id and bos I think)
  • Llama.cpp adds a bos token again after the suffix_id token and an additional 29871 (is this a space?)

I believe the latter is definitely wrong, as the paper states on page 4:

To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix

@ggerganov ggerganov added bug Something isn't working high priority Very important issue labels Oct 6, 2023
@vvhg1
Copy link
Contributor

vvhg1 commented Oct 6, 2023

Could you please try this and see if the issue is fixed there? Would be great if you could also test server with infill.

PS for some reason I don't seem to be getting an additional space, or could it be a newline that is also deleted in the paper?

@kherud
Copy link
Contributor Author

kherud commented Oct 6, 2023

Here are the additional tests:

codellama:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009

vvhg1/llama.cpp infill.cpp:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 29871 13 1678 736 1121 13 32009

vvhg1/llama.cpp server.cpp:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 29871 13 1678 736 1121 13 32009

ggerganov/llama.cpp server.cpp:
32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009

Apart from the additional 29871 your fix seems correct. I think it is a space, since the prefix string ends in a space and this is the same token as the one before the suffix starts (32008).

@ggerganov
Copy link
Owner

Here are some sample results using tests/test-tokenizer-0-llama.py which tokenizes strings using the original Python tokenizer:

{ "\t"                    , {   29871,     12, }, },
{ "\n"                    , {   29871,     13, }, },
{ "\n return"             , {   29871,     13,    736, }, },
{ "\t\n"                  , {   29871,     12,     13, }, },
{ "Hello world"           , {   15043,   3186, }, },
{ " Hello world"          , {   29871,  15043,   3186, }, },
{ "Hello World"           , {   15043,   2787, }, },
{ " Hello World"          , {   29871,  15043,   2787, }, },
{ " Hello World!"         , {   29871,  15043,   2787,  29991, }, },
{ "Hello, world!"         , {   15043,  29892,   3186,  29991, }, },
{ " Hello, world!"        , {   29871,  15043,  29892,   3186,  29991, }, },

It looks like codellama requires some extra logic for removing the leading spaces when they occur.
This logic should be implemented as part of the infill example

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 6, 2023

aggreed, I'm on it, I guess we limit the cleaning to leading spaces as it seems newlines are not removed in the original?

@kherud
Copy link
Contributor Author

kherud commented Oct 6, 2023

I think the additional space gets introduced by the llama.cpp tokenizer, a quick look suggests those lines are responsible:

llama.cpp/llama.cpp

Lines 5220 to 5221 in 9ca79d5

// without adding this leading whitespace, we do not get the same results as the original tokenizer
raw_text = " " + raw_text;

The code probably relies on the assumption that a prompt always consists of a single part, which doesn't hold here.

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 6, 2023

for some reason I get different tokens for newlines:
'':1, ' <PRE>':32007, ' def':822, ' remove':3349, '_':29918, 'non':5464, '_':29918, 'as':294, 'cii':18869, '(':29898, 's':29879, ':':29901, ' str':851, ')':29897, ' ->':1599, ' str':851, ':\':3583, 'n':29876, ' """':9995, ' ':29871, ' <SUF>':32008, ' \':320, 'n':29876, ' return':736, ' result':1121, '\':29905, 'n':29876, ' <MID>':32009
exact command:

./infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix 'def remove_non_ascii(s: str) -> str:\n """ ' --in-suffix '\n return result\n'

I only see the 29871 when I add leading whitespace to the suffix (which we want to clean in any case I guess)

@kherud
Copy link
Contributor Author

kherud commented Oct 6, 2023

I used CodeLlama-7b-Instruct. Maybe that makes the difference?

@ggerganov
Copy link
Owner

@vvhg1 You are missing -e

@kherud Without this code, we don't get the same output as the Python tokenizer. The info in my comment is generated with the Python tokenizer, so it is ground truth. Any extra handling of the tokens should happen outside the tokenizer implementation

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 6, 2023

thx, the -e now gives me identical results

what I do now is:

  • remove leading spaces from suffix input
  • if -e then remove leading space token from tokenized suffix
    All done in the infill case so it doesn't touch anything else
    Please test, hope it works as intended now...

@kherud
Copy link
Contributor Author

kherud commented Oct 6, 2023

I think cleaning any leading spaces would be wrong, since they might legitimately be part of the suffix.

Codellama uses this little method:

def encode_infilling(self, s: str) -> List[int]:
    """Encode a string without an implicit leading space."""
    return self.sp_model.encode("☺" + s)[2:]

With an explanation here:

We could use any prefix to add under the condition that its last token does not merge to the right with any other token. In our vocabulary, the emoji achieves that. Obviously this does not extend to other tokenizers, and the clean solution would be to add a switch to SentencePiece whether to suppress the implicit leading space upon a single encoding (there is a global switch but not a per-prompt one).

The issue also points to the solution of the transformers library, I think is the relevant part:

https://github.com/huggingface/transformers/blob/5af2c6269672cda01c24ad48fab13f14a3ffb746/src/transformers/models/code_llama/tokenization_code_llama.py#L287C14-L301

I'm not too familiar with the tokenization, but might there be situations, where " " + raw_text gives undesired results, depending on whether the space merges with the first characters of raw_text to a different token? If there is always just a single 29871 added, it should suffice to simply remove this individual token, right?

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 6, 2023

Yes, I think the topic is slightly more complicated than I initially thought.

What we probably need to do is to have a look first how we encode because only the -e adds a space (at least it seems to me like that rn). Then in case of -e we either cut one space at the beginning of the string if it starts with a space (that will then get added again by the tokenizer) or if there is none we remove the single space token after tokenization.

The rationale behind this is that multiple spaces are tokens in their own right, and I guess it would be more complicated (at least to maintain) to translate a n spaces token at the beginning of the string into a n-1 spaces token.

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 7, 2023

What I do now:
If params.escape (as only then the tokenizer prepends an extra space, in server seems to be the default behaviour and there is no option for it either)

case 1: and the suffix string starts with a space

  • we remove it
  • the tokenizer adds an extra space again giving us the original suffix. (I confirmed that the tokenizer first adds the space and then tokenizes so eg double space are not tokenized as 29871, 29871 but correctly as 259)
  • all done

case 2: no leading space in original suffix string

  • we tokenize the string, the tokenizer prepends a 29871
  • we remove the prepended 29871
  • all done

Please test thoroughly

@shibe2
Copy link
Contributor

shibe2 commented Oct 9, 2023

Since SPM started inserting space unconditionally (#2810), I have to use a hack similar to Code Llama's. I've found that in models that I use, newline (LF) character does not stick to any other tokens, and its identifier is easily queryable with llama_token_nl.

@ggerganov
Copy link
Owner

Seems to work now. One minor thing I noticed that I don't understand:

I ran the original test from this issue:

make -j && ./infill -t 4 -m models/codellama-7b/ggml-model-f16.gguf -c 4096 --temp 0.0 --repeat_penalty 1.1 --in-prefix 'def remove_non_ascii(s: str) -> str:\n """ ' --in-suffix '\n return result\n' --verbose-prompt -e -ngl 1

ggml_metal_add_buffer: allocated 'data            ' buffer, size = 12853.98 MB, (12854.61 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  2050.00 MB, (14904.61 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   288.02 MB, (15192.62 / 147456.00)

system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 

main: prompt: ''
main: number of tokens in prompt = 26
     1 -> ''
 32007 -> ' <PRE>'
   822 -> ' def'
  3349 -> ' remove'
 29918 -> '_'
  5464 -> 'non'
 29918 -> '_'
   294 -> 'as'
 18869 -> 'cii'
 29898 -> '('
 29879 -> 's'
 29901 -> ':'
   851 -> ' str'
 29897 -> ')'
  1599 -> ' ->'
   851 -> ' str'
 29901 -> ':'
    13 -> '
'
  9995 -> ' """'
 29871 -> ' '
 32008 -> ' <SUF>'
    13 -> '
'
   736 -> ' return'
  1121 -> ' result'
    13 -> '
'
 32009 -> ' <MID>'

sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0



#####  Infill mode  #####

 <PRE> def remove_non_ascii(s: str) -> str:
 """  <SUF>
 return result
 <MID>Remove non-ASCII characters from a string.

 Args:
     s (str): The input string.

 Returns:
     str: The output string.
 """
 import re
 pattern = re.compile(r'[^\x00-\x7F]+', re.UNICODE)
 result = pattern.sub('', s) <EOT> <EOT>
llama_print_timings:        load time =     720.32 ms
llama_print_timings:      sample time =      50.62 ms /    81 runs   (    0.62 ms per token,  1600.19 tokens per second)
llama_print_timings: prompt eval time =      54.77 ms /    26 tokens (    2.11 ms per token,   474.72 tokens per second)
llama_print_timings:        eval time =    1935.91 ms /    80 runs   (   24.20 ms per token,    41.32 tokens per second)
llama_print_timings:       total time =    2089.48 ms

I don't have the token 1678 in the suffix tokenization:

32008 13 736 1121 13

In contrast, OP gets:

32008 13 1678 736 1121 13

So not sure why you guys got the extra 1678 there in your tests.

@vvhg1
Copy link
Contributor

vvhg1 commented Oct 10, 2023

I think it should be ok since it is the same behaviour here and in Python.
1678 is three spaces btw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority Very important issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants