Missing tokenizer tests #3730

Closed
goerch opened this issue Oct 22, 2023 · 6 comments · Fixed by #3742
Labels: help wanted (Extra attention is needed), testing (Everything test related)

Comments

goerch (Collaborator) commented Oct 22, 2023

AFAIU we are missing tokenizer tests for supported models like

  • Baichuan
  • Bloom
  • GptNeoX
  • Persimmon
  • Refact
  • Starcoder

It would be great if anyone could help out.
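
For context, here is a minimal sketch (not from this issue) of how reference tokenizations for such a test could be produced with Hugging Face transformers; the model name and test strings below are illustrative assumptions, and the repo's actual test harness may differ:

from transformers import AutoTokenizer  # assumed dependency

# illustrative choices, not taken from the issue
MODEL = "EleutherAI/gpt-neox-20b"
TESTS = [
    "",
    " ",
    "Hello world",
    " Hello World!",
    "\n =",
    "'''''",
]

tok = AutoTokenizer.from_pretrained(MODEL)

for text in TESTS:
    ids = tok.encode(text, add_special_tokens=False)
    # these id sequences would serve as the reference output
    # that the llama.cpp tokenizer is expected to reproduce
    print(f"{text!r} -> {ids}")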

ggerganov added the help wanted and testing labels Oct 22, 2023
Galunid (Collaborator) commented Oct 22, 2023

If it's the same deal as what I did in the stablelm PR, I can do that tomorrow.

Galunid (Collaborator) commented Oct 22, 2023

Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

for i in range(vocab_size):
-    tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
-    scores.append(0.0) # dummy
-    toktypes.append(gguf.TokenType.NORMAL)
+    if i not in reverse_vocab:
+        tokens.append(f"[PAD{i}]")
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    elif reverse_vocab[i] in added_vocab:
+        # NOTE: wouldn't we like to distinguish CONTROL tokens here?
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    else:
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.NORMAL)

goerch (Collaborator, Author) commented Oct 22, 2023

> Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

Good idea, but not yet: let us first see which issues come up and then fix them accordingly.

Galunid mentioned this issue Oct 23, 2023
goerch (Collaborator, Author) commented Oct 23, 2023

@Galunid: I retested the conversion of mpt and tested the conversion of gpt-neox with the following code,

added_vocab = tokenizer.get_added_vocab()
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}

for i in range(vocab_size):
    if i not in reverse_vocab:
        tokens.append(f"[PAD{i}]")
        toktypes.append(gguf.TokenType.USER_DEFINED)
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        if tokenizer.added_tokens_decoder[i].special:
            toktypes.append(gguf.TokenType.CONTROL)
        else:
            toktypes.append(gguf.TokenType.USER_DEFINED)
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_types(toktypes)

which incorporates @jploski's explanation of how to detect special tokens. The tests still pass for mpt and now also for gpt-neox. So it is fine with me if you go ahead and fix the conversion scripts of the gpt2-tokenizer based models this way. Thanks for your help!
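
As a small illustration (not part of the original comment) of the special-token detection above: in recent transformers versions, tokenizer.added_tokens_decoder maps each added token id to an AddedToken whose special attribute separates control tokens from ordinary user-added tokens. The model name below is an assumed example:

from transformers import AutoTokenizer  # assumed dependency, recent version

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # assumed example model

# added_tokens_decoder: dict of token id -> AddedToken; AddedToken.special
# tells control tokens apart from ordinary user-defined additions
for tok_id, added in tok.added_tokens_decoder.items():
    kind = "CONTROL" if added.special else "USER_DEFINED"
    print(tok_id, repr(added.content), "->", kind)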

Galunid (Collaborator) commented Oct 23, 2023

Sure, to clarify, you mean to rework all the gpt2 conversion scripts and update the vocabs in the test PR?

goerch (Collaborator, Author) commented Oct 23, 2023

> Sure, to clarify, you mean to rework all the gpt2 conversion scripts and update the vocabs in the test PR?

As you like and as your time permits. I hope someone will object if this is the wrong way to go.
