Missing tokenizer tests #3730

Closed
goerch opened this issue Oct 22, 2023 · 6 comments · Fixed by #3742
Labels: help wanted (Extra attention is needed), testing (Everything test related)

Comments

goerch (Collaborator) commented Oct 22, 2023

AFAIU we are missing tokenizer tests for supported models like

  • Baichuan
  • Bloom
  • GptNeoX
  • Persimmon
  • Refact
  • Starcoder

It would be great if anyone could help out.
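
For context, here is a minimal sketch (not from this issue) of how reference tokenizations for such a test could be produced with Hugging Face transformers; the model name and test strings below are illustrative assumptions, and the repo's actual test harness may differ:

from transformers import AutoTokenizer  # assumed dependency

# illustrative choices, not taken from the issue
MODEL = "EleutherAI/gpt-neox-20b"
TESTS = [
    "",
    " ",
    "Hello world",
    " Hello World!",
    "\n =",
    "'''''",
]

tok = AutoTokenizer.from_pretrained(MODEL)

for text in TESTS:
    ids = tok.encode(text, add_special_tokens=False)
    # these id sequences would serve as the reference output
    # that the llama.cpp tokenizer is expected to reproduce
    print(f"{text!r} -> {ids}")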

ggerganov added the help wanted and testing labels Oct 22, 2023
Galunid (Collaborator) commented Oct 22, 2023

If it's the same deal as what I did in the stablelm PR, I can do that tomorrow.

Galunid (Collaborator) commented Oct 22, 2023

Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

for i in range(vocab_size):
-    tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
-    scores.append(0.0) # dummy
-    toktypes.append(gguf.TokenType.NORMAL)
+    if i not in reverse_vocab:
+        tokens.append(f"[PAD{i}]")
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    elif reverse_vocab[i] in added_vocab:
+        # NOTE: wouldn't we like to distinguish CONTROL tokens here?
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    else:
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.NORMAL)

goerch (Collaborator, Author) commented Oct 22, 2023

> Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

Good idea, but not yet: let us first see which issues come up and then fix them accordingly.

Galunid mentioned this issue Oct 23, 2023
goerch (Collaborator, Author) commented Oct 23, 2023

@Galunid: I retested the conversion of mpt and tested the conversion of gpt-neox with the following code,

added_vocab = tokenizer.get_added_vocab()
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}

for i in range(vocab_size):
    if i not in reverse_vocab:
        tokens.append(f"[PAD{i}]")
        toktypes.append(gguf.TokenType.USER_DEFINED)
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        if tokenizer.added_tokens_decoder[i].special:
            toktypes.append(gguf.TokenType.CONTROL)
        else:
            toktypes.append(gguf.TokenType.USER_DEFINED)
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_types(toktypes)

which incorporates @jploski's explanation of how to detect special tokens. The tests still pass for mpt and now also for gpt-neox. So it is fine with me if you go ahead and fix the conversion scripts of the gpt2-tokenizer based models this way. Thanks for your help!
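
As a small illustration (not part of the original comment) of the special-token detection above: in recent transformers versions, tokenizer.added_tokens_decoder maps each added token id to an AddedToken whose special attribute separates control tokens from ordinary user-added tokens. The model name below is an assumed example:

from transformers import AutoTokenizer  # assumed dependency, recent version

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # assumed example model

# added_tokens_decoder: dict of token id -> AddedToken; AddedToken.special
# tells control tokens apart from ordinary user-defined additions
for tok_id, added in tok.added_tokens_decoder.items():
    kind = "CONTROL" if added.special else "USER_DEFINED"
    print(tok_id, repr(added.content), "->", kind)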

Galunid (Collaborator) commented Oct 23, 2023

Sure, to clarify, you mean to rework all the gpt2 conversion scripts and update the vocabs in the test PR?

goerch (Collaborator, Author) commented Oct 23, 2023

> Sure, to clarify, you mean to rework all the gpt2 conversion scripts and update the vocabs in the test PR?

As you like and as your time permits. I hope someone will object if this is the wrong way to go.
