Tokenization problem, unexpected <unk> tokens #938

Closed
@andreaskoepf

Description

System Info

Cargo version: 1.70.0
Commit sha: 211b54a
Docker label: N/A
nvidia-smi: Driver Version: 525.105.17 CUDA Version: 12.0

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Launch TGI: text-generation-launcher --model-id OpenAssistant/llama2-70b-oasst-sft-v10 -p 8080 --quantize bitsandbytes-nf4
  2. When ready, query with: curl localhost:8080/generate -X POST -d $'{\"inputs\":\"<|im_start|>user\\nHello assistant!<|im_end|>\\n<|im_start|>assistant\\n\",\"parameters\":{\"max_new_tokens\":10, \"do_sample\": true, \"decoder_input_details\": true}}' -H 'Content-Type: application/json'
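
The shell quoting in step 2 is fiddly; the same request can be sent with the Python standard library alone. A minimal sketch, assuming the server from step 1 is listening on localhost:8080:

import json
import urllib.request

# Same payload as the curl command in step 2.
payload = {
    "inputs": "<|im_start|>user\nHello assistant!<|im_end|>\n<|im_start|>assistant\n",
    "parameters": {"max_new_tokens": 10, "do_sample": True, "decoder_input_details": True},
}

req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.loads(resp.read()), indent=4))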

Problem: The end of the input sequence, <|im_start|>assistant\n, is incorrectly tokenized; the word assistant comes back as <unk> (id 0):

 {"id":32005,"text":"<|im_start|>"},
 {"id":0,"text":"<unk>"},
 {"id":13,"text":"<0x0A>"}

The full returned output is shown below (formatted for clarity; note that generated_text should not start with "assistant"):

{
    "generated_text": "assistant\nHello! How can I help you",
    "details": {
        "finish_reason": "length",
        "generated_tokens": 10,
        "seed": 1809714102295104059,
        "prefill": [
            {
                "id": 32005,
                "text": "<|im_start|>",
                "logprob": null
            },
            {
                "id": 1404,
                "text": "user",
                "logprob": -9.8984375
            },
            {
                "id": 13,
                "text": "<0x0A>",
                "logprob": -3.8105469
            },
            {
                "id": 10994,
                "text": "Hello",
                "logprob": -13.703125
            },
            {
                "id": 20255,
                "text": "assistant",
                "logprob": -4.8320312
            },
            {
                "id": 29991,
                "text": "!",
                "logprob": -1.2890625
            },
            {
                "id": 32006,
                "text": "<|im_end|>",
                "logprob": -6.9804688
            },
            {
                "id": 32005,
                "text": "<|im_start|>",
                "logprob": -11.6171875
            },
            {
                "id": 0,
                "text": "<unk>",
                "logprob": -26.5
            },
            {
                "id": 13,
                "text": "<0x0A>",
                "logprob": 0.0
            }
        ],
        "tokens": [
            {
                "id": 32005,
                "text": " <|im_start|>",
                "logprob": 0.0,
                "special": true
            },
            {
                "id": 20255,
                "text": " assistant",
                "logprob": 0.0,
                "special": false
            },
            {
                "id": 13,
                "text": "\n",
                "logprob": 0.0,
                "special": false
            },
            {
                "id": 10994,
                "text": "Hello",
                "logprob": -0.5229492,
                "special": false
            },
            {
                "id": 29991,
                "text": "!",
                "logprob": -0.23022461,
                "special": false
            },
            {
                "id": 1128,
                "text": " How",
                "logprob": -0.25,
                "special": false
            },
            {
                "id": 508,
                "text": " can",
                "logprob": -0.09472656,
                "special": false
            },
            {
                "id": 306,
                "text": " I",
                "logprob": -0.0018663406,
                "special": false
            },
            {
                "id": 1371,
                "text": " help",
                "logprob": -0.4296875,
                "special": false
            },
            {
                "id": 366,
                "text": " you",
                "logprob": -0.042816162,
                "special": false
            }
        ]
    }
}

Expected behavior

The end of the input text, <|im_start|>assistant\n, should be tokenized and processed as:

 {"id":32005,"text":"<|im_start|>"},
 {"id":20255,"text":"assistant"},
 {"id":13,"text":"<0x0A>"}
