Tokenization problem, unexpected <unk> tokens

### System Info

Cargo version: 1.70.0
Commit sha: 211b54ac41cae9a369f3d74bd6cc666ff4a0c526
Docker label: N/A
nvidia-smi:  Driver Version: 525.105.17   CUDA Version: 12.0 

### Information

- [ ] Docker
- [X] The CLI directly

### Tasks

- [X] An officially supported command
- [ ] My own modifications

### Reproduction

1. Launch TGI: `text-generation-launcher --model-id OpenAssistant/llama2-70b-oasst-sft-v10 -p 8080 --quantize bitsandbytes-nf4`
2. When the server is ready, query it with: `curl localhost:8080/generate -X POST -d $'{\"inputs\":\"user\\nHello assistant!<|im_end|>\\n<|im_start|>assistant\\n\",\"parameters\":{\"max_new_tokens\":10, \"do_sample\": true, \"decoder_input_details\": true}}' -H 'Content-Type: application/json'`
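
For scripting the reproduction, the curl call in step 2 can also be written in Python. This is only a minimal sketch using the standard library: the helper names `build_payload` and `post_generate` are hypothetical, the prompt is copied from the curl command after shell unescaping, and the URL assumes the launcher from step 1 is listening on localhost port 8080.

```python
import json
import urllib.request

# Prompt copied from the curl command in step 2 (after shell unescaping).
PROMPT = "user\nHello assistant!<|im_end|>\n<|im_start|>assistant\n"


def build_payload(prompt: str, max_new_tokens: int = 10) -> dict:
    """Build the same JSON body that the curl command sends."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "decoder_input_details": True,
        },
    }


def post_generate(payload: dict, url: str = "http://localhost:8080/generate") -> dict:
    """POST the payload to /generate and decode the JSON reply (assumed URL)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `post_generate(build_payload(PROMPT))["details"]["prefill"]` should give the prefill token list discussed in this report.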

Problem: the end of the input sequence, `<|im_start|>assistant\n`, is incorrectly tokenized. The word `assistant` becomes `<unk>`:

```json
{"id":32005,"text":"<|im_start|>"},
{"id":0,"text":"<unk>"},
{"id":13,"text":"<0x0A>"}
```


Full returned output (formatted for clarity; the generated text should not start with "assistant"):
```json
{
    "generated_text": "assistant\nHello! How can I help you",
    "details": {
        "finish_reason": "length",
        "generated_tokens": 10,
        "seed": 1809714102295104059,
        "prefill": [
            {
                "id": 32005,
                "text": "<|im_start|>",
                "logprob": null
            },
            {
                "id": 1404,
                "text": "user",
                "logprob": -9.8984375
            },
            {
                "id": 13,
                "text": "<0x0A>",
                "logprob": -3.8105469
            },
            {
                "id": 10994,
                "text": "Hello",
                "logprob": -13.703125
            },
            {
                "id": 20255,
                "text": "assistant",
                "logprob": -4.8320312
            },
            {
                "id": 29991,
                "text": "!",
                "logprob": -1.2890625
            },
            {
                "id": 32006,
                "text": "<|im_end|>",
                "logprob": -6.9804688
            },
            {
                "id": 32005,
                "text": "<|im_start|>",
                "logprob": -11.6171875
            },
            {
                "id": 0,
                "text": "<unk>",
                "logprob": -26.5
            },
            {
                "id": 13,
                "text": "<0x0A>",
                "logprob": 0.0
            }
        ],
        "tokens": [
            {
                "id": 32005,
                "text": " <|im_start|>",
                "logprob": 0.0,
                "special": true
            },
            {
                "id": 20255,
                "text": " assistant",
                "logprob": 0.0,
                "special": false
            },
            {
                "id": 13,
                "text": "\n",
                "logprob": 0.0,
                "special": false
            },
            {
                "id": 10994,
                "text": "Hello",
                "logprob": -0.5229492,
                "special": false
            },
            {
                "id": 29991,
                "text": "!",
                "logprob": -0.23022461,
                "special": false
            },
            {
                "id": 1128,
                "text": " How",
                "logprob": -0.25,
                "special": false
            },
            {
                "id": 508,
                "text": " can",
                "logprob": -0.09472656,
                "special": false
            },
            {
                "id": 306,
                "text": " I",
                "logprob": -0.0018663406,
                "special": false
            },
            {
                "id": 1371,
                "text": " help",
                "logprob": -0.4296875,
                "special": false
            },
            {
                "id": 366,
                "text": " you",
                "logprob": -0.042816162,
                "special": false
            }
        ]
    }
}
```
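
To spot the problem without reading the whole response by eye, the prefill list can be scanned for the unknown-token id. The following is a rough sketch, not part of TGI: `find_unknown_tokens` is a hypothetical helper, and it assumes `<unk>` has id 0, as in the output above. The sample data is an abbreviated copy of the prefill shown above with logprobs omitted.

```python
def find_unknown_tokens(response: dict, unk_id: int = 0) -> list:
    """Return (position, token) pairs for prefill entries whose id is unk_id."""
    prefill = response.get("details", {}).get("prefill", [])
    return [(i, tok) for i, tok in enumerate(prefill) if tok["id"] == unk_id]


# Abbreviated copy of the prefill from the response above (logprobs omitted).
sample = {
    "details": {
        "prefill": [
            {"id": 32005, "text": "<|im_start|>"},
            {"id": 1404, "text": "user"},
            {"id": 13, "text": "<0x0A>"},
            {"id": 10994, "text": "Hello"},
            {"id": 20255, "text": "assistant"},
            {"id": 29991, "text": "!"},
            {"id": 32006, "text": "<|im_end|>"},
            {"id": 32005, "text": "<|im_start|>"},
            {"id": 0, "text": "<unk>"},
            {"id": 13, "text": "<0x0A>"},
        ]
    }
}

for position, token in find_unknown_tokens(sample):
    print(f"prefill[{position}] is {token['text']} (id {token['id']})")
```

For the response above this flags a single entry, at prefill position 8, directly after the second `<|im_start|>`.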

### Expected behavior

The input text `<|im_start|>assistant\n` should be tokenized and processed as:

```json
{"id":32005,"text":"<|im_start|>"},
{"id":20255,"text":"assistant"},
{"id":13,"text":"<0x0A>"}
```
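
The gap between expected and observed tokenization can be stated as a small check. This is only an illustrative sketch: `first_mismatch` is a hypothetical helper, and the two lists are copied from the expected and observed token triples in this report.

```python
from itertools import zip_longest


def first_mismatch(expected: list, observed: list):
    """Return (index, expected_item, observed_item) for the first difference, or None."""
    for index, (exp, obs) in enumerate(zip_longest(expected, observed)):
        if exp != obs:
            return index, exp, obs
    return None


# (id, text) pairs taken from this report.
expected = [(32005, "<|im_start|>"), (20255, "assistant"), (13, "<0x0A>")]
observed = [(32005, "<|im_start|>"), (0, "<unk>"), (13, "<0x0A>")]

print(first_mismatch(expected, observed))
```

Here the first difference is at index 1, where id 20255 (`assistant`) is expected but id 0 (`<unk>`) is returned.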
 
