
Vicuna template may cause the tokenization mismatch warning #2494

Closed
Kong-Aobo opened this issue Sep 29, 2023 · 4 comments

Comments

@Kong-Aobo

When I fine-tune Llama-2-13B, tokenization mismatch warnings occur as follows:

WARNING: tokenization mismatch: 903 vs. 902. (ignored)
WARNING: tokenization mismatch: 890 vs. 888. (ignored)
WARNING: tokenization mismatch: 542 vs. 541. (ignored)
WARNING: tokenization mismatch: 275 vs. 274. (ignored)
WARNING: tokenization mismatch: 1264 vs. 1262. (ignored)
WARNING: tokenization mismatch: 286 vs. 277. (ignored)
WARNING: tokenization mismatch: 352 vs. 349. (ignored)
WARNING: tokenization mismatch: 789 vs. 786. (ignored)
WARNING: tokenization mismatch: 281 vs. 279. (ignored)
...
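For context, this warning comes from the masking loop in FastChat's fastchat/train/train.py, which re-tokenizes each turn to decide which tokens to mask out of the loss. A condensed sketch of that loop (paraphrased, so details may differ from the exact code in your version of the repo):

```python
import torch

IGNORE_TOKEN_ID = -100  # loss is not computed on these positions

def mask_targets(conversation: str, target: torch.Tensor, tokenizer, conv):
    """Mask everything except the assistant replies, turn by turn."""
    sep = conv.sep + conv.roles[1] + ": "              # " ASSISTANT: "
    total_len = int(target.ne(tokenizer.pad_token_id).sum())
    cur_len = 1                                        # skip the <s> token
    target[:cur_len] = IGNORE_TOKEN_ID
    for turn in conversation.split(conv.sep2):         # split on "</s>"
        if turn == "":
            break
        turn_len = len(tokenizer(turn).input_ids)
        parts = turn.split(sep)
        parts[0] += sep
        # Mask the prompt part, up to and including "ASSISTANT: ".
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
        cur_len += turn_len
    target[cur_len:] = IGNORE_TOKEN_ID
    # The warning fires when the per-turn token counts do not add up to the
    # length of the tokenization of the full conversation.
    if cur_len < tokenizer.model_max_length and cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID                    # drop the whole example
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
```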

This problem occurs with all multi-round dialogue data, but single-round dialogues are fine. I printed the related results to debug:

# Take a two-round dialogue as an example. The first run of <unk> tokens is the system prompt, the second is the user prompt.
# Note the second round: its first token, "It", was masked by mistake.
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
 The increased pressure is 
...
 The pressure difference is two times the Laplace pressure $2\gamma/R$.</s>
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk> is true, that the ideas employed here are the same that would explain why the pressure inside a balloon is higher than outside. In orde... Consequently, the fact that your have to blow quite a bit means that there is quite some pressure inside the balloon.</s>

It seems the error occurs when encoding "ASSISTANT".
So I changed the role in the Vicuna template from "ASSISTANT" to "AI" or "|ASSISTANT|", and the error disappeared.
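You can see the difference directly with the slow SentencePiece tokenizer (the checkpoint path below is just an example; any Llama-2 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

# Example checkpoint path; substitute your local Llama-2 weights.
tok = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-13b-hf", use_fast=False, legacy=False
)

# In isolation, SentencePiece prepends its dummy-space prefix to the
# first piece of "ASSISTANT".
print(tok.tokenize("ASSISTANT: It is true"))

# In a multi-round conversation the same text follows "</s>". With
# legacy=False it is tokenized without the dummy space, so the turn
# lengths computed from the per-turn tokenization no longer line up.
print(tok.tokenize("</s>ASSISTANT: It is true"))
```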

Didn't you encounter this problem when fine-tuning Llama-2?

@minglii1998

I'm facing the exact same issue. There are so many tokenization mismatches when fine-tuning Llama-2-13B with multi-round data that I am not even sure whether I should continue the training.

@Kong-Aobo Could you please tell me how you solved it? And does it really affect training?

I would also really like to know whether the owner of this repo has ever encountered this problem.

@Kong-Aobo
Author

@minglii1998 The cause of the problem is described above, though I don't know why it happens. For now, you can try modifying Vicuna's template, changing the role "ASSISTANT" to "|ASSISTANT|" or some other string; the mismatch warning will then disappear. I hope the owner of the repo sees this issue. Maybe they have encountered this situation too.
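The template lives in fastchat/conversation.py; the workaround is a one-word edit to the roles tuple. An abridged sketch of the registration (field names follow the FastChat version current at the time, so check your copy of the file):

```python
# In fastchat/conversation.py -- the Vicuna v1.1 template (abridged).
register_conv_template(
    Conversation(
        name="vicuna_v1.1",
        system_message="A chat between a curious user and an artificial "
        "intelligence assistant. The assistant gives helpful, detailed, "
        "and polite answers to the user's questions.",
        roles=("USER", "AI"),  # was ("USER", "ASSISTANT")
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
```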

@Kong-Aobo
Author

The tokenization error caused by the Vicuna template is fixed by #2498. However, the fix seems to work only with the current template: if you change the role "ASSISTANT" to "|ASSISTANT|", it fails again.

Obviously, the problem is caused by tokenizer.legacy. The default tokenizer.legacy for Llama-2 is False. I tried changing it to True, and all the errors disappeared. But I don't know whether we should change the default setting.
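You can check the effect of the flag directly (the checkpoint path is again just an example):

```python
from transformers import AutoTokenizer

# Example checkpoint path; any Llama-2 tokenizer shows the same difference.
for legacy in (False, True):
    tok = AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-13b-hf", use_fast=False, legacy=legacy
    )
    # With legacy=True, the text after the special token "</s>" gets the
    # dummy-space prefix again, matching the per-turn tokenization, so
    # FastChat's length bookkeeping adds up and the warnings disappear.
    print(legacy, tok.tokenize("</s>ASSISTANT: It is true"))
```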

@varung

varung commented Oct 25, 2023

This is closed, but what was the resolution? Should we change the template, or just pull the latest code?
