
Vicuna template may cause the tokenization mismatch warning #2494

Closed
Kong-Aobo opened this issue Sep 29, 2023 · 4 comments

Comments

@Kong-Aobo

When I fine-tune Llama-2-13B, tokenization mismatch warnings occur as follows:

WARNING: tokenization mismatch: 903 vs. 902. (ignored)
WARNING: tokenization mismatch: 890 vs. 888. (ignored)
WARNING: tokenization mismatch: 542 vs. 541. (ignored)
WARNING: tokenization mismatch: 275 vs. 274. (ignored)
WARNING: tokenization mismatch: 1264 vs. 1262. (ignored)
WARNING: tokenization mismatch: 286 vs. 277. (ignored)
WARNING: tokenization mismatch: 352 vs. 349. (ignored)
WARNING: tokenization mismatch: 789 vs. 786. (ignored)
WARNING: tokenization mismatch: 281 vs. 279. (ignored)
...
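For context, this warning comes from the masking loop in FastChat's fastchat/train/train.py, which re-tokenizes each turn to decide which tokens to mask out of the loss. A condensed sketch of that loop (paraphrased, so details may differ from the exact code in your version of the repo):

```python
import torch

IGNORE_TOKEN_ID = -100  # loss is not computed on these positions

def mask_targets(conversation: str, target: torch.Tensor, tokenizer, conv):
    """Mask everything except the assistant replies, turn by turn."""
    sep = conv.sep + conv.roles[1] + ": "              # " ASSISTANT: "
    total_len = int(target.ne(tokenizer.pad_token_id).sum())
    cur_len = 1                                        # skip the <s> token
    target[:cur_len] = IGNORE_TOKEN_ID
    for turn in conversation.split(conv.sep2):         # split on "</s>"
        if turn == "":
            break
        turn_len = len(tokenizer(turn).input_ids)
        parts = turn.split(sep)
        parts[0] += sep
        # Mask the prompt part, up to and including "ASSISTANT: ".
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
        cur_len += turn_len
    target[cur_len:] = IGNORE_TOKEN_ID
    # The warning fires when the per-turn token counts do not add up to the
    # length of the tokenization of the full conversation.
    if cur_len < tokenizer.model_max_length and cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID                    # drop the whole example
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
```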

This problem occurs with all multi-round dialogue data, but single-round dialogues are fine. I printed the related results to debug:

# Take a two-round dialogue as an example. The first run of <unk> tokens is the system prompt, the second is the user prompt.
# Note the second round: its first token, "It", was masked by mistake.
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
 The increased pressure is 
...
 The pressure difference is two times the Laplace pressure $2\gamma/R$.</s>
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk> is true, that the ideas employed here are the same that would explain why the pressure inside a balloon is higher than outside. In orde... Consequently, the fact that your have to blow quite a bit means that there is quite some pressure inside the balloon.</s>

It seems the error occurs when encoding "ASSISTANT".
So I changed the role in the Vicuna template from "ASSISTANT" to "AI" or "|ASSISTANT|", and the error disappeared.
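You can see the difference directly with the slow SentencePiece tokenizer (the checkpoint path below is just an example; any Llama-2 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

# Example checkpoint path; substitute your local Llama-2 weights.
tok = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-13b-hf", use_fast=False, legacy=False
)

# In isolation, SentencePiece prepends its dummy-space prefix to the
# first piece of "ASSISTANT".
print(tok.tokenize("ASSISTANT: It is true"))

# In a multi-round conversation the same text follows "</s>". With
# legacy=False it is tokenized without the dummy space, so the turn
# lengths computed from the per-turn tokenization no longer line up.
print(tok.tokenize("</s>ASSISTANT: It is true"))
```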

Didn't you encounter this problem when fine-tuning Llama-2?

@minglii1998

I'm facing the exact same issue. There are so many tokenization mismatches when fine-tuning Llama-2-13B with multi-round data that I am not even sure whether I should continue the training.

@Kong-Aobo Could you please tell me how you solved it? And does it really affect training?

I would also really like to know whether the owner of this repo has ever encountered this problem.

@Kong-Aobo
Author

@minglii1998 The cause of the problem is described above, though I don't know why it happens. For now, you can try modifying Vicuna's template, changing the role "ASSISTANT" to "|ASSISTANT|" or some other string; the mismatch warning will then disappear. I hope the owner of the repo sees this issue. Maybe they have encountered this situation too.
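The template lives in fastchat/conversation.py; the workaround is a one-word edit to the roles tuple. An abridged sketch of the registration (field names follow the FastChat version current at the time, so check your copy of the file):

```python
# In fastchat/conversation.py -- the Vicuna v1.1 template (abridged).
register_conv_template(
    Conversation(
        name="vicuna_v1.1",
        system_message="A chat between a curious user and an artificial "
        "intelligence assistant. The assistant gives helpful, detailed, "
        "and polite answers to the user's questions.",
        roles=("USER", "AI"),  # was ("USER", "ASSISTANT")
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
```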

@Kong-Aobo
Author

The tokenization error caused by the Vicuna template is fixed by #2498. However, the fix seems to work only with the current template: if you change the role "ASSISTANT" to "|ASSISTANT|", it fails again.

Obviously, the problem is caused by tokenizer.legacy. The default tokenizer.legacy for Llama-2 is False. I tried changing it to True, and all the errors disappeared. But I don't know whether we should change the default setting.
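You can check the effect of the flag directly (the checkpoint path is again just an example):

```python
from transformers import AutoTokenizer

# Example checkpoint path; any Llama-2 tokenizer shows the same difference.
for legacy in (False, True):
    tok = AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-13b-hf", use_fast=False, legacy=legacy
    )
    # With legacy=True, the text after the special token "</s>" gets the
    # dummy-space prefix again, matching the per-turn tokenization, so
    # FastChat's length bookkeeping adds up and the warnings disappear.
    print(legacy, tok.tokenize("</s>ASSISTANT: It is true"))
```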

@varung

varung commented Oct 25, 2023

This is closed, but what was the resolution? Should we change the template, or just pull the latest code?
