Vicuna template may cause the tokenization mismatch warning #2494
Facing the exact same issue. There were so many tokenization mismatches when I tried to fine-tune Llama-2-13B on multi-round data that I am not even sure whether I should continue training. @Kong-Aobo Could you please tell me how you solved it? Does it actually affect training? I would also really like to know whether the owner of this repo has ever encountered this problem.
@minglii1998 The cause of the problem can be seen in my description above, though I don't know why it happens. For now, you can try modifying Vicuna's template and changing the role "ASSISTANT" to "|ASSISTANT|" or some other string; the mismatch warning will then disappear. I hope the owner of the repo sees this issue. Maybe they have encountered this situation too.
The tokenization error caused by the Vicuna template is solved by #2498. However, it seems to work only with the current template: if you change the role "ASSISTANT" to "|ASSISTANT|", it fails again. Obviously, the problem is caused by tokenizer.legacy. The default tokenizer.legacy for Llama-2 is False; when I changed it to True, all the errors disappeared. But I don't know whether we should change the default setting.
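The effect described above can be illustrated without a real model. Below is a toy greedy tokenizer (the vocabulary and function names are hypothetical, purely for illustration) that mimics the SentencePiece convention of marking word boundaries with "▁". When the dummy boundary prefix is not added to a re-encoded chunk (roughly what happens after special tokens with legacy=False), "ASSISTANT" falls back to a longer split, so per-round token counts no longer add up to the full conversation's length — the same kind of drift the mismatch warning reports.

```python
# Toy vocabulary (hypothetical, for illustration only). "▁ASSISTANT"
# exists as a single piece, but bare "ASSISTANT" does not, so it must
# be split -- mirroring how a SentencePiece model can encode the same
# word differently depending on whether a word boundary precedes it.
VOCAB = ["▁ASSISTANT", "▁USER", "▁hi", "▁hello", "ASS", "ISTANT", ":"]

def toy_encode(text, add_prefix=True):
    """Greedy longest-match tokenizer over the toy vocab. `add_prefix`
    plays the role of the dummy-prefix behavior: with legacy=True the
    boundary marker is effectively always added, with legacy=False it
    is skipped for text that follows a special token."""
    s = text.replace(" ", "▁")
    if add_prefix and not s.startswith("▁"):
        s = "▁" + s
    pieces = sorted(VOCAB, key=len, reverse=True)
    tokens = []
    while s:
        for piece in pieces:
            if s.startswith(piece):
                tokens.append(piece)
                s = s[len(piece):]
                break
        else:  # unknown character: emit it as its own token
            tokens.append(s[0])
            s = s[1:]
    return tokens

# Whole conversation, encoded once: 6 tokens.
full = toy_encode("USER: hi ASSISTANT: hello")
# The answer chunk re-encoded without the boundary prefix
# (legacy=False-like): "ASSISTANT" splits into "ASS" + "ISTANT",
# giving 4 tokens instead of the 3 it occupies inside `full`.
no_prefix = toy_encode("ASSISTANT: hello", add_prefix=False)
# With the prefix restored (legacy=True-like) the counts agree again.
with_prefix = toy_encode("ASSISTANT: hello", add_prefix=True)
```

This is only a sketch of the mechanism, not the actual Llama-2 tokenizer behavior, but it shows why flipping legacy can make the per-round length bookkeeping consistent again.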
This is closed, but what was the resolution? Should we change the template, or just pull the latest code?
When I fine-tune Llama-2-13B, the tokenization mismatch warning occurs as follows:
This problem occurs with all multi-round dialogue data, but single-round dialogue is fine. I printed the related results to debug:
It seems the error occurs when encoding "ASSISTANT".
So I changed the role in the Vicuna template from "ASSISTANT" to "AI" or "|ASSISTANT|", and the error disappeared.
Didn't you encounter this problem when fine-tuning Llama-2?
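For reference, the check that triggers this warning can be sketched in isolation. The sketch below uses hypothetical helper names and a simplified whitespace tokenizer as a stand-in for the real preprocessing (it is not FastChat's actual code): each round's instruction part is masked out of the labels, per-round token counts are accumulated, and a final length comparison flags the sample when the pieces don't add up to the full conversation's length.

```python
IGNORE_INDEX = -100  # label value excluded from the loss

def build_labels(encode, conversation, rounds, sep):
    """Mask instruction tokens round by round and sanity-check that
    the per-round token counts sum to the full conversation's length.
    `encode` maps a string to a list of tokens; `rounds` is the
    conversation split into dialogue rounds; `sep` marks where the
    assistant's answer begins. Returns (labels, mismatch)."""
    input_ids = encode(conversation)
    labels = list(input_ids)
    cur_len = 0
    for rou in rounds:
        parts = rou.split(sep)
        if len(parts) != 2:
            break
        instruction_len = len(encode(parts[0] + sep))
        # Everything up to the end of the separator is masked; only
        # the assistant's answer contributes to the loss.
        for i in range(cur_len, min(cur_len + instruction_len, len(labels))):
            labels[i] = IGNORE_INDEX
        cur_len += len(encode(rou))
    mismatch = cur_len != len(input_ids)
    if mismatch:
        # This is the "tokenization mismatch" condition: when it
        # fires, the whole sample is masked out of the loss.
        labels = [IGNORE_INDEX] * len(input_ids)
    return labels, mismatch

# With a plain whitespace tokenizer the counts line up, so only the
# instruction tokens are masked and no mismatch is reported:
enc = str.split
conv = "USER: hi ASSISTANT: hello USER: bye ASSISTANT: ok"
rounds = ["USER: hi ASSISTANT: hello", "USER: bye ASSISTANT: ok"]
labels, mismatch = build_labels(enc, conv, rounds, "ASSISTANT: ")
```

A tokenizer that encodes a re-split chunk to a different number of tokens than it occupied in context (as in the "ASSISTANT" case above) makes `cur_len` drift away from the full length, which is exactly when this warning fires and the sample is dropped from the loss.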