Validate tokenizer and model alignment before training #2074
Conversation
torchtitan/models/utils.py
Outdated
if hasattr(model_args, "vocab_size"):
    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size
    if tokenizer_vocab_size != model_vocab_size:
Thanks for the contribution! We also noticed this, and we actually do not require the tokenizer's vocab size to be the same as the model's vocab size. The tokenizer's vocab size defines the input space and is tied to the dataset, while the model's vocab size (the embedding layer size) defines the model's representation space.
In general, it's the user's responsibility to use the correct tokenizer and model embedding layer dimension.
Yes, the mismatch is intentional.
E.g. if you have several phases of training (pretraining -> finetuning), then during pretraining you don't need to use the tokenizer you will use for finetuning, but in modeling you need to create an embedding table large enough that it has the capacity to be trained later with custom finetuning tokenizers.
So the requirement here is model.vocab_size > tokenizer.vocab_size.
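To make that capacity argument concrete, here is a small hypothetical sketch (the sizes are made up and this is not code from the PR): the embedding table is provisioned larger than the pretraining tokenizer so that a bigger finetuning tokenizer can be swapped in later without resizing the model.

import torch.nn as nn

# Hypothetical vocab sizes for the two training phases.
pretrain_tokenizer_vocab = 100_000
finetune_tokenizer_vocab = 120_000

# Provision the embedding table for the larger, later tokenizer up front
# (small embedding dim just to keep the example lightweight).
model_vocab_size = 128_000
tok_embeddings = nn.Embedding(model_vocab_size, 64)

# Both phases fit into the same table, so checkpoints can be reused
# across phases without resizing the embedding layer.
assert model_vocab_size > pretrain_tokenizer_vocab
assert model_vocab_size > finetune_tokenizer_vocab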
Thanks for the review.
You're right, strict equality was definitely an oversight on my part.
(I was biased by my own workflow, where I generate HF configs based on the tokenizer's vocab size, but I realize now that shouldn't be enforced generally here.)
However, as @tianyu-l mentioned, I do think it's valuable to verify that model.vocab_size > tokenizer.vocab_size.
While the training loop would eventually crash on a mismatch, it typically results in a vague CUDA error, which is hard to debug. A proactive check here would provide a much more informative error message for users.
I've updated the PR to only enforce that the model has sufficient capacity for the tokenizer.
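For reference, a minimal sketch of what the relaxed check could look like. The function name and the logging are illustrative rather than the exact code in this PR; it only relies on the tokenizer.get_vocab_size() and model_args.vocab_size accessors that already appear in the diff above.

import logging

logger = logging.getLogger(__name__)


def validate_tokenizer_model_alignment(model_args, tokenizer) -> None:
    # Nothing to validate if no tokenizer is given or the model has no vocab_size.
    if tokenizer is None or not hasattr(model_args, "vocab_size"):
        return

    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size

    # The embedding table must be at least as large as the tokenizer's vocab;
    # extra rows are fine and act as reserved capacity for later tokenizers.
    if model_vocab_size < tokenizer_vocab_size:
        raise ValueError(
            f"model vocab_size ({model_vocab_size}) is smaller than the tokenizer "
            f"vocab size ({tokenizer_vocab_size}); token IDs would index out of "
            "range in the embedding layer."
        )

    if model_vocab_size > tokenizer_vocab_size:
        logger.info(
            "model vocab_size (%d) > tokenizer vocab size (%d); "
            "extra embedding rows are treated as reserved capacity.",
            model_vocab_size,
            tokenizer_vocab_size,
        )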
torchtitan/models/utils.py
Outdated
    )

# Validate eos_id
if hasattr(model_args, "eos_id"):
model_args shouldn't have the field eos_id?
I was initially concerned that the default value of eos_id in Qwen3ModelArgs and TransformerModelArgs might differ from the actual tokenizer.eos_id.
However, upon double-checking, it seems tokenizer.eos_id is effectively always used regardless of the model args. Since this validation logic seems redundant, I decided to remove it.
Quick question though: Do you happen to know the purpose of having eos_id in Qwen3ModelArgs and TransformerModelArgs in the first place?
I believe it's a mistake. It's never used. Please help remove.
Thanks for the clarification.
I have removed eos_id from both Qwen3ModelArgs and TransformerModelArgs as requested.
First of all, thank you for creating this amazing project. I discovered a potential improvement regarding tokenizer usage while working with the codebase and have implemented the changes below.
Summary
I have added a validation step to ensure the provided Tokenizer's configuration matches the model_args before training begins.
Motivation & Context
It appears that users can use a Hugging Face-style custom tokenizer by providing it as an asset. However, I noticed an issue when using predefined model args (such as Qwen3-8B).
While the training process itself proceeds without any immediate errors, a size mismatch shows up later when loading the model via transformers after HF conversion. This happens because the custom tokenizer's attributes do not align with the fixed model arguments used during training.
Changes
To prevent this silent failure, I implemented a check that verifies that the vocab_size and eos_id of the tokenizer provided as an asset match the current model_args.
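To make the failure mode concrete, here is a small self-contained illustration (the sizes below are made up) of what happens without such a check when the tokenizer can emit token IDs beyond the model's embedding table: on CPU the lookup fails with a readable IndexError, while on CUDA the same mistake typically surfaces much later as an opaque device-side assert.

import torch
import torch.nn as nn

# Hypothetical mismatch: the embedding table has fewer rows than the
# tokenizer can produce.
model_vocab_size = 32_000
out_of_range_token_id = 48_000

embedding = nn.Embedding(model_vocab_size, 64)
token_ids = torch.tensor([[1, 2, out_of_range_token_id]])  # last ID is out of range

# On CPU this raises IndexError immediately; on CUDA it usually shows up only
# as a device-side assert deep in the forward pass, far from the config mistake.
try:
    embedding(token_ids)
except IndexError as err:
    print(f"out-of-range token ID: {err}")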