Reuse EOS token for EOD to optimize vocabulary size and training efficiency #73
✨ Description
Fast-LLM previously hard-coded its own EOD token as `<|endoftext|>`. This approach often resulted in the tokenizer gaining an additional special token, unnecessarily increasing its vocabulary size. This caused several issues:

- Most tokenizers already define an EOS token (e.g. `</s>`), which can be reused for this purpose.
- The extra `<|endoftext|>` token must be handled in the training configuration.

This PR eliminates the addition of a special `<|endoftext|>` token and instead reuses the tokenizer's existing EOS token for the same functionality.
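As a minimal sketch of the idea (using the Hugging Face `transformers` API, not the actual Fast-LLM code; the model name is an arbitrary example), the tokenizer's existing EOS token can act as the EOD marker without changing the vocabulary:

```python
from transformers import AutoTokenizer

# Arbitrary example tokenizer; any tokenizer that defines an EOS token works.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Previous approach (sketch): register <|endoftext|> as an extra special token,
# which grows the vocabulary and must then be accounted for in the training config.
# tokenizer.add_special_tokens({"additional_special_tokens": ["<|endoftext|>"]})

# New approach (sketch): reuse the EOS token the tokenizer already defines.
eod_token = tokenizer.eos_token        # "</s>" for this tokenizer
eod_token_id = tokenizer.eos_token_id  # vocabulary size is unchanged
print(eod_token, eod_token_id, len(tokenizer))
```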
🔍 Type of change
Select all that apply:
📝 Changes
- Reuse the tokenizer's `EOS` token for EOD functionality.
- Tokenize documents as `f"{self.tokenizer.bos_token}{text}{self.tokenizer.eos_token}"` (see the sketch below).
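The following is an illustrative snippet of that tokenization pattern, built on Hugging Face `transformers` rather than the Fast-LLM code itself; the model name and the `tokenize_document` helper are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model

def tokenize_document(text: str) -> list[int]:
    # Wrap the document with the tokenizer's own BOS/EOS tokens, so EOS doubles
    # as the end-of-document marker; add_special_tokens=False avoids duplicating them.
    wrapped = f"{tokenizer.bos_token}{text}{tokenizer.eos_token}"
    return tokenizer.encode(wrapped, add_special_tokens=False)

ids = tokenize_document("Hello world.")
print(ids[-1] == tokenizer.eos_token_id)  # the document ends with the EOS/EOD id
```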
✅ Checklist
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
Training speed has improved significantly. Tests show that removing the additional token increases training throughput from 21,000 tokens/s/GPU to 26,000 tokens/s/GPU.
🗒️ Additional Notes
None.