
Reuse EOS token for EOD to optimize vocabulary size and training efficiency #73

Merged: 2 commits merged into main from torsten/reuse-eos-token on Nov 28, 2024

Conversation

@tscholak (Collaborator) commented Nov 28, 2024

✨ Description

Fast-LLM previously hard-coded its own EOD token as <|endoftext|>. This approach often resulted in the tokenizer gaining an additional special token, unnecessarily increasing its vocabulary size. This caused several issues:

  • Tokenizers typically already include an end-of-sequence (EOS) token (most often </s>), which can be reused for this purpose.
  • Adding one extra token to the vocabulary increases the total size, which has implications:
    • An additional embedding for <|endoftext|> must be handled in the training configuration.
    • For optimal performance, vocabulary sizes are often aligned to multiples of 8. While not strictly required, this alignment ensures efficient memory usage and faster training on modern hardware.

This PR eliminates the addition of a special <|endoftext|> token and instead reuses the tokenizer's existing EOS token for the same functionality.
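
For illustration, here is a minimal sketch of the before/after behavior, assuming a Hugging Face `transformers` tokenizer. The checkpoint name is a placeholder, and this is not the actual Fast-LLM code:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any tokenizer that defines an EOS token works the same way.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-model")

# Old approach: register a dedicated EOD token, which grows the vocabulary by one
# and forces an extra row in the embedding matrix.
# tokenizer.add_special_tokens({"additional_special_tokens": ["<|endoftext|>"]})

# New approach: reuse the EOS token the tokenizer already defines (often "</s>").
eod_token = tokenizer.eos_token
eod_token_id = tokenizer.eos_token_id
print(eod_token, eod_token_id, len(tokenizer))  # vocabulary size is unchanged

# Alignment note from above: vocabularies are often padded up to the next multiple of 8.
padded_vocab_size = -(-len(tokenizer) // 8) * 8
```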

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  1. Reuse the existing EOS token for EOD functionality.
  2. Tokenize documents as f"{self.tokenizer.bos_token}{text}{self.tokenizer.eos_token}".
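
As a rough sketch of how change 2 looks in practice, assuming a Hugging Face-style tokenizer that defines both `bos_token` and `eos_token` (the helper name `tokenize_document` is illustrative, not Fast-LLM's actual API):

```python
def tokenize_document(tokenizer, text: str) -> list[int]:
    # Wrap the raw text with the tokenizer's existing BOS/EOS tokens
    # instead of appending a custom <|endoftext|> token.
    framed = f"{tokenizer.bos_token}{text}{tokenizer.eos_token}"
    # The special tokens are already spelled out in the string,
    # so automatic insertion is disabled to avoid duplicating them.
    return tokenizer.encode(framed, add_special_tokens=False)

# Example: the last token id of every document is then the EOS/EOD id.
# ids = tokenize_document(tokenizer, "some document text")
# assert ids[-1] == tokenizer.eos_token_id
```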

✅ Checklist

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

Training speed improved markedly. Tests show that removing the additional token increases training throughput from 21,000 tokens/s/GPU to 26,000 tokens/s/GPU, roughly a 24% improvement.


🗒️ Additional Notes

None.

@tscholak requested a review from @RaymondLi0 on November 28, 2024 at 18:45
@RaymondLi0 (Contributor) left a comment:


LGTM, thanks!

@tscholak merged commit 5e6de1a into main on Nov 28, 2024; 4 checks passed.
@tscholak deleted the torsten/reuse-eos-token branch on November 28, 2024 at 19:07.