awesome_tokenizers

A curated list of tokenizer libraries for blazing-fast NLP processing.

Entries marked with the 🔥 symbol are tokenizers that are significantly faster than the alternatives.

🔹 WordPiece Tokenizer Implementations

  • 🔥 FlashTokenizer (C++/Python)
    • Bills itself as the world's fastest CPU tokenizer library for BERT-style WordPiece tokenization; a generic WordPiece usage sketch follows this list.
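
To illustrate what a WordPiece workflow looks like, here is a minimal sketch. It does not use FlashTokenizer's own API; instead it uses the Hugging Face tokenizers library listed in the BPE section below, and the training file `corpus.txt` is a placeholder path.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an untrained WordPiece tokenizer with an explicit unknown token.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a small vocabulary from a local text file (placeholder path).
trainer = WordPieceTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a sentence into subword tokens and their ids.
encoding = tokenizer.encode("Tokenization splits text into subword units.")
print(encoding.tokens)
print(encoding.ids)
```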

🔹 BPE (Byte Pair Encoding) Implementations

  • OpenAI tiktoken (Rust/Python)
    • Official BPE tokenizer from OpenAI (used in GPT models), highly optimized; see the usage sketch after this list.
  • huggingface/tokenizers (Rust/Python)
    • General-purpose tokenizer supporting BPE, from Hugging Face.
  • bpe-tokenizer (Rust)
    • A Rust BPE tokenizer library that efficiently identifies and merges frequent byte pairs.
  • YouTokenToMe (C++/Python)
    • Efficient BPE tokenizer with fast training and inference, developed by VK.com.
  • 🔥 fastBPE (C++/Python)
    • Facebook’s fast and memory-efficient BPE tokenizer, widely used in NLP research.
  • 🔥 sentencepiece (C++/Python)
    • Google's SentencePiece library, which provides BPE as one of its supported algorithms.
  • subword-nmt (Python)
    • Pure-Python implementation commonly used in machine translation research; simple but slower than the compiled alternatives.
  • 🔥 rs-bpe (Rust)
    • A very fast BPE (Byte Pair Encoding) implementation written in Rust with Python bindings.
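
To show what BPE tokenization looks like in practice, here is a minimal sketch using tiktoken's published API with the cl100k_base encoding used by recent GPT models; the other libraries in this list expose similar encode/decode calls.

```python
import tiktoken

# Load a pretrained BPE encoding (cl100k_base is used by recent GPT models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Byte Pair Encoding merges frequent symbol pairs into subword tokens."
token_ids = enc.encode(text)

print(token_ids)               # list of integer token ids
print(len(token_ids), "tokens")
print(enc.decode(token_ids))   # decoding round-trips back to the original text
```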

🔹 SentencePiece Implementations


Contributing

Your contributions are always welcome! Please take a look at the contribution guidelines first.

Question

If you have any questions, feel free to send a message directly via LINE, Telegram, or WeChat using the links below.

💬 LINE

💬 Telegram

💬 WeChat
