A Ruby gem providing unified access to various AI model tokenizers, including both LLM (Language Model) and embedding model tokenizers.
- Unified Interface: Consistent API across all tokenizers
- Multiple Model Support: Tokenizers for a range of LLM and embedding models
- LLM Tokenizers: Anthropic, OpenAI, Gemini, Llama3, Qwen, Mistral
- Embedding Tokenizers: BERT, AllMpnetBaseV2, BgeLargeEn, BgeM3, MultilingualE5Large
- Common Operations: tokenize, encode, decode, size calculation, truncation
- Unicode Support: Proper handling of emoji and multibyte characters
Add this line to your application's Gemfile:
gem 'discourse_ai-tokenizers'
And then execute:
bundle install
Or install it yourself as:
gem install discourse_ai-tokenizers
require 'discourse_ai/tokenizers'
# Get token count
DiscourseAi::Tokenizers::OpenAiTokenizer.size("Hello world!")
# => 3
# Tokenize text
DiscourseAi::Tokenizers::OpenAiTokenizer.tokenize("Hello world!")
# => [9906, 1917, 0]
# Encode text to token IDs
DiscourseAi::Tokenizers::OpenAiTokenizer.encode("Hello world!")
# => [9906, 1917, 0]
# Decode token IDs back to text
DiscourseAi::Tokenizers::OpenAiTokenizer.decode([9906, 1917, 0])
# => "Hello world!"
# Truncate text to token limit
DiscourseAi::Tokenizers::OpenAiTokenizer.truncate("This is a long sentence", 5)
# => "This is a"
# Check if text is within token limit
DiscourseAi::Tokenizers::OpenAiTokenizer.below_limit?("Short text", 10)
# => true
- DiscourseAi::Tokenizers::AnthropicTokenizer - Claude models
- DiscourseAi::Tokenizers::OpenAiTokenizer - GPT models
- DiscourseAi::Tokenizers::GeminiTokenizer - Google Gemini
- DiscourseAi::Tokenizers::Llama3Tokenizer - Meta Llama 3
- DiscourseAi::Tokenizers::QwenTokenizer - Alibaba Qwen
- DiscourseAi::Tokenizers::MistralTokenizer - Mistral models
- DiscourseAi::Tokenizers::BertTokenizer - BERT-based models
- DiscourseAi::Tokenizers::AllMpnetBaseV2Tokenizer - sentence-transformers/all-mpnet-base-v2
- DiscourseAi::Tokenizers::BgeLargeEnTokenizer - BAAI/bge-large-en
- DiscourseAi::Tokenizers::BgeM3Tokenizer - BAAI/bge-m3
- DiscourseAi::Tokenizers::MultilingualE5LargeTokenizer - intfloat/multilingual-e5-large
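Embedding tokenizers expose the same class-level interface as the LLM tokenizers. For example, you can check whether a passage fits a model's input window before embedding it (a minimal sketch; the 512-token limit below is the typical BERT context size, assumed here for illustration):

require 'discourse_ai/tokenizers'

passage = "Ruby is a dynamic, open source programming language."

# Count tokens with an embedding tokenizer
DiscourseAi::Tokenizers::BertTokenizer.size(passage)

# Check the passage against an assumed 512-token input limit
DiscourseAi::Tokenizers::BertTokenizer.below_limit?(passage, 512)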
# Get all available LLM tokenizers dynamically
DiscourseAi::Tokenizers::BasicTokenizer.available_llm_tokenizers
# => [DiscourseAi::Tokenizers::AnthropicTokenizer, DiscourseAi::Tokenizers::OpenAiTokenizer, ...]
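Because every tokenizer class shares the same interface, the returned list can be iterated directly, for instance to compare how different models tokenize the same text (a small sketch using only the methods shown above):

text = "Tokenization differs between models."

DiscourseAi::Tokenizers::BasicTokenizer.available_llm_tokenizers.each do |tokenizer|
  # Each class responds to size, so counts can be compared side by side
  puts "#{tokenizer.name}: #{tokenizer.size(text)} tokens"
end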
# Strict mode ensures exact token limit compliance
DiscourseAi::Tokenizers::OpenAiTokenizer.truncate("Long text here", 5, strict: true)
# Check limits with strict mode
DiscourseAi::Tokenizers::OpenAiTokenizer.below_limit?("Text", 10, strict: true)
# Handles unicode characters properly
text = "Hello 世界 🌍 👨‍👩‍👧‍👦"
DiscourseAi::Tokenizers::OpenAiTokenizer.size(text)
# => 8
# Truncation preserves unicode integrity
truncated = DiscourseAi::Tokenizers::OpenAiTokenizer.truncate(text, 5)
# => "Hello 世界 🌍"
All tokenizers implement the following interface:
- tokenizer - Returns the underlying tokenizer instance
- tokenize(text) - Returns an array of tokens (strings or token objects)
- encode(text) - Returns an array of token IDs (integers)
- decode(token_ids) - Converts token IDs back to text
- size(text) - Returns the number of tokens in the text
- truncate(text, limit, strict: false) - Truncates text to the token limit
- below_limit?(text, limit, strict: false) - Checks whether the text is within the limit
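Since the interface is uniform, tokenizer classes are interchangeable. As a sketch, a caller-side helper could trim input for whichever model is targeted (fit_to_context is a hypothetical name, not part of the gem):

# Hypothetical helper - illustrates the shared interface, not provided by the gem
def fit_to_context(text, tokenizer, limit)
  return text if tokenizer.below_limit?(text, limit)

  tokenizer.truncate(text, limit)
end

fit_to_context("A long document...", DiscourseAi::Tokenizers::AnthropicTokenizer, 100)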
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release.
The gem includes comprehensive test suites:
# Run all tests
bundle exec rspec
# Run specific test suites
bundle exec rspec spec/discourse_ai/tokenizers/integration_spec.rb
bundle exec rspec spec/discourse_ai/tokenizers/method_consistency_spec.rb
bundle exec rspec spec/discourse_ai/tokenizers/error_handling_spec.rb
Bug reports and pull requests are welcome on GitHub. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
The gem is available as open source under the terms of the MIT License.
Everyone interacting in the DiscourseAi::Tokenizers project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.