ToCount: Lightweight Token Estimator

Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.

Installation

PyPI

Source code

Models

Rule-Based

Model Name              MAE      MSE         R²
RULE_BASED.UNIVERSAL    106.70   381647.81   0.8175
RULE_BASED.GPT_4        152.34   571795.89   0.7266
RULE_BASED.GPT_3_5      161.93   652923.59   0.6878

Tiktoken R50K

Model Name                     MAE     MSE         R²
TIKTOKEN_R50K.LINEAR_ALL       71.38   183897.01   0.8941
TIKTOKEN_R50K.LINEAR_ENGLISH   23.35   14127.92    0.9887

Tiktoken CL100K

Model Name                       MAE     MSE        R²
TIKTOKEN_CL100K.LINEAR_ALL       41.85   47949.48   0.9545
TIKTOKEN_CL100K.LINEAR_ENGLISH   21.12   17597.20   0.9839

Tiktoken O200K

Model Name                      MAE     MSE        R²
TIKTOKEN_O200K.LINEAR_ALL       25.53   20195.32   0.9777
TIKTOKEN_O200K.LINEAR_ENGLISH   20.24   15887.99   0.9859

ℹ️ The training and testing datasets are taken from LMSYS-Chat-1M [1] and WildChat [2].
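
The tables above report error metrics for each estimator. As a rough sketch of how such metrics are commonly computed, the snippet below derives the mean absolute error (MAE), mean squared error (MSE), and coefficient of determination (R², assumed here to be the third value in each row) from paired true and estimated token counts. The function name `regression_metrics` and the sample numbers are illustrative only; they are not part of ToCount and are not taken from its evaluation data.

```python
from typing import Sequence, Tuple


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (MAE, MSE, R^2) for paired true and estimated token counts."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    # Mean absolute error: average size of the miss, in tokens.
    mae = sum(abs(e) for e in errors) / n
    # Mean squared error: penalizes large misses more heavily.
    mse = sum(e * e for e in errors) / n

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return mae, mse, r2


# Made-up sample values, for illustration only.
true_counts = [10, 20, 30, 40]
estimated_counts = [12, 18, 33, 39]
mae, mse, r2 = regression_metrics(true_counts, estimated_counts)
print(f"MAE={mae:.2f} MSE={mse:.2f} R2={r2:.4f}")
```

Lower MAE and MSE and an R² closer to 1 indicate estimates that track the true token counts more closely.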

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
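
Because the returned value is an estimate (see the error figures in the Models section), token budgeting may benefit from a safety margin. The following is a minimal sketch of such a check, not part of ToCount itself: `fits_budget`, `whitespace_estimate`, and the 10% default margin are hypothetical names and values chosen for illustration. It uses only the `estimate_text_tokens` call shown above and falls back to a crude whitespace count if `tocount` is not installed.

```python
from typing import Callable


def fits_budget(
    text: str,
    budget: int,
    estimate: Callable[[str], int],
    margin: float = 0.1,
) -> bool:
    """Return True if the estimated token count, padded by `margin`, fits `budget`."""
    estimated = estimate(text)
    return estimated * (1 + margin) <= budget


def whitespace_estimate(text: str) -> int:
    # Hypothetical stand-in estimator: counts whitespace-separated words.
    # This is NOT how ToCount estimates tokens.
    return len(text.split())


try:
    from tocount import estimate_text_tokens, TextEstimator

    def tocount_estimate(text: str) -> int:
        # Same call as in the usage example above.
        return estimate_text_tokens(
            text, estimator=TextEstimator.RULE_BASED.UNIVERSAL
        )
except ImportError:
    # Assumption: fall back to the stand-in when tocount is unavailable.
    tocount_estimate = whitespace_estimate


print(fits_budget("How are you?", budget=16, estimate=tocount_estimate))
```

Passing the estimator in as a callable keeps the check independent of any particular model, so a different `TextEstimator` member can be swapped in without changing `fits_budget`.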

Issues & bug reports

Just file an issue and describe it. We'll check it ASAP! You can also send an email to tocount@openscilab.com.

  • Please complete the issue template

You can also join our Discord server.


References

[1] Zheng, Lianmin, et al. "LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.
[2] Zhao, Wenting, et al. "WildChat: 1M ChatGPT interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project, and we hope that you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money just so we can continue doing what we do ;-)
