A hands-on guide to exploring Hugging Face tokenizers across popular LLMs such as Llama, Phi-3, and StarCoder2. This project demonstrates how to encode, decode, and format text, code, and chat-style messages for large language models.
- Encode and decode text with various tokenizers
- Format multi-turn chat prompts using chat templates
- Compare tokenization outputs across models
- Visualize individual tokens and their IDs
- Supported models include:
  - `meta-llama/Meta-Llama-3.1-8B-Instruct`
  - `microsoft/phi-3-mini-4k-instruct`
  - `bigcode/starcoder2-15b`
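As a minimal sketch of the chat-template feature listed above (Phi-3 is used here because it is not gated; the message contents are illustrative):

```python
from transformers import AutoTokenizer

# Any chat model whose tokenizer ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-4k-instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does a tokenizer do?"},
]

# Render the messages into the model's expected prompt format.
# tokenize=False returns the formatted string instead of token IDs;
# add_generation_prompt=True appends the assistant turn marker.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

Each model's template inserts its own special tokens, so the same messages render differently across tokenizers.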
```
Tokenizers-using-HuggingFace/
├── Tokenizers_using_HuggingFace.ipynb
└── README.md
```
```bash
git clone https://github.com/your-username/Tokenizers-using-HuggingFace.git
cd Tokenizers-using-HuggingFace
pip install transformers
```

Optional for some models:
```bash
pip install torch
pip install sentencepiece
```

```python
from transformers import AutoTokenizer

# Llama 3.1 is gated: you may need to accept the license on the Hub and
# log in with a Hugging Face token before the tokenizer can be downloaded.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", trust_remote_code=True
)

text = "I love exploring tokenizers!"
tokens = tokenizer.encode(text)           # list of token IDs
decoded = tokenizer.batch_decode(tokens)  # one decoded string per token ID

print(tokens)
print(decoded)
```

This project is open-source and available under the MIT License.
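To illustrate the model-comparison feature listed above, the same text can be run through two openly available tokenizers side by side (a sketch; gated models such as Llama would additionally need a Hugging Face access token):

```python
from transformers import AutoTokenizer

text = "I love exploring tokenizers!"

for name in ["microsoft/phi-3-mini-4k-instruct", "bigcode/starcoder2-15b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)
    print(f"{name}: {len(ids)} tokens")
    print(tok.batch_decode(ids))  # per-token strings for visualization
```

Because each model uses a different vocabulary, the same sentence typically splits into a different number of tokens per model.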
Contributions, suggestions, and improvements are welcome! Feel free to open an issue or submit a pull request.
Created by Rishi Kora (https://github.com/Rishi-Kora). Feel free to reach out with questions or ideas!