
Explore how Hugging Face tokenizers work across models like LLaMA, PHI-3, and StarCoder2. Includes examples for encoding, decoding, chat formatting, and token visualization. Ideal for understanding text preprocessing in LLMs.


Tokenizers-using-HuggingFace

A hands-on guide to exploring Hugging Face tokenizers across popular LLMs like LLaMA, PHI-3, and StarCoder2. This project demonstrates how to encode, decode, and format text, code, and chat-style messages for large language models.


πŸ“Œ Features

  • πŸ”„ Encode and decode text with various tokenizers
  • πŸ’¬ Format multi-turn chat prompts using chat templates
  • 🧠 Compare tokenization outputs across models
  • πŸ§ͺ Visualize individual tokens and their IDs
  • 🧰 Supports models like:
    • meta-llama/Meta-Llama-3.1-8B-Instruct
    • microsoft/phi-3-mini-4k-instruct
    • bigcode/starcoder2-15b
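
As a sketch of the chat-template feature, here is how a multi-turn conversation can be rendered into a model's expected prompt format. This assumes the `transformers` library is installed and uses Phi-3 (which, unlike Meta-Llama, is not gated); the messages themselves are just an illustrative example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-4k-instruct")

# A hypothetical multi-turn conversation in the standard role/content format.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does a tokenizer do?"},
]

# Render the conversation with the model's own chat template.
# add_generation_prompt=True appends the tokens that cue the assistant's reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

Passing `tokenize=True` instead would return the token IDs directly, ready to feed into the model.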

πŸ“‚ Folder Structure

Tokenizers-using-HuggingFace/
β”œβ”€β”€ Tokenizers_using_HuggingFace.ipynb
└── README.md

πŸš€ Getting Started

1. Clone the repository

git clone https://github.com/Rishi-Kora/Tokenizers-using-HuggingFace.git
cd Tokenizers-using-HuggingFace

2. Install dependencies

pip install transformers

Optional for some models:

pip install torch
pip install sentencepiece

πŸ§ͺ Example Usage

from transformers import AutoTokenizer

# Meta-Llama models are gated on the Hugging Face Hub: accept the license
# and authenticate (e.g. `huggingface-cli login`) before downloading.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", trust_remote_code=True)

text = "I love exploring tokenizers!"
tokens = tokenizer.encode(text)           # text -> list of token IDs
decoded = tokenizer.batch_decode(tokens)  # one decoded string per token ID

print(tokens)
print(decoded)
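
To compare tokenization across models, the same pattern can be run against several tokenizers and the resulting splits printed side by side. This sketch uses two of the models listed above (downloading their tokenizer files requires network access); the sample string is arbitrary:

```python
from transformers import AutoTokenizer

sample = "def hello(): print('hi')"

for name in ["microsoft/phi-3-mini-4k-instruct", "bigcode/starcoder2-15b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(sample)
    # Decode each ID individually to visualize how the string was split.
    pieces = tok.batch_decode(ids)
    print(name)
    for i, p in zip(ids, pieces):
        print(f"  {i:>7} -> {p!r}")
```

Code-oriented tokenizers such as StarCoder2's typically split source code into fewer, more meaningful pieces than general-purpose ones, which this side-by-side output makes easy to see.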

🧠 License

This project is open-source and available under the MIT License.


🀝 Contributing

Contributions, suggestions, and improvements are welcome! Feel free to open an issue or submit a pull request.


πŸ“¬ Contact

Created by Rishi Kora (https://github.com/Rishi-Kora) – feel free to reach out with questions or ideas!
