A lightweight post-training quantization module built on top of PyTorch modules.
- PTQ Focus: quantization of all linear layers (`nn.Linear`)
- Quantization Methods: W8A32 (8-bit weights, 32-bit activations), W8A16 (8-bit weights, 16-bit activations), and W8A8 (coming soon!); see the sketch after this list
- Model Support: PyTorch models from the Hugging Face Hub
- Offline-first approach: no automatic downloads from the cloud
- Built-in benchmarking: latency and memory footprint tracking
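
To make the quantization methods above concrete, here is a minimal, TinyQ-independent sketch of what W8A32 means for a single `nn.Linear` layer: weights are stored in int8 with a per-row scale, while activations stay in fp32. The function names and the per-channel scaling choice are illustrative assumptions, not TinyQ's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_weights_int8(linear: nn.Linear):
    """Per-output-channel symmetric int8 quantization of a Linear layer's weight."""
    w = linear.weight.data                                      # (out_features, in_features), fp32
    scales = w.abs().max(dim=-1, keepdim=True).values / 127.0   # one scale per output row
    scales = scales.clamp(min=1e-8)                              # avoid division by zero
    w_int8 = torch.round(w / scales).clamp(-128, 127).to(torch.int8)
    return w_int8, scales

def w8a32_linear(x, w_int8, scales, bias=None):
    """W8A32 forward pass: activations stay fp32, weights are dequantized on the fly."""
    w = w_int8.to(torch.float32) * scales                        # dequantize weights
    return F.linear(x, w, bias)

# Quick sanity check on a toy layer
layer = nn.Linear(16, 8)
x = torch.randn(2, 16)
w_int8, scales = quantize_weights_int8(layer)
err = (layer(x) - w8a32_linear(x, w_int8, scales, layer.bias)).abs().max()
print(f"max abs error vs. fp32 Linear: {err:.6f}")
```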
```
TinyQ/
├── logs/        # Benchmark and training logs
├── models/      # Local model storage
├── tinyq.py     # Core quantization library
├── utils.py     # Utility functions
├── examples.py  # Usage examples
└── bench.py     # Benchmarking tools (Coming soon)
```
Note
TinyQ is built with efficiency in mind, to be used at the edge (locally) on both CPU- and GPU-based systems.
The requirements.txt file uses a CUDA-enabled PyTorch. For systems without CUDA, please follow the PyTorch installation guide to get the correct version.
```bash
git clone https://github.com/afondiel/TinyQ.git
cd TinyQ

# Create and activate conda environment
conda create -n tinyq "python>=3.8"
conda activate tinyq

# Install requirements
pip install -r requirements.txt
```
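
To confirm the install (and whether you got a CUDA-enabled PyTorch build), a quick check from Python:

```python
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())  # False on CPU-only builds
```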
Important
The current version works in offline mode only. Please download a PyTorch model from the Hugging Face Hub to get started. You can use the command below:
```bash
# Example: Download OPT-125M
huggingface-cli download --resume-download facebook/opt-125m --local-dir ./models/facebook/opt-125m
```
See the full Model Setup Guide for detailed instructions.
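
If you prefer to stay in Python, the same snapshot can be fetched with the `huggingface_hub` library; this mirrors the CLI command above and is a one-time online step:

```python
from huggingface_hub import snapshot_download

# Download OPT-125M into the local models/ directory
snapshot_download(repo_id="facebook/opt-125m",
                  local_dir="./models/facebook/opt-125m")
```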
```python
from tinyq import Quantizer
from utils import load_local_hf_model, get_generation

# Load model
model, tokenizer = load_local_hf_model("./models/facebook/opt-125m")

# Create the TinyQ quantizer
q = Quantizer()

# Quantize model (W8A32 or W8A16)
qmodel = q.quantize(model, q_method="w8a32")

# Save quantized model
qmodel_path = "./qmodel"
q.export(qmodel_path, qmodel)

# Test inference
prompt = "Hello, my name is"
result = get_generation(model=qmodel,
                        modality="text",
                        input_data=prompt,
                        tokenizer=tokenizer)
print(result)
```
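
To get a rough sense of the memory-footprint reduction, one simple check is to sum the bytes of all parameters and buffers before and after quantization. This is an illustrative helper reusing `model` and `qmodel` from the snippet above, not TinyQ's built-in benchmarking:

```python
def module_size_mb(m):
    """Approximate in-memory size of a module's parameters and buffers, in MB."""
    param_bytes = sum(p.numel() * p.element_size() for p in m.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in m.buffers())
    return (param_bytes + buffer_bytes) / (1024 ** 2)

print(f"Original model:  {module_size_mb(model):.1f} MB")
print(f"Quantized model: {module_size_mb(qmodel):.1f} MB")
```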
```bash
python examples.py \
    --model_path "./models/facebook/opt-125m" \
    --qm w8a32 \
    --test_inference \
    --qmodel_path "./qmodel"
```

```bash
python bench.py \
    --model_path "./models/facebook/opt-125m"
```
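
Until `bench.py` ships, a rough latency number can be collected by hand. A minimal sketch, assuming the `qmodel` and `tokenizer` from the quick-start above and that the quantized model still exposes the standard `transformers` `generate()` API:

```python
import time
import torch

inputs = tokenizer("Hello, my name is", return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    _ = qmodel.generate(**inputs, max_new_tokens=32)
    elapsed = time.perf_counter() - start

print(f"Generation latency: {elapsed:.3f} s")
```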
- W8A32 implementation
- W8A16 implementation
- Documentation and examples
- Unit tests
- W8A8 Quantization Support
- Model Support Extensions
- Additional Layer Support
- Performance Optimization
The examples below show a PyTorch model printout before and after applying W8A32 quantization.
Before:
After:
You can also use a tool like Netron to get more in-depth insight and compare both models.
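
For a quick in-terminal comparison without external tools, you can also count how many submodules are still plain `nn.Linear` in each model (again reusing `model` and `qmodel` from the quick-start; this assumes TinyQ swaps `nn.Linear` for a dedicated quantized module class):

```python
import torch.nn as nn

def count_linear_layers(m):
    """Count submodules that are still unquantized nn.Linear layers."""
    return sum(isinstance(mod, nn.Linear) for mod in m.modules())

print("nn.Linear layers in the original model: ", count_linear_layers(model))
print("nn.Linear layers in the quantized model:", count_linear_layers(qmodel))
```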
(Still to Come)
Contributions are welcome! Please see the Contributing Guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
This project started as a learning exercise from the Quantization Fundamentals course by DeepLearning.AI and Hugging Face, helping me understand the core concepts behind model quantization.
Special thanks to:
- Younes Belkada & Marc Sun for their excellent instruction and course content
- Andrew Ng and the DeepLearning.AI team for making AI education accessible and practical
- kaushikacharya for his detailed course notes that provided valuable guidance