TinyQ

A lightweight post-training quantization module built on top of PyTorch's nn.Module.

Key features

  • PTQ Focus: post-training quantization of all linear layers (nn.Linear)
  • Quantization Methods: W8A32 (8-bit weights, 32-bit activations), W8A16 (8-bit weights, 16-bit activations), W8A8 (Coming soon!); a conceptual W8A32 sketch follows this list
  • Model Support: PyTorch models from Hugging Face Hub
  • Offline-first approach: no automatic downloads from the cloud
  • Built-in benchmarking: latency and memory footprint tracking
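
To make the W8A32 scheme concrete, here is a minimal conceptual sketch of symmetric per-channel int8 weight quantization for a single nn.Linear layer, with activations kept in fp32. This illustrates the general technique, not TinyQ's internal implementation; the class and function names below are hypothetical.

import torch
import torch.nn as nn

def quantize_weight_int8(w: torch.Tensor):
    # Symmetric per-output-channel scales: map max |w| of each row to the int8 limit 127
    scales = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8)
    return q, scales

class W8A32Linear(nn.Module):
    """Linear layer with int8 weights and fp32 activations (W8A32)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        q, scales = quantize_weight_int8(linear.weight.data)
        self.register_buffer("qweight", q)       # int8 weights (about 4x smaller than fp32)
        self.register_buffer("scales", scales)   # fp32 per-channel scales
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize weights to fp32 at run time; the matmul itself stays in fp32
        w = self.qweight.float() * self.scales
        return nn.functional.linear(x, w, self.bias)

# Example: swap one layer and run a forward pass
layer = nn.Linear(16, 4)
qlayer = W8A32Linear(layer)
out = qlayer(torch.randn(2, 16))

The stored weights shrink roughly 4x while the compute path stays in fp32; W8A16 follows the same idea with 16-bit activations.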

Project Structure

TinyQ/
├── logs/              # Benchmark and training logs
├── models/            # Local model storage
├── tinyq.py           # Core quantization library
├── utils.py           # Utility functions
├── examples.py        # Usage examples
└── bench.py           # Benchmarking tools (Coming soon)

Quick Start

1. Installation

Note

TinyQ is built with efficiency in mind, for use at the edge (locally) on both CPU- and GPU-based systems.

The requirements.txt file uses a CUDA-enabled PyTorch. For systems without CUDA, please follow the PyTorch installation guide to get the correct version.

git clone https://github.com/afondiel/TinyQ.git
cd TinyQ

# Create and activate conda environment
conda create -n tinyq "python>=3.8"
conda activate tinyq

# Install requirements
pip install -r requirements.txt
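
After installation, you can quickly check which PyTorch build you have and whether CUDA is visible to it:

import torch

print(torch.__version__)           # e.g. a +cpu or +cu12x build
print(torch.cuda.is_available())   # False on CPU-only systems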

2. Download a Model

Important

The current version works in offline mode only. Please download a PyTorch model from the Hugging Face Hub before starting. You can use the command below:

# Example: Download OPT-125M
huggingface-cli download --resume-download facebook/opt-125m --local-dir ./models/facebook/opt-125m
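
If you prefer to stay in Python, the same download can be done with the huggingface_hub library (assuming it is available in your environment, e.g. as a dependency of transformers):

from huggingface_hub import snapshot_download

# Fetch OPT-125M into the local models folder used by the examples
snapshot_download(repo_id="facebook/opt-125m",
                  local_dir="./models/facebook/opt-125m")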

See the full Model Setup Guide for detailed instructions.

3. Run Quantization

from tinyq import Quantizer
from utils import load_local_hf_model, get_generation

# Load model
model, tokenizer = load_local_hf_model("./models/facebook/opt-125m")

# Create TinyQ quantizer object
q = Quantizer()

# Quantize model (q_method: "w8a32" or "w8a16")
qmodel = q.quantize(model, q_method="w8a32")

# Save quantized model
qmodel_path = "./qmodel"
q.export(qmodel_path, qmodel)

# Test inference
prompt = "Hello, my name is"
result = get_generation(model=qmodel, 
                        modality="text", 
                        input_data=prompt, 
                        tokenizer=tokenizer)

print(result)
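
As a quick sanity check of the memory savings (separate from TinyQ's built-in benchmarking), you can compare the in-memory footprint of the two models with plain PyTorch. This is a generic sketch and assumes quantize returns a new model object rather than modifying model in place:

def model_size_mb(m):
    # Count parameter and buffer bytes (int8 weights are typically stored as buffers)
    n_bytes = sum(p.numel() * p.element_size() for p in m.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in m.buffers())
    return n_bytes / 1024**2

print(f"original:  {model_size_mb(model):.1f} MB")
print(f"quantized: {model_size_mb(qmodel):.1f} MB")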

Usage

1. CLI Mode

python examples.py \
    --model_path "./models/facebook/opt-125m" \
    --qm w8a32 \
    --test_inference \
    --qmodel_path "./qmodel"

2. Run Performance Benchmarking

python bench.py \
    --model_path "./models/facebook/opt-125m"

Roadmap

Current Focus

  • W8A32 implementation
  • W8A16 implementation
  • Documentation and examples
  • Unit tests

Core Features

  • W8A8 Quantization Support
  • Model Support Extensions
  • Additional Layer Support
  • Performance Optimization

Demo

The example below shows a PyTorch model printout before and after applying W8A32 quantization.

Before:

After:

You can also use a tool like Netron to get more in-depth insight and compare both models.

Benchmark Demo

(Still to Come)

Contributing

Contributions are welcome! Please see the Contributing Guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project started as a learning exercise from the Quantization Fundamentals course by DeepLearning.AI and Hugging Face, helping me understand the core concepts behind model quantization.

Special thanks to:

  • Younes Belkada & Marc Sun for their excellent instruction and course content
  • Andrew Ng and the DeepLearning.AI team for making AI education accessible and practical
  • kaushikacharya for his detailed course notes that provided valuable guidance