Model quantization tool with a NiceGUI web interface for AWQ, NVFP4, and GGUF quantization methods.
## Quantization Methods
- AWQ (Activation-aware Weight Quantization): 4-bit integer quantization
- NVFP4 (NVIDIA FP4): 4-bit floating-point quantization
- GGUF (GGML Universal File): Multiple quantization levels (Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.) using llama.cpp
## Web Interface
- Real-time GPU monitoring with visual charts (Highcharts)
- Streaming logs during quantization
- Robust job cancellation (terminates subprocess and all children)
- Easy configuration forms
- Output model management
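One way the "terminates subprocess and all children" cancellation above can work is to launch each job in its own process group and signal the whole group. This is a minimal POSIX-only sketch, not msquant's actual implementation; the function names are illustrative.

```python
import os
import signal
import subprocess


def start_job(cmd: list[str]) -> subprocess.Popen:
    """Launch a quantization job in its own session/process group so
    the whole process tree can be cancelled at once."""
    return subprocess.Popen(cmd, start_new_session=True)


def cancel_job(proc: subprocess.Popen, timeout: float = 5.0) -> None:
    """Send SIGTERM to the job's process group (the subprocess and all
    of its children), escalating to SIGKILL if it does not exit."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```

Killing the process group rather than just the parent matters here: quantization jobs often spawn worker processes (e.g. llama.cpp subprocesses) that would otherwise be orphaned.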
## Infrastructure
- Docker support with NVIDIA GPU runtime
- CI/CD with GitHub Actions
- Local development with Pixi
## Requirements
- Python 3.10+
- NVIDIA GPU with CUDA support (for quantization)
- Docker with NVIDIA runtime (for containerized deployment)
```bash
# Install Pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Run the application
pixi run dev
```

Then visit http://localhost:8080.
To run with Docker instead:

```bash
cd docker
docker compose up --build
```

Then visit http://localhost:8080.
Environment variables:

- `HF_HOME`: HuggingFace cache directory (default: `/workspace/hf`)
- `HF_DATASETS_CACHE`: Datasets cache directory (default: `/workspace/hf/datasets`)
- `OUT_DIR`: Output directory for quantized models (default: `/workspace/out`)
- `PORT`: Application port (default: `8080`)
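The environment-variable defaults above could be resolved in code along these lines; `AppConfig` is a hypothetical name for illustration, not msquant's actual settings class.

```python
import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    """Resolve runtime settings from the environment, falling back to
    the documented defaults when a variable is unset."""
    hf_home: str = field(
        default_factory=lambda: os.environ.get("HF_HOME", "/workspace/hf"))
    datasets_cache: str = field(
        default_factory=lambda: os.environ.get("HF_DATASETS_CACHE", "/workspace/hf/datasets"))
    out_dir: str = field(
        default_factory=lambda: os.environ.get("OUT_DIR", "/workspace/out"))
    port: int = field(
        default_factory=lambda: int(os.environ.get("PORT", "8080")))
```

Using `default_factory` means the environment is read at instantiation time, so each `AppConfig()` reflects the current process environment.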
```
msquant/
├── src/
│   └── msquant/
│       ├── app/              # NiceGUI application
│       │   ├── main.py       # Application entry point
│       │   ├── pages/        # UI pages
│       │   └── components/   # Reusable UI components
│       ├── core/             # Core functionality
│       │   ├── quantizer/    # Quantization engine
│       │   └── monitoring/   # GPU monitoring
│       └── services/         # Background services
├── docker/                   # Docker configuration
│   ├── Dockerfile.gpu
│   └── docker-compose.yml
├── .github/
│   └── workflows/            # CI/CD workflows
├── pixi.toml                 # Pixi configuration
└── README.md
```
```bash
# Run development server
pixi run dev

# Lint code
pixi run lint

# Format code
pixi run fmt

# Type checking
pixi run typecheck

# Run tests
pixi run test
```

To add dependencies, edit `pixi.toml`:

- Add to `[dependencies]` for conda packages
- Add to `[pypi-dependencies]` for PyPI packages
Then run:

```bash
pixi install
```

Navigate to the Configure page and set:
- Model ID (e.g., `meta-llama/Llama-3.1-8B`)
- Quantization method (AWQ, NVFP4, or GGUF)
- Calibration dataset settings (required for AWQ and NVFP4)
- Method-specific parameters:
- AWQ: Weight bits, group size, zero point
- NVFP4: Activation/weight schemes
- GGUF: Quantization type (Q4_K_M recommended, Q5_K_M for best quality), intermediate format (f16 default)
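As a sketch of how the form inputs above might be validated into a job configuration (hypothetical names, not msquant's actual schema):

```python
# Quantization types listed in this README; illustrative subset only.
VALID_GGUF_TYPES = {"Q2_K", "Q3_K", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"}


def build_job_config(method: str, **params) -> dict:
    """Return a normalized job config, rejecting unknown methods and
    unsupported GGUF quantization types."""
    method = method.upper()
    if method not in {"AWQ", "NVFP4", "GGUF"}:
        raise ValueError(f"unknown quantization method: {method}")
    if method == "GGUF":
        # Default to Q4_K_M, the recommended balanced choice.
        qtype = params.get("quant_type", "Q4_K_M")
        if qtype not in VALID_GGUF_TYPES:
            raise ValueError(f"unsupported GGUF type: {qtype}")
        params["quant_type"] = qtype
    return {"method": method, **params}
```

Validating early, before any model download or calibration run starts, keeps failures cheap.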
Note:
- AWQ and NVFP4 output formats follow llmcompressor conventions (binary or safetensors)
- GGUF produces `.gguf` files compatible with llama.cpp, Ollama, and other GGUF-compatible inference engines
- GGUF quantization types:
- Q4_K_M: Recommended for balanced quality and size
- Q5_K_M: Best quality while maintaining reasonable size
- Q6_K, Q8_0: Higher precision options
- Q2_K, Q3_K: Smaller sizes with reduced quality
The Monitor page shows:
- Job status and logs with streaming updates
- Real-time GPU metrics with visual charts (utilization, memory, temperature, power)
- GPU selector for multi-GPU systems
- Cancel button to terminate running jobs
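The GPU metrics shown on the Monitor page can be collected by polling `nvidia-smi` in CSV mode. Below is a sketch of parsing one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw --format=csv,noheader,nounits` output; the function name and dict keys are illustrative, not msquant's actual code.

```python
def parse_gpu_metrics(csv_line: str) -> dict:
    """Parse one CSV line of nvidia-smi query output (no units, no
    header) into a metrics dict suitable for charting."""
    util, mem, temp, power = (field.strip() for field in csv_line.split(","))
    return {
        "utilization_pct": float(util),   # GPU utilization, percent
        "memory_used_mib": float(mem),    # memory.used is reported in MiB
        "temperature_c": float(temp),     # core temperature, Celsius
        "power_w": float(power),          # instantaneous power draw, watts
    }
```

A monitor loop would run the query once per interval (one output line per GPU on multi-GPU systems) and push the parsed values to the charts.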
The Results page lists:
- Quantized model outputs
- Cache information
- Model sizes and paths
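Gathering the sizes and paths listed above might look like the following sketch (hypothetical helper, assuming one output directory or file per quantized model under the output root):

```python
from pathlib import Path


def list_outputs(out_dir: str) -> list[dict]:
    """Collect path and total size for each entry under the output
    root, summing file sizes recursively for model directories."""
    results = []
    for entry in sorted(Path(out_dir).iterdir()):
        if entry.is_dir():
            size = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
        elif entry.is_file():
            size = entry.stat().st_size
        else:
            continue  # skip sockets, broken symlinks, etc.
        results.append({"path": str(entry), "size_bytes": size})
    return results
```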
On PR open/update:
- Linting and type checking
- Unit tests
- Docker build (no push)
On merge to main:
- Docker image build and push to GHCR
- Tagged with `latest` and `sha-<commit>`
Images are published to GitHub Container Registry:
```bash
# Pull latest
docker pull ghcr.io/OWNER/msquant:latest

# Pull specific commit
docker pull ghcr.io/OWNER/msquant:sha-abc1234
```

Replace OWNER with your GitHub username/organization.
[Add your license here]
Built with:
- NiceGUI - Web interface
- llmcompressor - Quantization engine for AWQ/NVFP4
- llama.cpp - GGUF quantization and inference
- vLLM - LLM inference
- Pixi - Package management