@codewithdark-git
Owner

This commit introduces a focused GGUF quantization method with significant enhancements to memory efficiency, performance, and benchmarking capabilities.

Key changes include:

1.  **`quantllm/quant/gguf.py`**:
    *   Implemented a memory-efficient and performant `GGUFQuantizer`.
    *   Added robust error handling for quantization parameters and device-specific issues.
    *   Removed dependencies on AWQ/GPTQ from the GGUF quantizer.
    *   Improved GGUF file conversion logic.
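
    The core idea behind GGUF-style quantization (splitting weights into fixed-size groups and storing low-bit integers plus one shared scale per group) can be sketched in plain Python. This is an illustrative sketch of the technique, not the `GGUFQuantizer` implementation:

    ```python
    def quantize_group(values, bits=4):
        """Symmetric k-bit quantization of one weight group with a shared scale."""
        qmax = 2 ** (bits - 1) - 1                          # e.g. 7 for 4-bit
        scale = max(abs(v) for v in values) / qmax or 1.0   # avoid zero scale
        quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
        return quantized, scale

    def dequantize_group(quantized, scale):
        """Recover approximate weights from integers and the group scale."""
        return [q * scale for q in quantized]

    weights = [0.12, -0.53, 0.88, -0.07]
    q, scale = quantize_group(weights, bits=4)
    restored = dequantize_group(q, scale)
    ```

    Each group pays the storage cost of one scale, so larger `group_size` means better compression but coarser reconstruction; the rounding error per weight is bounded by half the group's scale.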

2.  **`quantllm/utils/benchmark.py`**:
    *   Significantly enhanced `QuantizationBenchmark` to provide comprehensive metrics.
    *   Added tracking for peak/final memory usage, memory efficiency, and model compression ratio.
    *   Implemented detailed latency measurements (mean, p90, p95, p99).
    *   Added throughput calculation (tokens/sec, inferences/sec).
    *   Integrated GPU utilization monitoring (mean, peak) using pynvml.
    *   Included a proper warm-up phase for benchmarking.
    *   Structured the benchmark's output for easy parsing of results.
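
    The tail-latency metrics above can be computed with a simple nearest-rank percentile over the collected samples. A minimal sketch (function name and warm-up handling are illustrative, not the `QuantizationBenchmark` API):

    ```python
    import math

    def latency_stats(samples_ms, warmup=0):
        """Mean and nearest-rank tail percentiles over latency samples,
        discarding the first `warmup` measurements (the warm-up phase)."""
        s = sorted(samples_ms[warmup:])
        def pct(p):
            # nearest-rank: the value at index ceil(p/100 * n) - 1
            return s[max(0, math.ceil(p / 100 * len(s)) - 1)]
        return {"mean": sum(s) / len(s),
                "p90": pct(90), "p95": pct(95), "p99": pct(99)}
    ```

    Discarding warm-up samples matters because the first few inferences typically include one-time costs (CUDA kernel compilation, cache population) that would skew both the mean and the tail.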

3.  **`quantllm/utils/memory_tracker.py`**:
    *   Implemented `MemoryTracker` for detailed logging of GPU and CPU memory usage at named checkpoints throughout a run.
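
    Checkpoint-based tracking can be sketched with the standard library's `tracemalloc` for the CPU side. The class and method names below are illustrative only (the real `MemoryTracker` also reports GPU memory):

    ```python
    import tracemalloc

    class CPUMemoryTracker:
        """Sketch: record current/peak heap usage at named checkpoints."""
        def __init__(self):
            tracemalloc.start()
            self.checkpoints = []

        def checkpoint(self, name):
            current, peak = tracemalloc.get_traced_memory()
            self.checkpoints.append((name, current, peak))
            return current

    tracker = CPUMemoryTracker()
    tracker.checkpoint("before_alloc")
    buffer = [0.0] * 200_000          # a large allocation between checkpoints
    tracker.checkpoint("after_alloc")
    ```

    Comparing consecutive checkpoints pinpoints which step of a pipeline (load, quantize, convert) is responsible for a memory spike.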

4.  **`test/test_gguf_quantization.py`**:
    *   Created a new test suite for GGUF quantization.
    *   Includes unit tests for various GGUF configurations (bits, group_size).
    *   Includes integration tests with different models ("facebook/opt-125m", "facebook/opt-350m").
    *   Added benchmark validation tests to verify the `QuantizationBenchmark` utility.
    *   Added basic memory leak tests.
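
    A unit check over `(bits, group_size)` configurations might look like the following. The accepted values here are assumptions for illustration, not the suite's exact rules:

    ```python
    # Bit widths commonly supported by GGUF quantization types (assumed set).
    VALID_BITS = (2, 3, 4, 5, 6, 8)

    def validate_config(bits, group_size):
        """Reject invalid quantization parameters early with a clear message."""
        if bits not in VALID_BITS:
            raise ValueError(f"unsupported bit width: {bits}")
        if group_size <= 0 or group_size % 8 != 0:
            raise ValueError(f"group_size must be a positive multiple of 8, got {group_size}")
        return True
    ```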

5.  **`benchmark/run_benchmarks.py`**:
    *   Created a new script to run comprehensive GGUF benchmarks.
    *   Supports benchmarking multiple models and GGUF configurations.
    *   Outputs results in a structured format, leveraging the enhanced `QuantizationBenchmark`.
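
    The shape of such a runner, sweeping the model/configuration grid and emitting structured results, can be sketched as follows (the `run_one` stand-in replaces the real quantize-and-measure step):

    ```python
    import itertools
    import json

    MODELS = ["facebook/opt-125m", "facebook/opt-350m"]
    CONFIGS = [{"bits": 4, "group_size": 32}, {"bits": 8, "group_size": 64}]

    def run_one(model_name, config):
        # Stand-in for quantizing the model and collecting benchmark metrics.
        return {"model": model_name, **config, "latency_ms": None}

    results = [run_one(m, c) for m, c in itertools.product(MODELS, CONFIGS)]
    print(json.dumps(results, indent=2))   # structured, machine-parseable output
    ```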

General Improvements:
*   Ensured clean code structure and added/updated documentation across all modified files.
*   Focused on efficient memory management, including strategic deletion of large objects, garbage collection, and CUDA cache clearing.
*   Improved error handling with clear messages.
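
The deletion/GC/cache-clearing pattern mentioned above follows a common recipe; a minimal sketch (the helper name is illustrative):

```python
import gc

def release_memory():
    """Force a garbage-collection pass; clear the CUDA allocator cache
    when torch is installed and a GPU is present."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

model = [0] * 100_000   # stand-in for a large model object
del model               # drop the last reference before collecting
release_memory()
```

Note that `torch.cuda.empty_cache()` only returns cached blocks to the driver; it cannot free tensors that are still referenced, which is why the explicit `del` comes first.
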
codewithdark-git merged commit afe8571 into main on May 26, 2025
1 check passed
codewithdark-git deleted the feat/gguf-quantization-optim branch on May 27, 2025