docs: add quantization and energy efficiency guide #1882

Open
hongping-zh wants to merge 1 commit into bitsandbytes-foundation:main from hongping-zh:docs/quantization-performance-guide

Conversation

@hongping-zh

Summary

Adds a new documentation page explaining the energy efficiency implications of different quantization configurations, based on systematic benchmarking across multiple GPU architectures.

This PR addresses the documentation request from @TimDettmers in #1867:

"A documentation PR adding guidance on when quantization may not improve energy efficiency would be welcome."

What this guide covers

  1. INT8 mixed-precision decomposition overhead: Why default LLM.int8() may increase energy consumption by 17–33% vs FP16, and why this is a justified accuracy trade-off
  2. Why threshold=0.0 is not recommended: Perplexity data showing +25% degradation vs only −3% energy savings
  3. NF4 small model overhead: Dequantization cost exceeding memory bandwidth savings for models <5B parameters
  4. Crossover point: ~5B parameters, validated across RTX 5090 and RTX 4090D
  5. Batch size impact: 84–96% energy reduction from BS=1 to BS=8–64
  6. Configuration guidelines: Recommendations organized by priority (memory/accuracy/energy) and model size
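
The mixed-precision decomposition behind items 1–2 can be sketched in a few lines. This is a toy numpy model of the idea, not the actual bitsandbytes CUDA kernels; the function name, shapes, and scaling scheme are illustrative only:

```python
import numpy as np

def int8_decomposed_matmul(x, w, threshold=6.0):
    """Toy LLM.int8()-style matmul (x @ w): outliers in float, rest in INT8.

    Feature columns of `x` whose max |value| exceeds `threshold` bypass
    quantization; the remaining features use absmax INT8 quantization
    (per-row scales for x, per-column scales for w). threshold <= 0
    mimics bitsandbytes' threshold=0.0, i.e. no outlier path at all.
    """
    if threshold > 0:
        outlier = np.abs(x).max(axis=0) > threshold
    else:
        outlier = np.zeros(x.shape[1], dtype=bool)

    y = np.zeros((x.shape[0], w.shape[1]))
    if outlier.any():
        # Full-precision path for outlier features: this extra pass is the
        # source of the energy overhead the guide measures.
        y += x[:, outlier] @ w[outlier, :]

    rest = ~outlier
    if rest.any():
        xs = np.abs(x[:, rest]).max(axis=1, keepdims=True) / 127.0
        ws = np.abs(w[rest, :]).max(axis=0, keepdims=True) / 127.0
        xs[xs == 0] = 1.0
        ws[ws == 0] = 1.0
        xq = np.rint(x[:, rest] / xs).astype(np.int8)
        wq = np.rint(w[rest, :] / ws).astype(np.int8)
        # INT8 matmul accumulated in int32, then dequantized.
        y += (xq.astype(np.int32) @ wq.astype(np.int32)) * xs * ws
    return y
```

With threshold=0.0 an outlier column is forced through the shared absmax row scale, crushing the precision of every other feature in that row; that is a toy version of the mechanism behind the +25% perplexity jump in the data below.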

Key data points

| Configuration | Perplexity (Yi-1.5-6B) | PPL Δ vs FP16 | Energy Δ vs FP16 |
|---|---|---|---|
| FP16 (baseline) | 11.16 | – | – |
| INT8 Default (threshold=6.0) | 11.20 | +0.33% | +32.7% |
| INT8 Pure (threshold=0.0) | 14.00 | +25.38% | −3.1% |

Key takeaway: The default threshold=6.0 does an excellent job preserving accuracy (+0.33% PPL). The energy overhead is the justified cost of mixed-precision decomposition.
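
For context, the two configurations in the table correspond to the `llm_int8_threshold` knob on `BitsAndBytesConfig`. A configuration sketch (the model id is taken from the benchmark list and is a placeholder; this is not runnable without a GPU and the model weights):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Default mixed-precision decomposition: outlier features (|x| > 6.0)
# are computed in higher precision.
cfg_default = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# threshold=0.0 disables the outlier path (pure INT8): slightly cheaper,
# but at a severe perplexity cost per the table above.
cfg_pure = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-1.5-6B",            # placeholder: one of the benchmarked models
    quantization_config=cfg_default,
    device_map="auto",
)
```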

Methodology

  • NVML power monitoring at 10 Hz, n=10 per configuration, CV < 3%
  • Hardware: RTX 5090, RTX 4090D, A800
  • Models: Yi-1.5-6B, Mistral-7B, Phi-3-mini, Qwen2.5-7B, TinyLlama-1.1B, Qwen2-1.5B, Qwen2.5-3B, Qwen2-7B
  • Perplexity: WikiText-2 test split
  • Full data: https://github.com/hongping-zh/ecocompute-ai
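
For anyone reproducing the methodology, turning fixed-rate NVML power samples into an energy figure is straightforward. A minimal sketch (in practice the samples would come from pynvml's `nvmlDeviceGetPowerUsage`, which reports milliwatts, polled at the 10 Hz rate above):

```python
def energy_from_power_samples(samples_mw, interval_s=0.1):
    """Mean power (W) and total energy (J) from fixed-interval power samples.

    samples_mw: power readings in milliwatts (NVML's native unit), taken
    every `interval_s` seconds (0.1 s matches the 10 Hz sampling rate).
    """
    watts = [mw / 1000.0 for mw in samples_mw]
    mean_w = sum(watts) / len(watts)
    joules = mean_w * interval_s * len(watts)  # rectangle-rule integration
    return mean_w, joules

# 50 samples at a steady 250 W over 5 s correspond to 1250 J.
```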

File changes

  • Added: docs/source/explanations/quantization_performance.mdx

Related

  • #1867 (documentation request from @TimDettmers)

Notes

  • This guide validates the current default configuration rather than suggesting changes
  • The .mdx format follows existing documentation style in docs/source/explanations/
  • Happy to adjust scope, framing, or placement based on maintainer feedback

This PR adds a comprehensive energy efficiency guide for INT8 quantization, detailing its impact on energy consumption and providing recommendations for optimization based on recent benchmarking results.
@matthewdouglas matthewdouglas added the Documentation Improvements or additions to documentation label Feb 24, 2026
Comment on lines +1 to +32
# bitsandbytes Documentation PR Draft

## PR Title
Add Energy Efficiency Guide for INT8 Quantization

## PR Description

### Summary
This PR adds a comprehensive energy efficiency guide to help users understand and optimize the energy consumption of INT8 quantization.

### Motivation
Recent benchmarking on consumer GPUs (RTX 4090D, RTX 5090) revealed that **default LLM.int8() configuration can increase energy consumption by 17-33%** compared to FP16, contrary to common assumptions. This guide helps users:

1. Understand the energy implications of different INT8 configurations
2. Choose appropriate settings for their use cases
3. Avoid unintended energy waste in production deployments

### Changes
- Added `docs/source/guides/energy_efficiency.md`
- Added energy efficiency section to main documentation index
- Included benchmark results and recommendations

### References
- Benchmark repository: https://github.com/hongping-zh/ecocompute-ai
- Interactive dashboard: https://hongping-zh.github.io/ecocompute-dynamic-eval/
- Full research paper: (arXiv link pending)

---

## File: `docs/source/guides/energy_efficiency.md`

```markdown

It doesn't seem like any of this content was meant to be included in the actual doc files.

@matthewdouglas
Member

I think some of this information might actually fit in the transformers docs as well, e.g. this section here:
https://huggingface.co/docs/transformers/main/en/quantization#outlier-threshold
cc @SunMarc wdyt about including something in those docs?

To me, as it is right now, it seems a bit verbose though, so it would be better off as a small note with a concise explanation of the tradeoffs.

With that said, maybe it also fits in our FAQ page.

I'm also curious if you can share that PPL benchmark for other models you mentioned, or some that are even a bit larger in the 9B - 40B range for dense LLMs.

@TimDettmers may have some feedback here as well!


@matthewdouglas matthewdouglas left a comment


PR Review: #1882 — docs: add quantization and energy efficiency guide

Adds a documentation page on energy efficiency implications of quantization, based on the contributor's benchmarking from issue #1867. The data and issue discussion are genuine and the topic is worth documenting. Several blocking issues need to be resolved first.

Blocking issues (4):

1. File content is a PR draft, not documentation

As noted in the inline review comment, the committed file contains PR metadata ("# bitsandbytes Documentation PR Draft", "## PR Title", "## PR Description") rather than actual documentation content. The real content is embedded inside a fenced ```markdown block within the file. The author appears to have accidentally committed their drafting notes rather than the documentation itself. This also explains the CI documentation build failure.

2. Wrong path and missing _toctree.yml entry

The file was committed to docs/source/quantization_performance.mdx instead of docs/source/explanations/quantization_performance.mdx (as the PR description states it should be). More importantly, there is no addition to docs/source/_toctree.yml, so the page wouldn't appear in navigation regardless of path. These two issues together account for the CI failure.

3. Scope, placement, and verbosity need resolution before this can land

As noted in the PR comments, the content as written is too verbose for a standalone explanations page, and the right home for it isn't settled: it could be a concise note in the existing Transformers quantization docs (cc'd @SunMarc), a trimmed entry in the bitsandbytes FAQ, or a shorter explanations page. The author should align with maintainer preference on placement before investing in a full rewrite, since the required edits differ significantly by target format.

4. threshold=0.0 recommendation contradicts maintainer guidance, and the PPL dataset is too narrow

The guide presents threshold=0.0 as a "For Energy-Critical Deployments" recommendation. This contradicts Tim Dettmers' explicit comment in #1867 ("threshold=0.0 isn't a recommended setting for quality-sensitive workloads"). The contributor's own data shows +25.38% PPL degradation on Yi-1.5-6B, which is a severe accuracy cost for a −3.1% energy saving. Any guidance on threshold tuning should be framed as "understand the tradeoff and validate per-model," not as a prescriptive recommendation.

Additionally, as raised in the PR comments, the perplexity data is limited to a 6B model. Data from larger dense models (9B–40B range) would be needed before the guide could make reliable generalization claims about model-size crossover points and threshold behavior.

Minor issues (fix alongside the above):

  • The monitoring code block mixes shell commands (nvidia-smi dmon -s u) into a Python block
  • The "Expected improvements" percentages in the threshold=0.0 box are drawn from A800 batch experiments, not the consumer GPU single-batch data shown in the main table — the different hardware/batch context should be labeled or the numbers reconciled
  • The BibTeX citation block asking readers to cite the contributor's personal benchmark repo is not appropriate for official project documentation; attribution via links is sufficient
  • The A100/H100 section is speculative ("Further validation needed") and shouldn't be included until data exists
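
On the first minor issue, one way to fix the mixed block is to keep `nvidia-smi dmon -s u` in its own shell fence and do the in-process sampling purely in Python. A sketch of the sampling loop (the injected `read_mw` callable stands in for pynvml on machines without a GPU; with one, it would wrap `pynvml.nvmlDeviceGetPowerUsage`):

```python
import time

def sample_power(read_mw, hz=10, duration_s=1.0):
    """Poll a power reader at `hz` Hz and return the samples in milliwatts.

    `read_mw` is any zero-argument callable returning milliwatts. With a
    GPU present it could be, e.g.:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        read_mw = lambda: pynvml.nvmlDeviceGetPowerUsage(handle)
    """
    samples, interval = [], 1.0 / hz
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append(read_mw())
        time.sleep(interval)
    return samples
```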

Suggested path forward:

  1. Resolve the placement question (FAQ entry vs. Transformers docs note vs. short explanations page) with maintainer input
  2. Once placement is agreed, write the actual .mdx content directly — trimmed to match the target format
  3. Add the _toctree.yml entry if it's staying in the bitsandbytes docs
  4. Expand PPL data to larger models before making threshold guidance claims
  5. Reframe threshold=0.0 as a documented tradeoff rather than a deployment recommendation

Review checklist:

  • Security: Clear
  • Downstream impact: None (docs-only)
  • Tests: N/A
  • CI: Fails (documentation build)
