docs: add quantization and energy efficiency guide#1882
hongping-zh wants to merge 1 commit into `bitsandbytes-foundation:main` from
Conversation
This PR adds a comprehensive energy efficiency guide for INT8 quantization, detailing its impact on energy consumption and providing recommendations for optimization based on recent benchmarking results.
# bitsandbytes Documentation PR Draft

## PR Title
Add Energy Efficiency Guide for INT8 Quantization

## PR Description

### Summary
This PR adds a comprehensive energy efficiency guide to help users understand and optimize the energy consumption of INT8 quantization.

### Motivation
Recent benchmarking on consumer GPUs (RTX 4090D, RTX 5090) revealed that **default LLM.int8() configuration can increase energy consumption by 17-33%** compared to FP16, contrary to common assumptions. This guide helps users:

1. Understand the energy implications of different INT8 configurations
2. Choose appropriate settings for their use cases
3. Avoid unintended energy waste in production deployments

### Changes
- Added `docs/source/guides/energy_efficiency.md`
- Added energy efficiency section to main documentation index
- Included benchmark results and recommendations

### References
- Benchmark repository: https://github.com/hongping-zh/ecocompute-ai
- Interactive dashboard: https://hongping-zh.github.io/ecocompute-dynamic-eval/
- Full research paper: (arXiv link pending)

---

## File: `docs/source/guides/energy_efficiency.md`

```markdown
It doesn't seem like any of this content was meant to be included in the actual doc files.
I think some of this information might actually fit in the transformers docs as well, e.g. this section here: To me, as it is right now, it seems a bit verbose, so it would be better off as a small note with a concise explanation of the tradeoffs. With that said, maybe it also fits in our FAQ page. I'm also curious whether you can share that PPL benchmark for the other models you mentioned, or some that are even a bit larger, in the 9B–40B range for dense LLMs. @TimDettmers may have some feedback here as well!
matthewdouglas left a comment:
PR Review: #1882 — docs: add quantization and energy efficiency guide
Adds a documentation page on energy efficiency implications of quantization, based on the contributor's benchmarking from issue #1867. The data and issue discussion are genuine and the topic is worth documenting. Several blocking issues need to be resolved first.
Blocking issues (4):
1. File content is a PR draft, not documentation
As noted in the inline review comment, the committed file contains PR metadata ("# bitsandbytes Documentation PR Draft", "## PR Title", "## PR Description") rather than actual documentation content. The real content is embedded inside a fenced ```markdown block within the file. The author appears to have accidentally committed their drafting notes rather than the documentation itself. This also explains the CI documentation build failure.
2. Wrong path and missing _toctree.yml entry
The file was committed to docs/source/quantization_performance.mdx instead of docs/source/explanations/quantization_performance.mdx (as the PR description states it should be). More importantly, there is no addition to docs/source/_toctree.yml, so the page wouldn't appear in navigation regardless of path. These two issues together account for the CI failure.
3. Scope, placement, and verbosity need resolution before this can land
As noted in the PR comments, the content as written is too verbose for a standalone explanations page, and the right home for it isn't settled: it could be a concise note in the existing Transformers quantization docs (cc'd @SunMarc), a trimmed entry in the bitsandbytes FAQ, or a shorter explanations page. The author should align with maintainer preference on placement before investing in a full rewrite, since the required edits differ significantly by target format.
4. threshold=0.0 recommendation contradicts maintainer guidance, and the PPL dataset is too narrow
The guide presents threshold=0.0 as a "For Energy-Critical Deployments" recommendation. This contradicts Tim Dettmers' explicit comment in #1867 ("threshold=0.0 isn't a recommended setting for quality-sensitive workloads"). The contributor's own data shows +25.38% PPL degradation on Yi-1.5-6B, which is a severe accuracy cost for a −3.1% energy saving. Any guidance on threshold tuning should be framed as "understand the tradeoff and validate per-model," not as a prescriptive recommendation.
Additionally, as raised in the PR comments, the perplexity data is limited to a 6B model. Data from larger dense models (9B–40B range) would be needed before the guide could make reliable generalization claims about model-size crossover points and threshold behavior.
Minor issues (fix alongside the above):
- The monitoring code block mixes shell commands (`nvidia-smi dmon -s u`) into a Python block
- The "Expected improvements" percentages in the `threshold=0.0` box are drawn from A800 batch experiments, not the consumer GPU single-batch data shown in the main table; the different hardware/batch context should be labeled or the numbers reconciled
- The BibTeX citation block asking readers to cite the contributor's personal benchmark repo is not appropriate for official project documentation; attribution via links is sufficient
- The A100/H100 section is speculative ("Further validation needed") and shouldn't be included until data exists
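On the first minor issue: one way to keep the monitoring entirely in Python is to integrate sampled power readings directly rather than shelling out to `nvidia-smi`. The sketch below is illustrative and not from the PR; `energy_joules` is a hypothetical helper, and in practice the `(time_s, power_w)` samples would come from pynvml (`nvmlDeviceGetPowerUsage` returns milliwatts).

```python
# Illustrative sketch (not from the PR): keep GPU power monitoring in
# pure Python instead of mixing `nvidia-smi dmon` shell commands into a
# Python block. In practice, each (time_s, power_w) sample would come
# from pynvml: nvmlDeviceGetPowerUsage(handle) / 1000.0, polled in a loop.

def energy_joules(samples):
    """Trapezoidal integration of (time_s, power_w) samples -> joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total

# Constant 100 W sampled over 2 seconds -> 200 J
print(energy_joules([(0.0, 100.0), (1.0, 100.0), (2.0, 100.0)]))  # -> 200.0
```

Integrating energy this way also makes the FP16 vs INT8 comparison reproducible from the same script, rather than depending on how a separate shell monitor was aligned with the inference run.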
Suggested path forward:
- Resolve the placement question (FAQ entry vs. Transformers docs note vs. short explanations page) with maintainer input
- Once placement is agreed, write the actual `.mdx` content directly, trimmed to match the target format
- Add the `_toctree.yml` entry if it's staying in the bitsandbytes docs
- Expand PPL data to larger models before making threshold guidance claims
- Reframe `threshold=0.0` as a documented tradeoff rather than a deployment recommendation
- Security: Clear
- Downstream impact: None (docs-only)
- Tests: N/A
- CI: Fails (documentation build)
Summary
Adds a new documentation page explaining the energy efficiency implications of different quantization configurations, based on systematic benchmarking across multiple GPU architectures.
This PR addresses the documentation request from @TimDettmers in #1867:
What this guide covers:
- Why `LLM.int8()` may increase energy consumption by 17–33% vs FP16, and why this is a justified accuracy trade-off
- Why `threshold=0.0` is not recommended: perplexity data showing +25% degradation vs only −3% energy savings

Key data points
Key takeaway: The default `threshold=6.0` does an excellent job preserving accuracy (+0.33% PPL). The energy overhead is the justified cost of mixed-precision decomposition.
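For context on where this knob lives: in the Transformers integration, the LLM.int8() outlier threshold is set through `BitsAndBytesConfig`. A minimal config sketch; the model name is only an example (not from the PR), and actually loading it requires `transformers`, `bitsandbytes`, and a CUDA GPU:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# threshold=6.0 is the library default: outlier features above the
# threshold stay in FP16 (the mixed-precision decomposition whose
# energy overhead the guide measures).
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# Setting llm_int8_threshold=0.0 routes everything through INT8; per the
# review, this is a documented tradeoff (+25% PPL on Yi-1.5-6B for ~3%
# energy savings), not a recommended setting.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative model choice
    quantization_config=config,
)
```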
Methodology
File changes
Related
Notes
- `.mdx` format follows existing documentation style in `docs/source/explanations/`