docs: add quantization and energy efficiency guide #1882

Open
hongping-zh wants to merge 1 commit into bitsandbytes-foundation:main from hongping-zh:docs/quantization-performance-guide

Conversation

@hongping-zh

Summary

Adds a new documentation page explaining the energy efficiency implications of different quantization configurations, based on systematic benchmarking across multiple GPU architectures.

This PR addresses the documentation request from @TimDettmers in #1867:

"A documentation PR adding guidance on when quantization may not improve energy efficiency would be welcome."

What this guide covers

  1. INT8 mixed-precision decomposition overhead: Why default LLM.int8() may increase energy consumption by 17–33% vs FP16, and why this is a justified accuracy trade-off
  2. Why threshold=0.0 is not recommended: Perplexity data showing +25% degradation vs only −3% energy savings
  3. NF4 small model overhead: Dequantization cost exceeding memory bandwidth savings for models <5B parameters
  4. Crossover point: ~5B parameters, validated across RTX 5090 and RTX 4090D
  5. Batch size impact: 84–96% energy reduction from BS=1 to BS=8–64
  6. Configuration guidelines: Recommendations organized by priority (memory/accuracy/energy) and model size
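
The mixed-precision decomposition behind items 1–2 can be sketched in a few lines. This is a toy numpy model of the idea, not the actual bitsandbytes CUDA kernels; the function name, shapes, and scaling scheme are illustrative only:

```python
import numpy as np

def int8_decomposed_matmul(x, w, threshold=6.0):
    """Toy LLM.int8()-style matmul (x @ w): outliers in float, rest in INT8.

    Feature columns of `x` whose max |value| exceeds `threshold` bypass
    quantization; the remaining features use absmax INT8 quantization
    (per-row scales for x, per-column scales for w). threshold <= 0
    mimics bitsandbytes' threshold=0.0, i.e. no outlier path at all.
    """
    if threshold > 0:
        outlier = np.abs(x).max(axis=0) > threshold
    else:
        outlier = np.zeros(x.shape[1], dtype=bool)

    y = np.zeros((x.shape[0], w.shape[1]))
    if outlier.any():
        # Full-precision path for outlier features: this extra pass is the
        # source of the energy overhead the guide measures.
        y += x[:, outlier] @ w[outlier, :]

    rest = ~outlier
    if rest.any():
        xs = np.abs(x[:, rest]).max(axis=1, keepdims=True) / 127.0
        ws = np.abs(w[rest, :]).max(axis=0, keepdims=True) / 127.0
        xs[xs == 0] = 1.0
        ws[ws == 0] = 1.0
        xq = np.rint(x[:, rest] / xs).astype(np.int8)
        wq = np.rint(w[rest, :] / ws).astype(np.int8)
        # INT8 matmul accumulated in int32, then dequantized.
        y += (xq.astype(np.int32) @ wq.astype(np.int32)) * xs * ws
    return y
```

With threshold=0.0 an outlier column is forced through the shared absmax row scale, crushing the precision of every other feature in that row; that is a toy version of the mechanism behind the +25% perplexity jump in the data below.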

Key data points

| Configuration | Perplexity (Yi-1.5-6B) | PPL Δ vs FP16 | Energy Δ vs FP16 |
|---|---|---|---|
| FP16 (baseline) | 11.16 | – | – |
| INT8 Default (threshold=6.0) | 11.20 | +0.33% | +32.7% |
| INT8 Pure (threshold=0.0) | 14.00 | +25.38% | −3.1% |

Key takeaway: The default threshold=6.0 does an excellent job preserving accuracy (+0.33% PPL). The energy overhead is the justified cost of mixed-precision decomposition.
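
For context, the two configurations in the table correspond to the `llm_int8_threshold` knob on `BitsAndBytesConfig`. A configuration sketch (the model id is taken from the benchmark list and is a placeholder; this is not runnable without a GPU and the model weights):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Default mixed-precision decomposition: outlier features (|x| > 6.0)
# are computed in higher precision.
cfg_default = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# threshold=0.0 disables the outlier path (pure INT8): slightly cheaper,
# but at a severe perplexity cost per the table above.
cfg_pure = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-1.5-6B",            # placeholder: one of the benchmarked models
    quantization_config=cfg_default,
    device_map="auto",
)
```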

Methodology

  • NVML power monitoring at 10 Hz, n=10 per configuration, CV < 3%
  • Hardware: RTX 5090, RTX 4090D, A800
  • Models: Yi-1.5-6B, Mistral-7B, Phi-3-mini, Qwen2.5-7B, TinyLlama-1.1B, Qwen2-1.5B, Qwen2.5-3B, Qwen2-7B
  • Perplexity: WikiText-2 test split
  • Full data: https://github.com/hongping-zh/ecocompute-ai
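
For anyone reproducing the methodology, turning fixed-rate NVML power samples into an energy figure is straightforward. A minimal sketch (in practice the samples would come from pynvml's `nvmlDeviceGetPowerUsage`, which reports milliwatts, polled at the 10 Hz rate above):

```python
def energy_from_power_samples(samples_mw, interval_s=0.1):
    """Mean power (W) and total energy (J) from fixed-interval power samples.

    samples_mw: power readings in milliwatts (NVML's native unit), taken
    every `interval_s` seconds (0.1 s matches the 10 Hz sampling rate).
    """
    watts = [mw / 1000.0 for mw in samples_mw]
    mean_w = sum(watts) / len(watts)
    joules = mean_w * interval_s * len(watts)  # rectangle-rule integration
    return mean_w, joules

# 50 samples at a steady 250 W over 5 s correspond to 1250 J.
```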

File changes

  • Added: docs/source/explanations/quantization_performance.mdx

Related

  • #1867 (documentation request from @TimDettmers)

Notes

  • This guide validates the current default configuration rather than suggesting changes
  • The .mdx format follows existing documentation style in docs/source/explanations/
  • Happy to adjust scope, framing, or placement based on maintainer feedback

This PR adds a comprehensive energy efficiency guide for INT8 quantization, detailing its impact on energy consumption and providing recommendations for optimization based on recent benchmarking results.
@matthewdouglas matthewdouglas added the Documentation Improvements or additions to documentation label Feb 24, 2026
Comment on lines +1 to +32
# bitsandbytes Documentation PR Draft

## PR Title
Add Energy Efficiency Guide for INT8 Quantization

## PR Description

### Summary
This PR adds a comprehensive energy efficiency guide to help users understand and optimize the energy consumption of INT8 quantization.

### Motivation
Recent benchmarking on consumer GPUs (RTX 4090D, RTX 5090) revealed that **default LLM.int8() configuration can increase energy consumption by 17-33%** compared to FP16, contrary to common assumptions. This guide helps users:

1. Understand the energy implications of different INT8 configurations
2. Choose appropriate settings for their use cases
3. Avoid unintended energy waste in production deployments

### Changes
- Added `docs/source/guides/energy_efficiency.md`
- Added energy efficiency section to main documentation index
- Included benchmark results and recommendations

### References
- Benchmark repository: https://github.com/hongping-zh/ecocompute-ai
- Interactive dashboard: https://hongping-zh.github.io/ecocompute-dynamic-eval/
- Full research paper: (arXiv link pending)

---

## File: `docs/source/guides/energy_efficiency.md`

```markdown

It doesn't seem like any of this content was meant to be included in the actual doc files.

@matthewdouglas
Member

I think some of this information might actually fit in the transformers docs as well, e.g. this section here:
https://huggingface.co/docs/transformers/main/en/quantization#outlier-threshold
cc @SunMarc wdyt about including something in those docs?

To me, as it is right now, it seems a bit verbose though, so it would be better off as a small note with a concise explanation of the tradeoffs.

With that said, maybe it also fits in our FAQ page.

I'm also curious if you can share that PPL benchmark for other models you mentioned, or some that are even a bit larger in the 9B - 40B range for dense LLMs.

@TimDettmers may have some feedback here as well!


@matthewdouglas matthewdouglas left a comment


PR Review: #1882 — docs: add quantization and energy efficiency guide

Adds a documentation page on energy efficiency implications of quantization, based on the contributor's benchmarking from issue #1867. The data and issue discussion are genuine and the topic is worth documenting. Several blocking issues need to be resolved first.

Blocking issues (4):

1. File content is a PR draft, not documentation

As noted in the inline review comment, the committed file contains PR metadata ("# bitsandbytes Documentation PR Draft", "## PR Title", "## PR Description") rather than actual documentation content. The real content is embedded inside a fenced ```markdown block within the file. The author appears to have accidentally committed their drafting notes rather than the documentation itself. This also explains the CI documentation build failure.

2. Wrong path and missing _toctree.yml entry

The file was committed to docs/source/quantization_performance.mdx instead of docs/source/explanations/quantization_performance.mdx (as the PR description states it should be). More importantly, there is no addition to docs/source/_toctree.yml, so the page wouldn't appear in navigation regardless of path. These two issues together account for the CI failure.

3. Scope, placement, and verbosity need resolution before this can land

As noted in the PR comments, the content as written is too verbose for a standalone explanations page, and the right home for it isn't settled: it could be a concise note in the existing Transformers quantization docs (cc'd @SunMarc), a trimmed entry in the bitsandbytes FAQ, or a shorter explanations page. The author should align with maintainer preference on placement before investing in a full rewrite, since the required edits differ significantly by target format.

4. threshold=0.0 recommendation contradicts maintainer guidance, and the PPL dataset is too narrow

The guide presents threshold=0.0 as a "For Energy-Critical Deployments" recommendation. This contradicts Tim Dettmers' explicit comment in #1867 ("threshold=0.0 isn't a recommended setting for quality-sensitive workloads"). The contributor's own data shows +25.38% PPL degradation on Yi-1.5-6B, which is a severe accuracy cost for a −3.1% energy saving. Any guidance on threshold tuning should be framed as "understand the tradeoff and validate per-model," not as a prescriptive recommendation.

Additionally, as raised in the PR comments, the perplexity data is limited to a 6B model. Data from larger dense models (9B–40B range) would be needed before the guide could make reliable generalization claims about model-size crossover points and threshold behavior.

Minor issues (fix alongside the above):

  • The monitoring code block mixes shell commands (nvidia-smi dmon -s u) into a Python block
  • The "Expected improvements" percentages in the threshold=0.0 box are drawn from A800 batch experiments, not the consumer GPU single-batch data shown in the main table — the different hardware/batch context should be labeled or the numbers reconciled
  • The BibTeX citation block asking readers to cite the contributor's personal benchmark repo is not appropriate for official project documentation; attribution via links is sufficient
  • The A100/H100 section is speculative ("Further validation needed") and shouldn't be included until data exists
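
On the first minor issue, one way to fix the mixed block is to keep `nvidia-smi dmon -s u` in its own shell fence and do the in-process sampling purely in Python. A sketch of the sampling loop (the injected `read_mw` callable stands in for pynvml on machines without a GPU; with one, it would wrap `pynvml.nvmlDeviceGetPowerUsage`):

```python
import time

def sample_power(read_mw, hz=10, duration_s=1.0):
    """Poll a power reader at `hz` Hz and return the samples in milliwatts.

    `read_mw` is any zero-argument callable returning milliwatts. With a
    GPU present it could be, e.g.:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        read_mw = lambda: pynvml.nvmlDeviceGetPowerUsage(handle)
    """
    samples, interval = [], 1.0 / hz
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append(read_mw())
        time.sleep(interval)
    return samples
```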

Suggested path forward:

  1. Resolve the placement question (FAQ entry vs. Transformers docs note vs. short explanations page) with maintainer input
  2. Once placement is agreed, write the actual .mdx content directly — trimmed to match the target format
  3. Add the _toctree.yml entry if it's staying in the bitsandbytes docs
  4. Expand PPL data to larger models before making threshold guidance claims
  5. Reframe threshold=0.0 as a documented tradeoff rather than a deployment recommendation

Review checklist:

  • Security: Clear
  • Downstream impact: None (docs-only)
  • Tests: N/A
  • CI: Fails (documentation build)
