## 🆕 What's New
[2025/10] We proposed a fast algorithm to generate mixed bits/datatypes schemes in minutes. Please refer to the documentation for accuracy [results](./docs/auto_scheme_acc.md) and [this guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for usage instructions.
[2025/09] AutoRound now includes experimental support for the mxfp4 and nvfp4 dtypes. For accuracy results, see the [documentation](./docs/mxnv_acc.md).
Support **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility.
Quantize 7B models in about 10 minutes on a single GPU. Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs)
Automatically configure in minutes, with about 2X-4X the model’s BF16 VRAM size as overhead. See the accuracy [results](./docs/auto_scheme_acc.md) and the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
✅ **10+ VLMs Support**
Out-of-the-box quantization for 10+ vision-language models. See the [example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa) and the [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix).
This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.
This setting offers the best speed (2-3X faster than AutoRound), but it may cause a significant accuracy drop for small models and 2-bit quantization. It is recommended for 4-bit settings and models larger than 3B.
AutoRound (>0.8) offers AutoScheme to generate mixed-bit recipes automatically; please refer to the [AutoScheme](#autoscheme) section for more details.
Auto-GPTQ and Auto-AWQ only support a limited set of mixed-bit configurations. If you're unsure about the details, we recommend using the AutoRound format.
vLLM and SGLang fuse MoE and QKV layers, so it's recommended not to assign different bit widths to these layers.
AutoScheme provides an automatic algorithm for generating mixed bits/data_type quantization recipes. For accuracy results, please refer to [this doc](./auto_scheme_acc.md).
We strongly recommend setting `enable_torch_compile` to True to save VRAM.
**Please note that mixed data types are supported during tuning, but cannot be exported to real models at this time.**
~~~python
ar = AutoRound(model=model_name, scheme=scheme, iters=0, nsamples=1)
ar.quantize_and_save()
~~~
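The snippet above assumes that `model_name` and a `scheme` object were prepared beforehand. As a minimal end-to-end sketch (assuming `AutoScheme` is importable from `auto_round` and that `avg_bits`/`options` are its constructor arguments, as described in the hyperparameters section below), the flow might look like:

~~~python
from auto_round import AutoRound, AutoScheme  # AutoScheme import path is an assumption

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model choice

# Search for a mixed bits/datatypes recipe averaging ~3 bits over the quantized layers.
# Option names follow the "W4A16"-style scheme strings described below.
scheme = AutoScheme(avg_bits=3, options=("W2A16", "W4A16", "W8A16"))

# enable_torch_compile=True is recommended above to save VRAM during the search.
ar = AutoRound(model=model_name, scheme=scheme, iters=0, nsamples=1,
               enable_torch_compile=True)
ar.quantize_and_save()
~~~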
#### Hyperparameters in AutoScheme
`avg_bits (float)`: Target average bits for the whole model; only layers to be quantized are counted in the average-bits calculation.
`options (Union[str, list[Union[QuantizationScheme, str]]])`: the quantization schemes to choose from. It can be a string such as "W4A16", or a list of strings or QuantizationScheme objects.
`ignore_scale_zp_bits (bool)`: Whether to ignore the bits of scale and zero point in the average-bits calculation. Default is False.
`device_map (Optional[str, dict, torch.device])`: only supported in the API for now. AutoScheme uses more VRAM than AutoRound tuning, so you can set a different device_map for it.
`shared_layers (Optional[Iterable[Iterable[str]]])`: only supported in the API for now.
In some serving frameworks, certain layers (e.g., QKV or MoE) are fused to accelerate inference. These fused layers may require the same data type and bit configuration. The shared_layers option simplifies this setup by supporting both regex and full-name matching. **Note that regex matching is applied in a block-wise manner.**
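As a rough illustration (the layer-name patterns below are assumptions for a typical LLaMA-style block, not required values), shared_layers could be set like this, with the resulting `scheme` then passed to AutoRound as in the snippet that follows:

```python
# Hedged sketch: keep fused groups (e.g., QKV projections) on the same scheme.
# Pattern names are illustrative; regex matching is applied block-wise.
scheme = AutoScheme(
    avg_bits=4.5,
    options=("W4A16", "W8A16"),
    shared_layers=[
        ("q_proj", "k_proj", "v_proj"),  # fused QKV in many serving engines
        ("gate_proj", "up_proj"),        # fused MLP projections
    ],
)
```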
```python
ar = AutoRound(model=model_name, scheme=scheme, iters=0, nsamples=1)
model, layer_config = ar.quantize()
```
Besides, if you want to fix the scheme for some layers, you can set it via `layer_config` in the AutoRound API.
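A hedged sketch of that idea (the layer names and config keys below are illustrative assumptions, not an exhaustive reference):

```python
# Pin selected layers to a fixed configuration; AutoScheme decides the rest.
# The layer names and the {"bits": ...} keys are illustrative assumptions.
layer_config = {
    "lm_head": {"bits": 16},                         # e.g., leave the head in higher precision
    "model.layers.0.self_attn.q_proj": {"bits": 8},  # e.g., force one layer to 8-bit
}
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0, nsamples=1)
model, layer_config = ar.quantize()
```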