Update NPU GenAI guide #27788
base: releases/2024/5
Conversation
- fix optimum-cli typo
- update optimum install instructions
- clarify model caching
- other small updates
@@ -44,7 +43,7 @@ You select one of the methods by setting the ``--group-size`` parameter to eithe
 .. code-block:: console
    :name: group-quant

-   optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group_size 128 TinyLlama-1.1B-Chat-v1.0
+   optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
@TolyaTalamanov 👀👀👀
-pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
-pip install openvino==2024.5 openvino-tokenizers==2024.5 openvino-genai==2024.5
+pip install --upgrade --upgrade-strategy eager optimum[openvino] openvino-genai>=2024.5
Unfortunately, models converted without the aforementioned components pinned to these specific versions are not guaranteed to work.
A downside of these pinnings is that the docs usually do not get updated. With optimum-intel 1.19.0 some models that are supported on NPU cannot be exported. Optimum Intel 1.19 does not officially support OpenVINO 2024.5, and it only supports transformers up to 4.44. We can never completely guarantee that commands don't break, but in other documentation we do not pin versions, because newer versions usually have more upsides (fixes) than downsides. Optimum-intel has a bunch of dependencies, and as time goes by the chances increase that with older versions one of these dependencies breaks (a dependency of a dependency not supporting Python 3.12, for example), and there can be security issues with older versions.
What about keeping the instructions more general, but also mentioning that models are verified with these specific versions?
Since it's targeted to the 2024/5 branch it should be openvino-genai==2024.5, shouldn't it? I assume we still need to specify nncf==2.12 (CC: @dmatveev).
P.S. I'd prefer not to change this part at all...
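For illustration, a rough sketch of the compromise suggested above — keep the install general, but record the versions the models were verified with. The version numbers are simply the ones quoted in this thread, not a new recommendation (quotes added to guard against shell globbing/redirection):

# General install, picking up the latest compatible versions:
pip install --upgrade --upgrade-strategy eager "optimum[openvino]" "openvino-genai>=2024.5"

# Note for readers: the models in this guide were verified with
#   nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
#   openvino==2024.5 openvino-tokenizers==2024.5 openvino-genai==2024.5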
-1. Update NNCF: ``pip install nncf==2.13``
+1. Update NNCF: ``pip install --upgrade nncf``
We have exact component versions here for a reason
Is the idea that users must absolutely install NNCF 2.12 for everything (as mentioned in the prerequisites), but then switch to 2.13 for channel-wise quantization? And then switch back to 2.12 if they want to use group quantization? If the reason for these specific versions is that they were tested at some point, then see my comment above.
--scale_estimation has been added in the 2.13 release. So yes, the general recommendation was to use 2.12 and upgrade to 2.13 when scale_estimation is needed.
Also wouldn't change this part...
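To make that sequence concrete, a hedged sketch of the scale-estimation path. It assumes optimum-cli exposes --scale-estimation and --dataset options and that NNCF 2.13+ is installed; the 7B model ID and output directory are placeholders, so treat it as illustrative rather than guide text:

# Scale estimation was added in NNCF 2.13, so upgrade first:
pip install --upgrade nncf

# Channel-wise (--group-size -1) symmetric INT4 export with scale estimation over a calibration dataset:
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 --scale-estimation --dataset wikitext2 Llama-2-7b-chat-hf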
@@ -27,7 +26,7 @@ such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.
 Export an LLM model via Hugging Face Optimum-Intel
 ##################################################

-Since **symmetrically-quantized 4-bit (INT4) models are preffered for inference on NPU**, make
+Since **symmetrically-quantized 4-bit (INT4) models are supported for inference on NPU**, make
 sure to export the model with the proper conversion and optimization settings.

 | You may export LLMs via Optimum-Intel, using one of two compression methods:
So let's please rework this text as:
**group quantization** - recommended for smaller models (<4B parameters)
**channel-wise quantization** - recommended for larger models (>4B parameters)
Thank you, updated.
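For reference, a sketch of how the two recommendations could look side by side. The command shape is taken from the group-quant example in this diff, channel-wise quantization uses --group-size -1, and the 7B model ID is only a placeholder:

# Smaller models (<4B parameters): group quantization
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0

# Larger models (>4B parameters): channel-wise quantization
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf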
Assigned this PR to @TolyaTalamanov now as he's back.
@helena-intel could you add this, please?
For the optimum install I chose to use --upgrade even though that is not necessary in a clean env, but many people will not create a clean env, and then using --upgrade-strategy eager prevents issues.
I renamed "preferred" to "supported" for symmetric mode, because "preferred" may give the impression that asym is merely not ideal rather than a hard requirement, while many models don't work at all with asym mode.
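To put the sym/asym point in terms of flags (a sketch, not guide text): to my understanding, optimum-cli's INT4 weight compression defaults to asymmetric unless --sym is passed, so omitting the flag produces the variant that many models cannot run with on NPU (the output directory name here is just an example):

# Asymmetric INT4 (no --sym): the export succeeds, but many models then fail to run on NPU
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0-asym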