From a60480ad080d2687863842cc7372de588dba0bc6 Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Fri, 8 Mar 2024 11:19:34 +0400
Subject: [PATCH] Updated LLM guide (#23341)

---
 .../llm-inference-hf.rst | 34 +++++++++++--------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/docs/articles_en/learn-openvino/large-language-models/llm-inference-hf.rst b/docs/articles_en/learn-openvino/large-language-models/llm-inference-hf.rst
index ec5357559d424e..4fd207de23abe4 100644
--- a/docs/articles_en/learn-openvino/large-language-models/llm-inference-hf.rst
+++ b/docs/articles_en/learn-openvino/large-language-models/llm-inference-hf.rst
@@ -81,16 +81,6 @@ In this case, you can load the converted model in OpenVINO representation direct
 
     model = OVModelForCausalLM.from_pretrained(model_id)
 
-By default, inference will run on CPU. To select a different inference device, for example, GPU,
-add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
-the model has been loaded, use the ``.to()`` method. The device naming convention is the same
-as in OpenVINO native API:
-
-.. code-block:: python
-
-    model.to("GPU")
-
-
 Optimum-Intel API also provides out-of-the-box model optimization through weight compression
 using NNCF which substantially reduces the model footprint and inference latency:
 
@@ -98,6 +88,12 @@ using NNCF which substantially reduces the model footprint and inference latency
 
     model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
 
+    # or if model was already converted
+    model = OVModelForCausalLM.from_pretrained(model_path, load_in_8bit=True)
+
+    # save model after optimization
+    model.save_pretrained(optimized_model_path)
+
 Weight compression is applied by default to models larger than one billion parameters and is
 also available for CLI interface as the ``--int8`` option.
 
@@ -121,6 +117,15 @@ compression with ``OVWeightQuantizationConfig`` class to control weight quantiza
         quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
     )
 
+    # or if model was already converted
+    model = OVModelForCausalLM.from_pretrained(
+        model_path,
+        quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
+    )
+
+    # save model after optimization
+    model.save_pretrained(optimized_model_path)
+
 The optimized model can be saved as usual with a call to ``save_pretrained()``.
 For more details on compression options, refer to the :doc:`weight compression guide `.
 
@@ -168,13 +173,14 @@ an inference pipeline. This setup allows for easy text processing and model inte
 Converting LLMs on the fly every time to OpenVINO IR is a resource intensive task.
 It is a good practice to convert the model once, save it in a folder and load it for inference.
 
-By default, inference will run on CPU. To switch to a different device, the ``device`` attribute
-from the ``from_pretrained`` function can be used. The device naming convention is the
-same as in OpenVINO native API:
+By default, inference will run on CPU. To select a different inference device, for example, GPU,
+add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
+the model has been loaded, use the ``.to()`` method. The device naming convention is the same
+as in OpenVINO native API:
 
 .. code-block:: python
 
-    model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU")
+    model.to("GPU")
 
 Enabling OpenVINO Runtime Optimizations
 ############################################################
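
For anyone trying the documented flow end to end, below is a minimal sketch that chains the calls shown in this patch: export with 8-bit weight compression, save, reload, and switch the inference device. The checkpoint name, local output folder, and prompt are placeholders chosen for illustration, not values from the patch, and the device switch assumes an OpenVINO GPU plugin is available.

.. code-block:: python

    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
    optimized_model_path = "tinyllama-ov-int8"       # placeholder output folder

    # Convert the Hugging Face model to OpenVINO IR with 8-bit weight compression
    # (same call as in the patch) and save it once for later reuse.
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
    model.save_pretrained(optimized_model_path)

    # Reload the converted model and move inference to GPU; device names follow the
    # OpenVINO native API, and this step requires the GPU plugin to be installed.
    model = OVModelForCausalLM.from_pretrained(optimized_model_path)
    model.to("GPU")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))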