Updated LLM guide (#23341)
AlexKoff88 authored Mar 8, 2024
1 parent 06433f8 commit e77238b
Showing 1 changed file with 20 additions and 14 deletions.
@@ -81,23 +81,19 @@ In this case, you can load the converted model in OpenVINO representation direct
model = OVModelForCausalLM.from_pretrained(model_id)
By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in OpenVINO native API:

.. code-block:: python

    model.to("GPU")
The Optimum-Intel API also provides out-of-the-box model optimization through weight compression
using NNCF, which substantially reduces the model footprint and inference latency:

.. code-block:: python

    model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)

    # or if model was already converted
    model = OVModelForCausalLM.from_pretrained(model_path, load_in_8bit=True)

    # save model after optimization
    model.save_pretrained(optimized_model_path)

Weight compression is applied by default to models larger than one billion parameters and is
also available through the CLI interface as the ``--int8`` option.
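
For explicit control over this default behavior, the ``load_in_8bit`` flag can be set when
loading the model; a minimal sketch, assuming a model larger than one billion parameters
referenced by ``model_id``:

.. code-block:: python

    # opt out of the default 8-bit weight compression applied to large models
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
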
@@ -121,6 +117,15 @@ compression with ``OVWeightQuantizationConfig`` class to control weight quantiza
        quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
    )

    # or if model was already converted
    model = OVModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
    )

    # save model after optimization
    model.save_pretrained(optimized_model_path)

The optimized model can be saved as usual with a call to ``save_pretrained()``.
For more details on compression options, refer to the :doc:`weight compression guide <weight_compression>`.
@@ -168,13 +173,14 @@ an inference pipeline. This setup allows for easy text processing and model inte
Converting LLMs to OpenVINO IR on the fly every time is a resource-intensive task.
It is good practice to convert the model once, save it to a folder, and load it from there for inference.

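A minimal sketch of this convert-once, load-many pattern, assuming a hypothetical
Hugging Face model ID (``gpt2``) and a local output folder ``ov_model_dir``:

.. code-block:: python

    from optimum.intel import OVModelForCausalLM

    # convert once: export the model to OpenVINO IR and save it locally
    model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
    model.save_pretrained("ov_model_dir")

    # later runs: load the already converted model without re-exporting
    model = OVModelForCausalLM.from_pretrained("ov_model_dir")
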
By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in OpenVINO native API:

.. code-block:: python

    # select the device when the model is loaded ...
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU")

    # ... or move an already loaded model to a different device
    model.to("GPU")

Enabling OpenVINO Runtime Optimizations
############################################################
