Updated LLM guide (#23341)
AlexKoff88 authored Mar 8, 2024
1 parent 06433f8 commit e77238b
Showing 1 changed file with 20 additions and 14 deletions.
@@ -81,23 +81,19 @@ In this case, you can load the converted model in OpenVINO representation direct
model = OVModelForCausalLM.from_pretrained(model_id)
By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in OpenVINO native API:

.. code-block:: python

    model.to("GPU")
The Optimum-Intel API also provides out-of-the-box model optimization through weight compression
using NNCF, which substantially reduces the model footprint and inference latency:

.. code-block:: python

    model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)

    # or if model was already converted
    model = OVModelForCausalLM.from_pretrained(model_path, load_in_8bit=True)

    # save model after optimization
    model.save_pretrained(optimized_model_path)

Weight compression is applied by default to models larger than one billion parameters and is
also available through the CLI interface as the ``--int8`` option.
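
For explicit control over this default behavior, the ``load_in_8bit`` flag can be set when
loading the model; a minimal sketch, assuming a model larger than one billion parameters
referenced by ``model_id``:

.. code-block:: python

    # opt out of the default 8-bit weight compression applied to large models
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
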
@@ -121,6 +117,15 @@ compression with ``OVWeightQuantizationConfig`` class to control weight quantiza
        quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
    )

    # or if model was already converted
    model = OVModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
    )

    # save model after optimization
    model.save_pretrained(optimized_model_path)

The optimized model can be saved as usual with a call to ``save_pretrained()``.
For more details on compression options, refer to the :doc:`weight compression guide <weight_compression>`.
@@ -168,13 +173,14 @@ an inference pipeline. This setup allows for easy text processing and model inte
Converting LLMs to OpenVINO IR on the fly every time is a resource-intensive task.
It is good practice to convert the model once, save it to a folder, and load it from there for inference.

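A minimal sketch of this convert-once, load-many pattern, assuming a hypothetical
Hugging Face model ID (``gpt2``) and a local output folder ``ov_model_dir``:

.. code-block:: python

    from optimum.intel import OVModelForCausalLM

    # convert once: export the model to OpenVINO IR and save it locally
    model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
    model.save_pretrained("ov_model_dir")

    # later runs: load the already converted model without re-exporting
    model = OVModelForCausalLM.from_pretrained("ov_model_dir")
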
By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in OpenVINO native API:

.. code-block:: python

    # select the device when the model is loaded ...
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU")

    # ... or move an already loaded model to a different device
    model.to("GPU")

Enabling OpenVINO Runtime Optimizations
############################################################
