diff --git a/docs/source/index.rst b/docs/source/index.rst
index 3e2331907f0f2..9b53a643b8d46 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -82,6 +82,7 @@ Documentation
    models/supported_models
    models/adding_model
    models/engine_args
+   models/lora
 
 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/models/lora.rst b/docs/source/models/lora.rst
new file mode 100644
index 0000000000000..b773edfc6ff2b
--- /dev/null
+++ b/docs/source/models/lora.rst
@@ -0,0 +1,91 @@
+.. _lora:
+
+Using LoRA adapters
+===================
+
+This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
+Adapters can be served efficiently on a per-request basis with minimal overhead. First, we download the adapter(s) and
+save them locally:
+
+.. code-block:: python
+
+    from huggingface_hub import snapshot_download
+
+    sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
+
+
+Then we instantiate the base model and pass in the ``enable_lora=True`` flag:
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+    from vllm.lora.request import LoRARequest
+
+    llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
+
+
+We can now submit the prompts and call ``llm.generate`` with the ``lora_request`` parameter. The first parameter
+of ``LoRARequest`` is a human-readable name, the second parameter is a globally unique ID for the adapter, and
+the third parameter is the path to the LoRA adapter.
+
+.. code-block:: python
+
+    sampling_params = SamplingParams(
+        temperature=0,
+        max_tokens=256,
+        stop=["[/assistant]"]
+    )
+
+    prompts = [
+        "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
+        "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
+    ]
+
+    outputs = llm.generate(
+        prompts,
+        sampling_params,
+        lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
+    )
+
+
+Check out `examples/multilora_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py>`_
+for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
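+
+If you want to pick the adapter per request yourself, the async engine accepts a ``lora_request`` in the same way.
+The following is a minimal sketch, assuming the ``AsyncLLMEngine`` interface available at the time of writing; the
+``generate`` helper, the request ID, and the ``max_loras``/``max_lora_rank``/``max_cpu_loras`` values are
+illustrative and should be adapted to your deployment:
+
+.. code-block:: python
+
+    import asyncio
+
+    from vllm import SamplingParams
+    from vllm.engine.arg_utils import AsyncEngineArgs
+    from vllm.engine.async_llm_engine import AsyncLLMEngine
+    from vllm.lora.request import LoRARequest
+
+    engine_args = AsyncEngineArgs(
+        model="meta-llama/Llama-2-7b-hf",
+        enable_lora=True,
+        max_loras=1,        # adapters resident on the GPU in a single batch
+        max_lora_rank=8,    # must be >= the rank of every adapter you serve
+        max_cpu_loras=2,    # size of the CPU-side adapter cache
+    )
+    engine = AsyncLLMEngine.from_engine_args(engine_args)
+
+    async def generate(prompt: str, request_id: str,
+                       lora_request: LoRARequest) -> str:
+        sampling_params = SamplingParams(temperature=0, max_tokens=256)
+        final_output = None
+        # The engine streams incremental RequestOutput objects; keep the last one.
+        async for output in engine.generate(prompt, sampling_params, request_id,
+                                            lora_request=lora_request):
+            final_output = output
+        return final_output.outputs[0].text
+
+    text = asyncio.run(generate(
+        prompts[0], "req-0", LoRARequest("sql_adapter", 1, sql_lora_path)))
+
+Requests that pass the same adapter ID share a single cached copy of the adapter weights, so reuse the ID rather
+than allocating a new one per request.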