Add support for LoRA adapters in vLLM inference engine #562
Conversation
OPE-395: Add support for SFT/LoRA models in the vLLM inference engine.
Like this: oumi/scripts/polaris/jobs/vllm_worker.sh (line 90 at fd3ebac), but in Python: https://github.com/oumi-ai/oumi/blob/main/src/oumi/inference/vllm_inference_engine.py
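For reference, a minimal sketch of what LoRA loading looks like with the vLLM Python API; the model name, adapter name, ID, and path here are placeholders, not values from this PR:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support when constructing the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Positional args: adapter name, unique integer ID, local adapter path.
lora_request = LoRARequest("my_adapter", 1, "/path/to/lora_adapter")

# Pass the adapter per-request at generation time.
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=64),
    lora_request=lora_request,
)
```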
src/oumi/core/types/turn.py
Outdated
@@ -106,6 +106,11 @@ def is_text(self) -> bool:
         """Checks if the message contains text."""
         return self.type == Type.TEXT

+    def __repr__(self):
+        """Returns a string representation of the message."""
+        content = self.content if self.is_text() else "<non-text-content>"
Define a constant for "<non-text-content>"?
Replaced with the type of the message instead.
src/oumi/core/types/turn.py
Outdated
    def __repr__(self):
        """Returns a string representation of the message."""
        content = self.content if self.is_text() else "<non-text-content>"
        return f"{self.role.upper()}: {content}"
Let's also include `type`, `id`? Any other small fields?
The comment above modified the message to mention the type if it's not text. Added ID.
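A sketch of what the revised `__repr__` could look like after these two comments; the exact field names (`id` in particular) are guessed from the diff and replies, not taken from the merged code:

```python
def __repr__(self):
    """Returns a string representation of the message."""
    # Prepend the optional message ID if present.
    id_str = f"{self.id} - " if self.id else ""
    # For non-text messages, show the type instead of the raw content.
    content = self.content if self.is_text() else str(self.type)
    return f"{id_str}{self.role.upper()}: {content}"
```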
src/oumi/core/types/turn.py
Outdated
    def __repr__(self):
        """Returns a string representation of the message."""
        content = self.content if self.is_text() else "<non-text-content>"
        return f"{self.role.upper()}: {content}"
Why `.upper()` for role?
Makes it more readable IMO:

ASSISTANT: How are you?
USER: Good

vs.

assistant: How are you?
user: Good
src/oumi/core/types/turn.py
Outdated
    def __repr__(self):
        """Returns a string representation of the message."""
        content = self.content if self.is_text() else "<non-text-content>"
        return f"{self.role.upper()}: {content}"
Instead of building the string manually, should we use a library to do the formatting for us? For example: create a temp dict with the fields of interest, then use `json`, `pprint.pformat`, or some such to convert it to a string?
IMO the logic here is light enough that manually creating the string works fine.
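For comparison, the library-based alternative the reviewer suggested might look like the following sketch (not the approach taken; field names are assumed from the surrounding diff):

```python
import json

def __repr__(self):
    """Returns a JSON string representation of the message."""
    # Collect the fields of interest into a temp dict, then
    # let json handle quoting and escaping.
    fields = {
        "id": self.id,
        "role": str(self.role),
        "type": str(self.type),
        "content": self.content if self.is_text() else None,
    }
    return json.dumps(fields)
```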
@@ -53,6 +61,8 @@ def __init__(
             quantization=quantization,
             tensor_parallel_size=tensor_parallel_size,
             enable_prefix_caching=enable_prefix_caching,
+            enable_lora=self.lora_request is not None,
+            max_model_len=model_params.model_max_length,
Should we sanitize this value before passing it to vLLM? Something like:

max_model_len=(model_params.model_max_length if "... is not None and ... > 0" else None)
IMO all validation should happen during config initialization, and downstream code like vLLM should be able to consume the configs as-is without additional validation. Added a validation check that model_max_length is a positive int if specified.
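A minimal sketch of that config-time check; the actual `ModelParams` definition in oumi may differ, so treat the class below as hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelParams:
    model_max_length: Optional[int] = None

    def __post_init__(self):
        # Validate at config-initialization time so downstream consumers
        # (e.g. vLLM) can use the value as-is, without re-checking it.
        if self.model_max_length is not None and self.model_max_length <= 0:
            raise ValueError(
                "model_max_length must be a positive integer if specified, "
                f"got {self.model_max_length}."
            )
```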
Please add a test in tests/inference/test_vllm_inference_engine.py covering this case.
Done, and tested on GCP. Also added a test for turn.py.
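A sketch of the kind of test this could be, written against the hypothetical `ModelParams` above rather than oumi's actual test helpers:

```python
import pytest

def test_model_max_length_must_be_positive():
    # Valid: unset or positive values pass config initialization.
    assert ModelParams().model_max_length is None
    assert ModelParams(model_max_length=2048).model_max_length == 2048

    # Invalid: zero or negative values should fail at config init.
    with pytest.raises(ValueError):
        ModelParams(model_max_length=0)
    with pytest.raises(ValueError):
        ModelParams(model_max_length=-1)
```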
Fixes OPE-395.
Message and Conversation