
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization #17926

Merged
merged 33 commits on May 23, 2025
Commits
e6ddd57
feat: Add LoRA adapter support for Tensorizer
sangstar Feb 20, 2025
141ae09
fix: Add partial support for vLLM V1 with Tensorizer
sangstar May 9, 2025
8e1218c
fix: Update LoRA support from upstream changes
sangstar May 9, 2025
73f6957
chore: Rm experimental changes
sangstar May 9, 2025
6c94eed
fix: Elide type checking issue
sangstar May 10, 2025
ae31dfc
fix: Update snippet, fix tests, fix default `lora_dir`, `tensorizer_u…
sangstar May 10, 2025
35ce971
fix: Fix `TensorizerConfig` undefined issue
sangstar May 12, 2025
be68b5b
fix: Resolve imports for annotations, pre-commit
sangstar May 12, 2025
02ebfc7
fix: Don't run V1 tests for serialization
sangstar May 12, 2025
8451f2c
chore: Run pre-commit
sangstar May 12, 2025
5728d9d
fix: Enforce `tensorizer_uri` as only string-typed for linter
sangstar May 12, 2025
acd3bd2
fix: Use assertion to ensure `mypy` is satisfied with `tensorizer_uri`
sangstar May 12, 2025
125bafe
chore: Switch to `Union` annotation for Python 3.9
sangstar May 12, 2025
f0e9368
chore: Provide `reason` strings to `skipif` conditions
sangstar May 12, 2025
a46b006
chore: Temporarily set `"VLLM_USE_V1" = "0"` in Tensorizer fixture
sangstar May 13, 2025
7bd109d
feat: Implement initial changes to support V1
sangstar May 13, 2025
3195ea8
fix: Pass V1 tests, pass `TensorizerConfig` as dict to `LoRARequest`
sangstar May 14, 2025
b5e3af9
fix: Allow different `adapter_model` formats
sangstar May 14, 2025
a2f21ae
fix: Don't use percent format
sangstar May 14, 2025
a27ae64
fix: Fix `torchdynamo` issue with Tensorizer loading
sangstar May 14, 2025
6e66fd9
chore: Update `tensorize_vllm_model.py` docstring
sangstar May 14, 2025
04d04ef
fix: Use context-manager that implements `no_init_or_tensor` traceabl…
sangstar May 15, 2025
f448ffa
fix: Move test to LoRA tests in dedicated file, use smaller model for…
sangstar May 15, 2025
3d01bd5
tests: Rm redundant test
sangstar May 15, 2025
4ff308b
tests: Remove V0 constraint, clean up `tests/tensorizer_loader`
sangstar May 16, 2025
c168207
fix: Resolve next round of review comments
sangstar May 20, 2025
17bbd08
chore: Fix linter error
sangstar May 20, 2025
6e8b97b
Update vllm/lora/peft_helper.py
sangstar May 21, 2025
8b1bb8e
fix: Resolve next round of review comments
sangstar May 21, 2025
75d5681
fix: Get correct absolute path to example script
sangstar May 22, 2025
b649648
fix: Use `enforce_eager=True` for test
sangstar May 22, 2025
4e0c8c2
tests: Use tp=2 for LoRA tensorizer test
sangstar May 22, 2025
b99da09
tests: Add `@multi_gpu_test` decorator for tp=2 test
sangstar May 22, 2025
1 change: 1 addition & 0 deletions .buildkite/test-pipeline.yaml
@@ -128,6 +128,7 @@ steps:
- VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/correctness/
- pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py
- VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

- label: Distributed Tests (4 GPUs) # 10min
96 changes: 82 additions & 14 deletions examples/other/tensorize_vllm_model.py
@@ -6,11 +6,12 @@
import os
import uuid

from vllm import LLM
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (TensorizerArgs,
                                                          TensorizerConfig,
                                                          tensorize_vllm_model)
from vllm.lora.request import LoRARequest
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerArgs, TensorizerConfig, tensorize_lora_adapter,
    tensorize_vllm_model)
from vllm.utils import FlexibleArgumentParser

# yapf conflicts with isort for this docstring
@@ -27,7 +28,7 @@
To serialize a model, install vLLM from source, then run something
like this from the root level of this repository:

python -m examples.other.tensorize_vllm_model \
python examples/other/tensorize_vllm_model.py \
--model facebook/opt-125m \
serialize \
--serialized-directory s3://my-bucket \
@@ -47,7 +48,7 @@
To deserialize a model, you can run something like this from the root
level of this repository:

python -m examples.other.tensorize_vllm_model \
python examples/other/tensorize_vllm_model.py \
--model EleutherAI/gpt-j-6B \
--dtype float16 \
deserialize \
@@ -69,7 +70,7 @@

Or for deserializing:

`python -m examples.other.tensorize_vllm_model deserialize --help`.
`python examples/other/tensorize_vllm_model.py deserialize --help`.

Once a model is serialized, tensorizer can be invoked with the `LLM` class
directly to load models:
@@ -90,11 +91,27 @@
In order to see all of the available arguments usable to configure
loading with tensorizer that are given to `TensorizerConfig`, run:

`python -m examples.other.tensorize_vllm_model deserialize --help`
`python examples/other/tensorize_vllm_model.py deserialize --help`

under the `tensorizer options` section. These can also be used for
deserialization in this example script, although `--tensorizer-uri` and
`--path-to-tensors` are functionally the same in this case.
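
For reference, loading such a serialized model through the `LLM` class looks
roughly like the sketch below; the model name and tensorizer URI are
placeholders, not outputs of this script:

```
from vllm import LLM
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig

# Hypothetical URI; point this at wherever the .tensors file was written.
tensorizer_config = TensorizerConfig(
    tensorizer_uri="s3://my-bucket/vllm/facebook/opt-125m/v1/model.tensors")

llm = LLM(model="facebook/opt-125m",
          load_format="tensorizer",
          model_loader_extra_config=tensorizer_config)
```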

Tensorizer can also be used to save and load LoRA adapters. A LoRA adapter
can be serialized directly from its path or HF Hub id together with a
TensorizerConfig object. In this script, passing an HF Hub id for a LoRA
adapter via `--lora-path` will serialize the LoRA adapter artifacts to
`--serialized-directory`.
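
Programmatically, that serialization step is essentially a single call to
`tensorize_lora_adapter`; a minimal sketch with placeholder paths and a
hypothetical adapter id:

```
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_lora_adapter)

# Placeholder locations; the adapter artifacts land next to the model tensors.
tensorizer_config = TensorizerConfig(tensorizer_uri="/models/model.tensors",
                                     lora_dir="/models")
tensorize_lora_adapter("some-org/some-lora-adapter", tensorizer_config)
```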

You can then use the LoRA adapter with `vllm serve` by ensuring the LoRA
artifacts are in your model artifacts directory and specifying
`--enable-lora`. For instance:

```
vllm serve <model_path> \
--load-format tensorizer \
--model-loader-extra-config '{"tensorizer_uri": "<model_path>.tensors"}' \
--enable-lora
```
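
For offline inference, the serialized adapter can instead be loaded by passing
the same `TensorizerConfig` to `LoRARequest`, mirroring what `deserialize()`
below does; a rough sketch with placeholder names and paths:

```
from vllm import LLM
from vllm.lora.request import LoRARequest
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig

# Placeholder paths; model tensors and LoRA artifacts both live in /models here.
tensorizer_config = TensorizerConfig(tensorizer_uri="/models/model.tensors",
                                     lora_dir="/models")

llm = LLM(model="unsloth/llama-3.2-1b-Instruct",
          load_format="tensorizer",
          model_loader_extra_config=tensorizer_config,
          enable_lora=True)

outputs = llm.generate(
    ["Hello, my name is"],
    lora_request=LoRARequest("my-lora", 1, "some-org/some-lora-adapter",
                             tensorizer_config=tensorizer_config))
```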
"""


@@ -107,6 +124,19 @@ def parse_args():
"also supported, although libsodium must be installed to "
"use it.")
parser = EngineArgs.add_cli_args(parser)

parser.add_argument(
"--lora-path",
type=str,
required=False,
help="Path to a LoRA adapter to "
"serialize along with model tensors. This can then be deserialized "
"along with the model by passing a tensorizer_config kwarg to "
"LoRARequest with type TensorizerConfig. See the docstring for this "
"for a usage example."

)

subparsers = parser.add_subparsers(dest='command')

serialize_parser = subparsers.add_parser(
@@ -169,11 +199,42 @@ def parse_args():


def deserialize():
    llm = LLM(model=args.model,
              load_format="tensorizer",
              tensor_parallel_size=args.tensor_parallel_size,
              model_loader_extra_config=tensorizer_config
    )
    if args.lora_path:
        tensorizer_config.lora_dir = tensorizer_config.tensorizer_dir
        llm = LLM(model=args.model,
                  load_format="tensorizer",
                  tensor_parallel_size=args.tensor_parallel_size,
                  model_loader_extra_config=tensorizer_config,
                  enable_lora=True,
                  )
        sampling_params = SamplingParams(
            temperature=0,
            max_tokens=256,
            stop=["[/assistant]"]
        )

        # Truncating this as the extra text isn't necessary
        prompts = [
            "[user] Write a SQL query to answer the question based on ..."
        ]

        # Test LoRA load
        print(
            llm.generate(
                prompts,
                sampling_params,
                lora_request=LoRARequest("sql-lora",
                                         1,
                                         args.lora_path,
                                         tensorizer_config=tensorizer_config)
            )
        )
    else:
        llm = LLM(model=args.model,
                  load_format="tensorizer",
                  tensor_parallel_size=args.tensor_parallel_size,
                  model_loader_extra_config=tensorizer_config
        )
    return llm


@@ -197,7 +258,10 @@ def deserialize():

model_name = model_ref.split("/")[1]

keyfile = args.keyfile if args.keyfile else None
if args.command == "serialize" or args.command == "deserialize":
    keyfile = args.keyfile
else:
    keyfile = None

if args.model_loader_extra_config:
    config = json.loads(args.model_loader_extra_config)
@@ -228,6 +292,10 @@ def deserialize():
        encryption_keyfile=keyfile,
        **credentials)

    if args.lora_path:
        tensorizer_config.lora_dir = tensorizer_config.tensorizer_dir
        tensorize_lora_adapter(args.lora_path, tensorizer_config)

    tensorize_vllm_model(engine_args, tensorizer_config)

elif args.command == "deserialize":
97 changes: 97 additions & 0 deletions tests/entrypoints/openai/test_tensorizer_entrypoint.py
@@ -0,0 +1,97 @@
# SPDX-License-Identifier: Apache-2.0
import gc
import json
import tempfile

import openai
import pytest
import pytest_asyncio
import torch.cuda

from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_lora_adapter, tensorize_vllm_model)

from ...utils import RemoteOpenAIServer

MODEL_NAME = "unsloth/llama-3.2-1b-Instruct"
LORA_PATH = "davzoku/finqa_adapter_1b"


def _cleanup():
    gc.collect()
    torch.cuda.empty_cache()


@pytest.fixture(autouse=True)
def cleanup():
    _cleanup()


@pytest.fixture(scope='module')
def tmp_dir():
    with tempfile.TemporaryDirectory() as path:
        yield path


@pytest.fixture(scope='module')
def model_uri(tmp_dir):
    yield f"{tmp_dir}/model.tensors"


@pytest.fixture(scope="module")
def tensorize_model_and_lora(tmp_dir, model_uri):
    tensorizer_config = TensorizerConfig(tensorizer_uri=model_uri,
                                         lora_dir=tmp_dir)
    args = EngineArgs(model=MODEL_NAME, device="cuda")

    tensorize_lora_adapter(LORA_PATH, tensorizer_config)
    tensorize_vllm_model(args, tensorizer_config)

    # Manually invoke a _cleanup() here, as the cleanup()
    # fixture won't be guaranteed to be called after this
    # when this fixture is used for a test
    _cleanup()
    yield


@pytest.fixture(scope="module")
def server(model_uri, tensorize_model_and_lora):
    model_loader_extra_config = {
        "tensorizer_uri": model_uri,
    }

    ## Start OpenAI API server
    args = [
        "--load-format", "tensorizer", "--device", "cuda",
        "--model-loader-extra-config",
        json.dumps(model_loader_extra_config), "--enable-lora"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest_asyncio.fixture
async def client(server):
    async with server.get_async_client() as async_client:
        yield async_client


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_single_completion(client: openai.AsyncOpenAI, model_name: str):
    _cleanup()
    completion = await client.completions.create(model=model_name,
                                                 prompt="Hello, my name is",
                                                 max_tokens=5,
                                                 temperature=0.0)

    assert completion.id is not None
    assert completion.choices is not None and len(completion.choices) == 1
    assert completion.model == MODEL_NAME
    assert len(completion.choices) == 1
    assert len(completion.choices[0].text) >= 5
    assert completion.choices[0].finish_reason == "length"
    assert completion.usage == openai.types.CompletionUsage(
        completion_tokens=5, prompt_tokens=6, total_tokens=11)