[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization #17926
Merged: vllm-bot merged 33 commits into vllm-project:main from coreweave:sangstar/tensorizer-lora-fix on May 23, 2025.
Commits (33)
All 33 commits are authored by sangstar:

e6ddd57  feat: Add LoRA adapter support for Tensorizer
141ae09  fix: Add partial support for vLLM V1 with Tensorizer
8e1218c  fix: Update LoRA support from upstream changes
73f6957  chore: Rm experimental changes
6c94eed  fix: Elide type checking issue
ae31dfc  fix: Update snippet, fix tests, fix default `lora_dir`, `tensorizer_u…
35ce971  fix: Fix `TensorizerConfig` undefined issue
be68b5b  fix: Resolve imports for annotations, pre-commit
02ebfc7  fix: Don't run V1 tests for serialization
8451f2c  chore: Run pre-commit
5728d9d  fix: Enforce `tensorizer_uri` as only string-typed for linter
acd3bd2  fix: Use assertion to ensure `mypy` is satisfied with `tensorizer_uri`
125bafe  chore: Switch to `Union` annotation for Python 3.9
f0e9368  chore: Provide `reason` strings to `skipif` conditions
a46b006  chore: Temporarily set `"VLLM_USE_V1" = "0"` in Tensorizer fixture
7bd109d  feat: Implement initial changes to support V1
3195ea8  fix: Pass V1 tests, pass `TensorizerConfig` as dict to `LoRARequest`
b5e3af9  fix: Allow different `adapter_model` formats
a2f21ae  fix: Don't use percent format
a27ae64  fix: Fix `torchdynamo` issue with Tensorizer loading
6e66fd9  chore: Update `tensorize_vllm_model.py` docstring
04d04ef  fix: Use context-manager that implements `no_init_or_tensor` traceabl…
f448ffa  fix: Move test to LoRA tests in dedicated file, use smaller model for…
3d01bd5  tests: Rm redundant test
4ff308b  tests: Remove V0 constraint, clean up `tests/tensorizer_loader`
c168207  fix: Resolve next round of review comments
17bbd08  chore: Fix linter error
6e8b97b  Update vllm/lora/peft_helper.py
8b1bb8e  fix: Resolve next round of review comments
75d5681  fix: Get correct absolute path to example script
b649648  fix: Use `enforce_eager=True` for test
4e0c8c2  tests: Use tp=2 for LoRA tensorizer test
b99da09  tests: Add `@multi_gpu_test` decorator for tp=2 test
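
Taken together, these commits add a serialization path for LoRA adapters alongside model weights. Below is a minimal sketch of that flow, using the entry points exercised by the new test file further down; the model and adapter names are the ones the test uses, and the output paths are illustrative placeholders:

from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_lora_adapter, tensorize_vllm_model)

# Placeholder output locations: the model tensors are written to
# tensorizer_uri, and the serialized LoRA adapter under lora_dir.
config = TensorizerConfig(tensorizer_uri="/tmp/model.tensors",
                          lora_dir="/tmp")

# Serialize the LoRA adapter (here a Hugging Face Hub ID, as in the
# test), then the base model it applies to.
tensorize_lora_adapter("davzoku/finqa_adapter_1b", config)
tensorize_vllm_model(EngineArgs(model="unsloth/llama-3.2-1b-Instruct",
                                device="cuda"), config)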
New test file added (97 lines):
# SPDX-License-Identifier: Apache-2.0
import gc
import json
import tempfile

import openai
import pytest
import pytest_asyncio
import torch.cuda

from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_lora_adapter, tensorize_vllm_model)

from ...utils import RemoteOpenAIServer

MODEL_NAME = "unsloth/llama-3.2-1b-Instruct"
LORA_PATH = "davzoku/finqa_adapter_1b"


def _cleanup():
    gc.collect()
    torch.cuda.empty_cache()


@pytest.fixture(autouse=True)
def cleanup():
    _cleanup()


@pytest.fixture(scope="module")
def tmp_dir():
    with tempfile.TemporaryDirectory() as path:
        yield path


@pytest.fixture(scope="module")
def model_uri(tmp_dir):
    yield f"{tmp_dir}/model.tensors"


@pytest.fixture(scope="module")
def tensorize_model_and_lora(tmp_dir, model_uri):
    tensorizer_config = TensorizerConfig(tensorizer_uri=model_uri,
                                         lora_dir=tmp_dir)
    args = EngineArgs(model=MODEL_NAME, device="cuda")

    tensorize_lora_adapter(LORA_PATH, tensorizer_config)
    tensorize_vllm_model(args, tensorizer_config)

    # Invoke _cleanup() manually here, as the autouse cleanup()
    # fixture is not guaranteed to run after this module-scoped
    # fixture when it is used by a test.
    _cleanup()
    yield


@pytest.fixture(scope="module")
def server(model_uri, tensorize_model_and_lora):
    model_loader_extra_config = {
        "tensorizer_uri": model_uri,
    }

    # Start the OpenAI-compatible API server, loading the serialized
    # model via Tensorizer with LoRA enabled.
    args = [
        "--load-format", "tensorizer", "--device", "cuda",
        "--model-loader-extra-config",
        json.dumps(model_loader_extra_config), "--enable-lora"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest_asyncio.fixture
async def client(server):
    async with server.get_async_client() as async_client:
        yield async_client


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_single_completion(client: openai.AsyncOpenAI, model_name: str):
    _cleanup()
    completion = await client.completions.create(model=model_name,
                                                 prompt="Hello, my name is",
                                                 max_tokens=5,
                                                 temperature=0.0)

    assert completion.id is not None
    assert completion.choices is not None and len(completion.choices) == 1
    assert completion.model == MODEL_NAME
    assert len(completion.choices[0].text) >= 5
    assert completion.choices[0].finish_reason == "length"
    assert completion.usage == openai.types.CompletionUsage(
        completion_tokens=5, prompt_tokens=6, total_tokens=11)
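
Outside the test harness, the equivalent setup is to launch the server with the same flags the `server` fixture passes and query it with the stock OpenAI client. A sketch, assuming the standard `vllm serve` entry point; the endpoint, API key, and tensorizer_uri path are illustrative placeholders:

# Assuming a server started along these lines:
#   vllm serve unsloth/llama-3.2-1b-Instruct --load-format tensorizer \
#       --device cuda --model-loader-extra-config \
#       '{"tensorizer_uri": "/tmp/model.tensors"}' --enable-lora
import asyncio

import openai


async def main():
    # Placeholder endpoint and key; vllm serve listens on port 8000
    # by default.
    client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1",
                                api_key="EMPTY")
    completion = await client.completions.create(
        model="unsloth/llama-3.2-1b-Instruct",
        prompt="Hello, my name is",
        max_tokens=5,
        temperature=0.0)
    print(completion.choices[0].text)


asyncio.run(main())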