[Frontend] [Core] feat: Add model loading using tensorizer #3476

Merged · 102 commits · Apr 14, 2024
Changes from 1 commit
dfe2f2f
feat: Support loading model tensors using `tensorizer`
sangstar Feb 1, 2024
097f297
fix: Remove unnecessary files
sangstar Feb 2, 2024
24e8657
fix(vllm-tensorizer): Allow providing S3 credentials
sangstar Feb 6, 2024
6192ff3
fix: Fix passing S3 auth vars through stream
sangstar Feb 7, 2024
fbc847b
fix: Disallowing `plaid_mode = False` and updating `tensorizer` version
sangstar Feb 13, 2024
f4d57d8
refactor: Retire use of `download_dir` as `TensorizerArgs` param
sangstar Feb 13, 2024
cf42149
fix: Remove `store_true` action for `--tensorizer-uri`
sangstar Feb 13, 2024
c1839f4
refactor: No 2x copying for `tensorizer` (WIP)
sangstar Feb 28, 2024
b28b26e
chore: Omit commandeering weight loaders for merging layers (WIP)
sangstar Feb 29, 2024
fad72a4
feat: Re-add deserializing vLLM models
sangstar Mar 1, 2024
8d421b4
chore: Harmonize CPU and GPU deserializing
sangstar Mar 1, 2024
8225c32
perf: Add `force_http=True` for faster loading speeds
sangstar Mar 5, 2024
f7c9cc7
chore: Reformat code with `format.sh`, cleanup debugging code
sangstar Mar 8, 2024
44b05ba
chore: Fix formatting, some misc. changes
sangstar Mar 11, 2024
17977b0
fix: Correct logging for loading tensorizer with cpu
sangstar Mar 11, 2024
68f2a51
chore: Implement changes from feedback
sangstar Mar 11, 2024
0c72c2c
fix: Correctly instantiate vLLM-formatted models
sangstar Mar 12, 2024
af10594
chore: Reformat and delete deprecated comment from `.ipynb`
sangstar Mar 12, 2024
550983a
perf: Allow passing of deserializer args from `TensorizerArgs`
sangstar Mar 12, 2024
6273266
style: Reformat with new formatting changes
sangstar Mar 12, 2024
f6a695b
Run yapf and ruff
sangstar Mar 12, 2024
f30f4e0
fix: Fix incorrect `TensorizerArgs` import in `config.py`
sangstar Mar 12, 2024
c539880
perf: Multiple misc. improvements from code review
sangstar Mar 14, 2024
1632381
perf: More misc. fixes to complete initial code review
sangstar Mar 15, 2024
4085cb5
fix: Remove `print(tensorizer_args)`
sangstar Mar 15, 2024
81a752a
Run yapf and ruff
sangstar Mar 15, 2024
aa8d8b4
fix: Add specific category for warnings with `PerformanceWarning`
sangstar Mar 15, 2024
d8e71df
chore: Multiple fixes from final code review
sangstar Mar 18, 2024
5132dd7
fix: Add `s3_endpoint` as attr for `TensorizerArgs`
sangstar Mar 18, 2024
7dd43f5
chore: Remove `filter_func` from CLI args, some doc fixes
sangstar Mar 18, 2024
ad68ff5
chore: Allow env var or CLI arg specification for S3 credentials
sangstar Mar 18, 2024
f965730
fix: Disallow using `force_http`
sangstar Mar 18, 2024
2605a33
chore: Remove unnecessary print statement in example script
sangstar Mar 18, 2024
71c2cb0
Run yapf and ruff
sangstar Mar 18, 2024
35b29e8
Run yapf and ruff
sangstar Mar 18, 2024
117feec
Run yapf and ruff
sangstar Mar 18, 2024
6192e9d
docs: Update `tensorizer` as a `--load-format` in `engine_args.rst`
sangstar Mar 18, 2024
407b32e
fix: Restore `tensorizer_args` as instance attr to `EngineArgs`
sangstar Mar 18, 2024
88e209d
Run yapf and ruff
sangstar Mar 18, 2024
6e23dcd
chore: Move testing out of own test folder
sangstar Mar 19, 2024
05c0bbe
fix: Add `tensorizer >= 2.8.1` to `requirements-rocm.txt` for CI
sangstar Mar 20, 2024
af11a53
fix: Add version of `tensorizer` that will pass testing suite
sangstar Mar 21, 2024
8ece4f8
chore: Add notice that `requirements-dev` dep can be removed `>2.8.1`
sangstar Mar 21, 2024
d4a46a5
fix: Resolve double `HfFileSystem` import
sangstar Mar 25, 2024
12b1f12
style: Run `isort`
sangstar Mar 25, 2024
445ab28
Run yapf and ruff
sangstar Apr 1, 2024
6c286ed
fix: Add `tensorizer` to mock imports
sangstar Apr 2, 2024
37348f9
perf: Add newest `tensorizer` version that will not init CUDA
sangstar Apr 3, 2024
82da7a5
fix: Adjust `tensorizer` version for `requirements-dev.txt`
sangstar Apr 3, 2024
310dd68
chore: Rebase and fix carrying over changes to `arg_utils` typing
sangstar Apr 3, 2024
cf56513
fix: Add `tensorizer` to `requirements-cpu.txt`
sangstar Apr 3, 2024
9c8db87
perf: Add concurrent reading to `TensorDeserializer`
sangstar Apr 3, 2024
8ca0cb1
docs: Add `num_readers` docstring
sangstar Apr 3, 2024
21bca06
chore: Replace `PerformanceWarning` after rebase
sangstar Apr 4, 2024
0c82446
Run yapf and ruff
sangstar Apr 4, 2024
06cd26d
fix: Fix model output on deserialization and add e2e output test
sangstar Apr 10, 2024
f19ee64
fix: Properly ensure test outputs are deterministic, add HF model test
sangstar Apr 10, 2024
f1f2e16
fix: Make vLLM tensorizing specification less hacky
sangstar Apr 10, 2024
71a9f79
docs: Add tensorizer link in `engine_args.rst`, docstring to example
sangstar Apr 10, 2024
9e5456a
chore: Resolve comments
sangstar Apr 10, 2024
74a8642
fix: Affirm mandatory `vllm_tensorized` argument change
sangstar Apr 10, 2024
f82b25a
perf: Allow preliminary support deserializing with LoRA adapters
sangstar Apr 10, 2024
3ec85e0
fix: Fix requirements.txt passing import tensorizer only if installed
sangstar Apr 10, 2024
dfb7a11
fix: Properly ensure import fail if tensorizer not used nor installed
sangstar Apr 10, 2024
81196ed
perf: Move test location and add testing for LoRA
sangstar Apr 10, 2024
5ecf4ee
perf: Add some testing changes, introduce `TensorizerConfig`
sangstar Apr 11, 2024
b267cbd
chore: Add `__init__.py` for `tests/tensorizer`
sangstar Apr 11, 2024
6bff0c7
tests: Fix `test_tensorizer.py` to account for new changes
sangstar Apr 11, 2024
e0b7184
tests: Remove `test_tensorizer_api_server.py`
sangstar Apr 11, 2024
3ec105d
Run yapf and ruff; fix tests
sangstar Apr 11, 2024
9d568fc
fix: Revert change to `examples/multilora_inference.py`
sangstar Apr 11, 2024
55d2a41
Merge remote-tracking branch 'upstream/main' into sangstar/integrate-…
sangstar Apr 11, 2024
65bc7bb
Merge remote-tracking branch 'upstream/main' into sangstar/integrate-…
sangstar Apr 11, 2024
b1b5653
perf: Update code to reflect change in #3977
sangstar Apr 11, 2024
5f27722
chore: Remove accidental syntax error
sangstar Apr 11, 2024
8240af9
docs: Elaborate on S3 credentialing
sangstar Apr 11, 2024
7f5eada
fix: Properly passing `tensorizer_config` to hf weight loader
sangstar Apr 12, 2024
e0d9cc7
fix: Fix, test tensorizer uri passing without tensorizer load format
sangstar Apr 12, 2024
de54538
docs: Note example script in docs for more information
sangstar Apr 12, 2024
1feab4e
chore: Run yapf and ruff, as well as doc edits
sangstar Apr 12, 2024
a9b0241
fix: Fix `initialize_model_parallel` import
sangstar Apr 12, 2024
a297a62
tests: Add test for `examples/tensorize_vllm_model.py`
sangstar Apr 12, 2024
2d07568
tests: Fix lora test
sangstar Apr 12, 2024
1bddfe6
Run yapf and ruff
sangstar Apr 12, 2024
852f0ad
fix: Move `tensorize_loader` imports to pass CPU test
sangstar Apr 12, 2024
aef7442
refactor: Pass `TensorizerArgs` direct to `EngineArgs.add_cli_args`
sangstar Apr 12, 2024
ff0a528
tests: Add api_server test using tensorizer
sangstar Apr 12, 2024
4551b84
fix: Add `tensorizer_config` to `RayGPUExecutor`
sangstar Apr 12, 2024
d51b0bc
tests: Formatting and add test to ensure `tensorizer` load format
sangstar Apr 12, 2024
64178e4
style: Run yapf on `examples/tensorize_vllm_model.py`
sangstar Apr 12, 2024
2f4dcb3
style: Run isort on `examples/tensorize_vllm_model.py`
sangstar Apr 12, 2024
3df1945
style: Fix yapf and isort conflict
sangstar Apr 12, 2024
eb925f0
fix: Remove `tensorizer_args` from `ModelConfig`
sangstar Apr 12, 2024
ba6927d
fix: Add error for device scattering and initial handling for quant
sangstar Apr 13, 2024
bd461cc
perf: Multiple changes in response to comments
sangstar Apr 13, 2024
ca2a3fb
perf: Final changes to resolve comments
sangstar Apr 13, 2024
428f53d
fix: Skip tests if cURL not installed, add example script for testing
sangstar Apr 13, 2024
88f1a67
Run yapf and ruff
sangstar Apr 13, 2024
d2491ac
tests: Install cURL for tensorizer tests for testing suite
sangstar Apr 13, 2024
d77215f
tests: Install libsodium23 for CI tensorizer tests
sangstar Apr 13, 2024
9de338c
fix: Fix testing import path
sangstar Apr 13, 2024
95251d7
Run yapf and ruff
sangstar Apr 13, 2024
feat: Re-add deserializing vLLM models
Integrated changes from the ssteel/tensorizer-support branch that allowed for deserializing vLLM models.
sangstar committed Apr 4, 2024
commit fad72a4c10905456f1939142bb05cd4943cdd801
98 changes: 1 addition & 97 deletions vllm/engine/arg_utils.py
@@ -14,103 +14,7 @@
VisionLanguageConfig)
from vllm.utils import str_to_int_tuple


@dataclass
class TensorizerArgs:
    download_dir: Union[io.BufferedIOBase, io.RawIOBase, typing.BinaryIO, str,
                        bytes, os.PathLike, int]
    device: Optional[Union[torch.device, str]] = None
    dtype: Optional[torch.dtype] = None
    ## Commenting out serializer_encryption until I work out how I want to implement it
    # serializer_encryption: Optional[bool] = False
    lazy_load: bool = False
    plaid_mode_buffers: Optional[int] = None
    verify_hash: bool = False
    filter_func: Optional[Callable[[str], Union[bool, Any]]] = None
    deserializer_encryption_key: Optional[str] = None

    def __post_init__(self):
        self.file_obj = self.tensorizer_uri
        self.s3_access_key_id = os.environ.get("S3_ACCESS_KEY_ID") or None
        self.s3_secret_access_key = os.environ.get("S3_SECRET_ACCESS_KEY") or None
        self.s3_endpoint = os.environ.get("S3_ENDPOINT_URL") or None

        self.credentials = {
            "s3_access_key_id": self.s3_access_key_id,
            "s3_secret_access_key": self.s3_secret_access_key,
            "s3_endpoint": self.s3_endpoint,
        }
        self.serializer_params = {
            # Placeholder for now
        }

        # Omitting self.dtype and self.device as this behaves weirdly
        self.deserializer_params = {
            "filter_func": self.filter_func,
            "lazy_load": self.lazy_load,
            "plaid_mode": True,
            "plaid_mode_buffers": self.plaid_mode_buffers,
            "verify_hash": self.verify_hash,
            "encryption": self.deserializer_encryption_key,
            # "dtype": self.dtype,
            # "device": self.device,
        }

    @staticmethod
    def add_cli_args(
            parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        """Tensorizer CLI arguments"""
        # TODO: Add support for encryption -- CLI args can be base64 encoded
        # key/password for --serializer-encryption. Need to revist
        parser.add_argument(
            "--serializer-encryption",
            action='store_true',
            help="An `EncryptionParams` object holding a password or key"
            "to use for encryption. If None, no encryption will be used.")
        parser.add_argument(
            "--lazy-load",
            action='store_true',
            help="If True, tensors will be loaded and cached when keys are"
            "accessed. If False, all tensors will be loaded into memory up"
            "front.",
        )
        parser.add_argument(
            "--tensorizer-uri",
            help="Path to serialized model tensors. Can be a local file path"
            "or a S3 URI.",
        )
        parser.add_argument(
            "--plaid-mode-buffers",
            default=None,
            help="The number of buffers to use in plaid mode."
            "This is only used if ``plaid_mode=True``. These buffers"
            "are used to pipeline the loading and processing of tensors.")
        parser.add_argument(
            "--verify-hash",
            action='store_true',
            help="If True, the hashes of each tensor will be verified"
            "against the hashes stored in the metadata. A `HashMismatchError`"
            "will be raised if any of the hashes do not match.")
        parser.add_argument(
            "--deserializer-encryption-key",
            default=None,
            help="A `DecryptionParams` object holding a password or key"
            "to use for decryption. ``None`` (the default) means no decryption."
        )
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> 'TensorizerArgs':
        # Get the list of attributes of this dataclass.
        attrs = [attr.name for attr in dataclasses.fields(cls)]
        # Set the attributes from the parsed arguments.
        tensorizer_args = cls(**{
            attr: getattr(args, attr)
            for attr in attrs if hasattr(args, attr)
        })
        return tensorizer_args

from vllm.model_executor.tensorizer_loader import TensorizerArgs

@dataclass
class EngineArgs:
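The removed `__post_init__` above builds S3 credentials from environment variables, collapsing unset or empty values to `None`. A minimal standalone sketch of that pattern (the helper name `s3_credentials` is illustrative, not part of the PR):

```python
import os

def s3_credentials() -> dict:
    # Empty or unset variables collapse to None via `or None`,
    # mirroring TensorizerArgs.__post_init__ in the diff.
    return {
        "s3_access_key_id": os.environ.get("S3_ACCESS_KEY_ID") or None,
        "s3_secret_access_key": os.environ.get("S3_SECRET_ACCESS_KEY") or None,
        "s3_endpoint": os.environ.get("S3_ENDPOINT_URL") or None,
    }
```

The `or None` makes an empty-string environment variable behave the same as an unset one, which is usually what S3 client libraries expect.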
11 changes: 6 additions & 5 deletions vllm/model_executor/model_loader.py
@@ -78,11 +78,12 @@ def get_model(model_config: ModelConfig, device_config: DeviceConfig,
     # Create a model instance.
     # The weights will be initialized as empty tensors.
     with torch.device(device_config.device):
-        if hasattr(model_class, "supported_lora_modules"):
-            from vllm.model_executor.tensorizer_loader import zero_length_init
-            with zero_length_init():
-                model = model_class(model_config.hf_config, linear_method,
-                                    lora_config)
+        if model_config.load_format == "tensorizer" and _is_vllm_model(model_config):
+            model = load_with_tensorizer(model_class, model_config)
+            return model.eval()
+        elif hasattr(model_class, "supported_lora_modules"):
+            model = model_class(model_config.hf_config, linear_method,
+                                lora_config)
         elif lora_config:
             raise ValueError(
                 f"Model {model_class.__name__} does not support LoRA, "
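The new branch in `get_model` dispatches on the load format before falling back to the LoRA and default construction paths. A dependency-free sketch of that dispatch order (the function name and string return values are illustrative; `_is_vllm_model` in the diff is just a substring check on the URI):

```python
def pick_loader(load_format: str, tensorizer_uri: str) -> str:
    # Mirrors get_model's ordering: the tensorizer path is checked
    # first, and everything else falls through to the regular path.
    if load_format == "tensorizer" and "vllm" in tensorizer_uri:
        return "tensorizer"
    return "default"
```

Note that both conditions must hold: a tensorizer load format with a non-vLLM URI still takes the default path at this point in the PR's history.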
212 changes: 204 additions & 8 deletions vllm/model_executor/tensorizer_loader.py
@@ -1,35 +1,228 @@
import contextlib
import contextvars
import dataclasses
import functools
import threading
import time
import typing
from types import MethodType
from typing import Optional
from typing import Type, Union, Any, Callable
import io
import os
import argparse


import torch
from dataclasses import dataclass
from tensorizer import TensorDeserializer, stream_io
from tensorizer.utils import convert_bytes, get_mem_usage, no_init_or_tensor
from torch import nn

from vllm.model_executor.layers.activation import ScaledActivation
from vllm.model_executor.layers.linear import ColumnParallelLinear, MergedColumnParallelLinear, RowParallelLinear, \
QKVParallelLinear
from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding
from vllm.model_executor.models.mixtral import MixtralMoE
from vllm.config import ModelConfig
from vllm.logger import init_logger
from vllm.model_executor.layers.linear import MergedColumnParallelLinear, QKVParallelLinear

logger = init_logger(__name__)

def load_with_tensorizer(model_cls: Type[nn.Module], model_config: ModelConfig) -> nn.Module:
    tensorizer = TensorizerAgent(model_cls, model_config)
    return tensorizer.deserialize()

def _is_vllm_model(model_config: ModelConfig) -> bool:
    return "vllm" in model_config.tensorizer_args.tensorizer_uri

def _make_model_contiguous(model: nn.Module):
    # Ensure tensors are saved in memory contiguously
    for param in model.parameters():
        param.data = param.data.contiguous()


@dataclass
class TensorizerArgs:
    tensorizer_uri: Union[
        io.BufferedIOBase,
        io.RawIOBase,
        typing.BinaryIO,
        str,
        bytes,
        os.PathLike,
        int,
    ]
    device: Optional[Union[torch.device, str]] = None
    dtype: Optional[torch.dtype] = None
    ## Commenting out serializer_encryption until I work out how I want to implement it
    # serializer_encryption: Optional[bool] = False
    lazy_load: bool = False
    plaid_mode_buffers: Optional[int] = None
    verify_hash: bool = False
    filter_func: Optional[Callable[[str], Union[bool, Any]]] = None
    deserializer_encryption_key: Optional[str] = None

    def __post_init__(self):
        self.file_obj = self.tensorizer_uri
        self.s3_access_key_id = os.environ.get("S3_ACCESS_KEY_ID") or None
        self.s3_secret_access_key = os.environ.get("S3_SECRET_ACCESS_KEY") or None
        self.s3_endpoint = os.environ.get("S3_ENDPOINT_URL") or None

        self.credentials = {
            "s3_access_key_id": self.s3_access_key_id,
            "s3_secret_access_key": self.s3_secret_access_key,
            "s3_endpoint": self.s3_endpoint,
        }
        self.serializer_params = {
            # Placeholder for now
        }

        # Omitting self.dtype and self.device as this behaves weirdly
        self.deserializer_params = {
            "filter_func": self.filter_func,
            "lazy_load": self.lazy_load,
            "plaid_mode": self.device != "cpu",
            "plaid_mode_buffers": self.plaid_mode_buffers,
            "verify_hash": self.verify_hash,
            "encryption": self.deserializer_encryption_key,
            # "dtype": self.dtype,
            # "device": self.device,
        }

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        """Tensorizer CLI arguments"""
        # TODO: Add support for encryption -- CLI args can be base64-encoded
        # key/password for --serializer-encryption. Need to revisit
        parser.add_argument(
            "--serializer-encryption",
            action="store_true",
            help="An `EncryptionParams` object holding a password or key "
            "to use for encryption. If None, no encryption will be used.",
        )
        parser.add_argument(
            "--lazy-load",
            action="store_true",
            help="If True, tensors will be loaded and cached when keys are "
            "accessed. If False, all tensors will be loaded into memory up "
            "front.",
        )
        parser.add_argument(
            "--tensorizer-uri",
            help="Path to serialized model tensors. Can be a local file path "
            "or an S3 URI.",
        )
        parser.add_argument(
            "--plaid-mode-buffers",
            default=None,
            help="The number of buffers to use in plaid mode. "
            "This is only used if ``plaid_mode=True``. These buffers "
            "are used to pipeline the loading and processing of tensors.",
        )
        parser.add_argument(
            "--verify-hash",
            action="store_true",
            help="If True, the hashes of each tensor will be verified "
            "against the hashes stored in the metadata. A `HashMismatchError` "
            "will be raised if any of the hashes do not match.",
        )
        parser.add_argument(
            "--deserializer-encryption-key",
            default=None,
            help="A `DecryptionParams` object holding a password or key "
            "to use for decryption. ``None`` (the default) means no decryption.",
        )
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "TensorizerArgs":
        # Get the list of attributes of this dataclass.
        attrs = [attr.name for attr in dataclasses.fields(cls)]
        # Set the attributes from the parsed arguments.
        tensorizer_args = cls(
            **{attr: getattr(args, attr) for attr in attrs if hasattr(args, attr)}
        )
        return tensorizer_args
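`from_cli_args` filters the parsed namespace down to the dataclass's own fields, silently dropping unrelated CLI options. The same pattern in isolation, on a toy dataclass (all names here are illustrative):

```python
import argparse
import dataclasses
from dataclasses import dataclass

@dataclass
class ToyArgs:
    uri: str = ""
    lazy_load: bool = False

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "ToyArgs":
        # Only keep namespace entries that match a dataclass field.
        attrs = [f.name for f in dataclasses.fields(cls)]
        return cls(**{a: getattr(args, a) for a in attrs if hasattr(args, a)})

parser = argparse.ArgumentParser()
parser.add_argument("--uri", default="model.tensors")
parser.add_argument("--lazy-load", action="store_true")
parser.add_argument("--unrelated", default="ignored")  # not a field; dropped
args = parser.parse_args(["--lazy-load"])
toy = ToyArgs.from_cli_args(args)
```

argparse converts `--lazy-load` to the attribute `lazy_load`, so the field names line up with the namespace automatically.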



class TensorizerAgent:
    def __init__(self, model_cls: Type[nn.Module], model_config: ModelConfig):
        self.model_cls = model_cls
        self.model_config = model_config
        self.tensorizer_args = self.model_config.tensorizer_args
        self.serialize_model = not self._verify_path_reachable()
        self.model = self._init_model()

    def _init_model(self):
        model_args = self.model_config.hf_config
        model_args.torch_dtype = self.model_config.dtype
        model = no_init_or_tensor(lambda: self.model_cls(model_args))
        return model

    def _verify_path_reachable(self):
        if not self.tensorizer_args.tensorizer_uri.endswith(".tensors"):
            raise ValueError(
                f"download_dir {self.tensorizer_args.tensorizer_uri} must specify "
                f"a .tensors file when load_format = tensorizer")

    def deserialize(self):
        before_mem = get_mem_usage()
        # Lazy load the tensors from S3 into the model.
        start = time.time()
        stream = stream_io.open_stream(
            self.tensorizer_args.tensorizer_uri,
            mode="rb",
            **self.tensorizer_args.credentials)
        deserializer = TensorDeserializer(
            stream, **self.tensorizer_args.deserializer_params)
        deserializer.load_into_module(self.model)
        self.model = self.model.to(dtype=self.model_config.dtype)
        end = time.time()

        # Brag about how fast we are.
        total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
        duration = end - start
        per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
        after_mem = get_mem_usage()
        deserializer.close()
        logger.info(
            f"Deserialized {total_bytes_str} in {duration:0.2f}s, {per_second}/s")
        logger.info(f"Memory usage before: {before_mem}")
        logger.info(f"Memory usage after: {after_mem}")

        return self.model.eval()
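`deserialize` reports throughput with `convert_bytes` from `tensorizer.utils`. A simplified stand-in showing the arithmetic behind those log lines (the real helper's formatting may differ; the numbers below are made up):

```python
def convert_bytes(n: float) -> str:
    # Human-readable byte count, base 1024.
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TiB"

# Throughput is total bytes deserialized divided by wall-clock seconds,
# as in the "Deserialized ... in ...s, .../s" log line.
total_tensor_bytes = 13_000_000_000
duration = 10.0
per_second = convert_bytes(total_tensor_bytes / duration)
```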

# def serialize(self):
# with torch.device("cuda"):
# model = self.model_cls(self.model_config.hf_config)
# self.model_config.load_format = "auto"
# model.load_weights(
# self.model_config.model,
# self.model_config.download_dir,
# self.model_config.load_format,
# self.model_config.revision,
# )
# _make_model_contiguous(model)
# stream = stream_io.open_stream(self.tensorizer_args.download_dir, "wb", **self.credentials)
# serializer = TensorSerializer(stream, **self.serialize_args)
# logger.info(
# f"Serializing model tensors {self.model_config.model} to {self.tensorizer_args.download_dir}."
# )
# serializer.write_module(model)
# serializer.close()
# logger.info(
# f"Serialization complete. Running the previous command will deserialize the saved model weights."
# )
# return model.eval()


## Monkey patch for Parameter to ensure `requires_grad=False`
from torch.nn.parameter import Parameter

# Save the original __new__ method for later use
original_new = Parameter.__new__

def _new(cls, data, requires_grad=False):
    return original_new(cls, data, requires_grad=requires_grad)

# Replace the original __new__ method with our new one
Parameter.__new__ = _new
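The monkey patch above flips the default of `requires_grad` so every parameter created during deserialization is inference-only. The same save-wrap-replace pattern on a torch-free stand-in class (`FakeParam` is illustrative, not part of the PR):

```python
class FakeParam:
    def __init__(self, data, requires_grad=True):
        self.data = data
        self.requires_grad = requires_grad

# Save the original, then install a wrapper that changes the default.
_original_init = FakeParam.__init__

def _patched_init(self, data, requires_grad=False):
    _original_init(self, data, requires_grad=requires_grad)

FakeParam.__init__ = _patched_init
```

Callers that pass `requires_grad` explicitly are unaffected; only the default changes, which is the point of the patch.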

def tensorizer_loader(params_dict):
    return _TensorizerWeightsLoaderImpl(params_dict).context_manager()
@@ -168,6 +361,9 @@ def _torch_empty_substitute(*args, **kwargs):
    args = ((*dimension, 0),)
    return _torch_empty(device="cuda", requires_grad=False, *args, **kwargs)




# def vpe_weight_loader(self, param: nn.Parameter, loaded_weight: torch.Tensor):
# param_data = param.data
# if self.input_is_parallel:
2 changes: 1 addition & 1 deletion vllm/model_executor/weight_utils.py
@@ -298,7 +298,7 @@ def hf_model_weights_iterator(
         deserializer_args = tensorizer_args.deserializer_params
         credentials = tensorizer_args.credentials
         stream = open_stream(tensorizer_args.tensorizer_uri, **credentials)
-        with TensorDeserializer(stream, **deserializer_args, device="cuda:0") as state:
+        with TensorDeserializer(stream, **deserializer_args, device="cpu") as state:
             for name, param in state.items():
                 yield name, param
             del state
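`hf_model_weights_iterator` yields `(name, tensor)` pairs from inside the deserializer's context manager, so the stream is released only after iteration finishes. The control flow with a stand-in deserializer (`fake_deserializer` is illustrative; the real `TensorDeserializer` exposes a name-to-tensor mapping the same way):

```python
from contextlib import contextmanager

@contextmanager
def fake_deserializer(tensors):
    # Stand-in for TensorDeserializer as a context manager:
    # expose a name -> tensor mapping, release it on exit.
    state = dict(tensors)
    try:
        yield state
    finally:
        state.clear()

def weights_iterator(tensors):
    # Generator: the `with` block stays open until the caller
    # exhausts the iterator, just like in the diff.
    with fake_deserializer(tensors) as state:
        for name, param in state.items():
            yield name, param

pairs = list(weights_iterator({"embed.weight": [0.1], "lm_head.weight": [0.2]}))
```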