[Frontend] [Core] feat: Add model loading using tensorizer #3476
Conversation
@cadedaniel @rkooo567 Pinging for an assigned reviewer from the team when possible!
@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96 All tests are passing. Can I get eyes on this please? Cheers!
This feature allows vLLM models to be loaded extremely fast using `tensorizer`. `tensorizer` loads serialized model tensors from HTTP/HTTPS, Redis, S3 endpoints, or locally, typically on the scale of multiple GB/s.
This allows the deserializer to access S3 credentials when reading: the access key is taken from the `S3_ACCESS_KEY_ID` environment variable and the secret key from `S3_SECRET_ACCESS_KEY`.
The previous commit wasn't able to pass S3 credentials through `TensorDeserializer`, so they are instead passed to `stream_io.open_stream`, and the resulting stream is used to instantiate the `TensorDeserializer`.
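For reference, a minimal sketch of this credential routing, assuming the `tensorizer==2.8.0` API (in particular `open_stream`'s `s3_access_key_id`/`s3_secret_access_key` keyword arguments and `TensorDeserializer`'s `plaid_mode`/`device` parameters); the URI is a placeholder and this is not the PR's exact code:

```python
# Sketch only -- not the PR's exact implementation.
import os

from tensorizer import TensorDeserializer
from tensorizer.stream_io import open_stream

# Placeholder location for the serialized model tensors.
tensorizer_uri = "s3://my-bucket/my-model.tensors"

# Credentials are read from the environment and passed to open_stream,
# rather than to TensorDeserializer directly.
stream = open_stream(
    tensorizer_uri,
    mode="rb",
    s3_access_key_id=os.environ.get("S3_ACCESS_KEY_ID"),
    s3_secret_access_key=os.environ.get("S3_SECRET_ACCESS_KEY"),
)

# plaid_mode streams tensors straight onto the GPU for fast loading.
deserializer = TensorDeserializer(stream, plaid_mode=True, device="cuda")
# deserializer.load_into_module(model)  # copy weights into an instantiated model
```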
Removed functionality allowing tensorizing without `plaid_mode`, and updated to `tensorizer==2.8.0`.
Replaces `download_dir` with `tensorizer_uri` as a `TensorizerArgs` param. This follows discussions on `download_dir` being a confusing and ultimately unhelpful parameter for the location of model tensors. Instead, `download_dir` reverts to the definition matching the convention set by HuggingFace: a location to download weights to for caching. A new parameter takes its place, `tensorizer_uri`, which specifically deals with locating model tensors for `tensorizer`.
Also a slight formatting fix for the warning emitted when loading weights with `download_dir` not set to `None`.
Integrated changes from the ssteel/tensorizer-support branch that allowed for deserializing vLLM models.
Integrated previous support for vLLM-formatted model loading with `tensorizer` that makes full use of direct-to-GPU loading with `plaid_mode`, while falling back to loading HuggingFace models on the CPU so that vLLM can perform its manual GPU loading.
Fixed some unnecessary formatting changes in `arg_utils.py`, `weight_utils.py`, and `model_loader.py`; fixed improperly passing `force_http` to `TensorDeserializer` rather than `open_stream`.
Misc. fixes from now-resolved conversations, mostly consisting of changes to syntax, style, docstrings, and versioning.
`examples/tensorize_vllm_model.py` now correctly instantiates vLLM-formatted models.
Replaced the model initialization process with one using `LLMEngine`, allowing vLLM to handle, and therefore optimize, the initial model loading process. Added testing for quantization.
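As a rough illustration of initializing a model through the engine (a sketch only; the model name and arguments below are placeholders rather than the example script's actual values):

```python
# Sketch: letting vLLM's LLMEngine drive model initialization so that
# serialization/deserialization operates on a fully constructed model.
from vllm import EngineArgs, LLMEngine

engine_args = EngineArgs(model="facebook/opt-125m")  # placeholder model
engine = LLMEngine.from_engine_args(engine_args)
# The example script would then serialize (or deserialize into) the model
# held by this engine; that wiring is omitted here.
```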
Thank you all very much for your reviews! I've implemented the changes from @ywang96 's comments. To summarize:
LGTM! Thank you again @sangstar for all the work and test coverage on this PR to add this feature!
I was able to successfully test this in a vLLM 0.4.1 container running on OpenShift, both for models serialized with the Tensorizer library directly and for vLLM-serialized models. Once I cranked up the Pod's CPU and increased the
That's an awesome improvement, and thank you!
I'm thrilled to hear that! I actually have a new PR up, #4208, that uses the full
Feature Request Issue: Tensorizer Support
This PR allows models used for the OpenAI-compatible API server to be loaded using
Coreweave's Tensorizer, enabling extremely fast (faster than cached Safetensors) model loads from HTTP/HTTPS,
Redis, and S3 endpoints.
The key changes involved are:
- Adds `TensorizerConfig` to the set of configs.
- Adds `tensorizer_loader.py` to `vllm/model_executor` that provides utility functions for tensorizer.
- Allows the user to specify the path to serialized-by-tensorizer model tensors, as well as arguments for tensorizer's deserializer.
- Supports deserializing serialized vLLM-formatted models, allowing the use of loading with `plaid_mode`, which can allow Llama 2 13B, loaded non-locally, to start serving requests in as little as 10 seconds. Also supports encrypting and decrypting model tensors.
- Adds a `tensorize_vllm_model.py` script to `examples/` that allows vLLM models to be serialized and deserialized with `tensorizer`.
- Adds `tensorizer` as an optional dependency.

Credentialing for S3 is supported by passing a user's access and secret key via the `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables respectively. They can also be specified as CLI args to the API server entrypoint.

Model loading benchmarks
Tensorizer can load models like Llama 2 13B in as little as 10 seconds. In order to do so, a model must be serialized using `TensorSerializer` to a `.tensors` file located either locally or through an S3, HTTP/HTTPS, or Redis endpoint. `--tensorizer-uri` must be specified with the serialized tensors' location when invoking the API server.

Example usage:
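The command below is only an illustrative sketch: the model name and S3 path are placeholders, and the exact set of flags required may differ.

```bash
# Illustrative only -- model name and tensors location are placeholders.
export S3_ACCESS_KEY_ID=...
export S3_SECRET_ACCESS_KEY=...

python -m vllm.entrypoints.openai.api_server \
    --model EleutherAI/pythia-1.4b \
    --tensorizer-uri s3://my-bucket/pythia-1.4b/model.tensors
```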
If a vLLM model is serialized, `plaid_mode` can be used, which loads much faster. The following plot demonstrates model loading time benchmarks for vLLM's OpenAI-compatible inference server on an NVIDIA A40 GPU.

Tensorizer is so fast that it loads models faster than Safetensors even locally.