This RFC proposes improvements to the management of Low-Rank Adaptation (LoRA) in vLLM to make it more suitable for production environments. This proposal aims to address several pain points observed in the current implementation. Feedback and discussions are welcome, and we hope to gather input and refine the proposal based on community insights.
Motivation
LoRA integration in production environments faces several challenges that need to be addressed to ensure smooth and efficient deployment and management. The main issues observed include:
- **Visibility of LoRA Information**: Currently, the relationship between LoRA adapters and their base models is not exposed clearly by the engine; the `/v1/models` endpoint does not display this information. Related issues: [Feature]: Expose Lora lineage information from /v1/models #6274
- **Dynamic Loading and Unloading**: LoRA adapters cannot be dynamically loaded or unloaded after the server has started. Related issues: Multi-LoRA - Support for providing /load and /unload API #3308, [Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors #4068, [Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491
- **Remote Registry Support**: LoRA adapters cannot be pulled from remote model repositories at runtime, making it cumbersome to manage artifacts locally. Related issues: [Feature]: Support loading lora adapters from HuggingFace in runtime #6233, [Bug]: relative path doesn't work for Lora adapter model #6231
- **Observability**: There is a lack of LoRA-specific metrics, making LoRA deployments difficult to monitor and manage.
- **Cluster-Level Support**: Information about LoRA adapters is not easily accessible to resource managers, hindering support for service discovery, load balancing, and scheduling in cluster environments. Related issues: [RFC]: Add control panel support for vLLM #4873
Proposed Change.
1. Support Dynamically Loading or Unloading LoRA Adapters
To enhance flexibility and manageability, we propose introducing the ability to dynamically load and unload LoRA adapters at runtime.

- Expose `/v1/add_adapter` and `/v1/remove_adapter` in `api_server.py`.
- Introduce lazy and eager loading modes for LoRA adapters to provide more flexibility in deployment strategies. In lazy mode, we can simply attach the LoRA to the `LoRARequest`; in eager mode, the engine should load the LoRA via the `lora_manager` explicitly.
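The registry logic behind these two endpoints could be sketched as follows. This is a minimal illustration, not the actual vLLM API: `AdapterRegistry`, `LoadMode`, and `load_fn` are hypothetical names, and `load_fn` stands in for whatever call the engine exposes through its `lora_manager`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Set


class LoadMode(Enum):
    LAZY = "lazy"    # adapter weights are loaded on first request
    EAGER = "eager"  # adapter weights are loaded at registration time


@dataclass
class AdapterRegistry:
    # load_fn is a stand-in for the engine's LoRA loading hook (hypothetical).
    load_fn: Callable[[str, str], None]
    mode: LoadMode = LoadMode.LAZY
    adapters: Dict[str, str] = field(default_factory=dict)  # name -> path
    loaded: Set[str] = field(default_factory=set)

    def add_adapter(self, name: str, path: str) -> None:
        """Handler body for POST /v1/add_adapter."""
        self.adapters[name] = path
        if self.mode is LoadMode.EAGER:
            self.load_fn(name, path)
            self.loaded.add(name)

    def remove_adapter(self, name: str) -> None:
        """Handler body for POST /v1/remove_adapter."""
        self.adapters.pop(name, None)
        self.loaded.discard(name)

    def ensure_loaded(self, name: str) -> None:
        """Called per request in lazy mode, before building the LoRARequest."""
        if name not in self.loaded:
            self.load_fn(name, self.adapters[name])
            self.loaded.add(name)
```

In lazy mode, registration is cheap and the load cost is paid on the first request that references the adapter; eager mode front-loads that cost so the first request sees no latency spike.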
2. Load LoRA Adapters from Remote Storage
Enabling LoRA adapters to be loaded from remote storage at runtime will simplify artifact management and deployment. The technical detail could be adding a `get_adapter_absolute_path` helper that:

- expands relative paths to absolute ones, and
- downloads Hugging Face models and returns the local snapshot path.

We would also refactor the LoRA path reference from `lora_local_path` to `local_path`.
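A minimal sketch of such a helper, assuming `huggingface_hub` is available for the remote fallback (it is imported lazily so purely local paths work without it):

```python
import os


def get_adapter_absolute_path(lora_path: str) -> str:
    """Resolve a LoRA path: expand ~ and relative paths to an absolute
    path, and fall back to treating the argument as a Hugging Face repo
    id when no such local path exists."""
    expanded = os.path.abspath(os.path.expanduser(lora_path))
    if os.path.exists(expanded):
        return expanded
    # Not a local path: download the adapter and return the snapshot
    # directory. Lazy import keeps the local-path case dependency-free.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=lora_path)
```

With this in place, `--lora-modules sql=./adapters/sql` and `--lora-modules sql=org/sql-lora-adapter` would both resolve to a usable local directory.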
3. Build Better LoRA Model Lineage
To improve the visibility and management of LoRA models, we propose building more robust model lineage metadata. This system will:

- Update `LoRAParserAction` to support JSON input, asking the user to explicitly specify the base model: https://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/cli_args.py#L16-L23
- Introduce `BaseModelPath` to replace `served_model_names`: https://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/serving_engine.py#L33. It would be great to pass the model path and model names separately.
- Update `show_available_models` to populate the `root` and `parent` fields: https://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/serving_engine.py#L61-L62
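One way the JSON-aware `LoRAParserAction` could look is sketched below. The field names (`name`, `path`, `base_model_name`) are assumptions for illustration; the legacy `name=path` form is kept for backward compatibility.

```python
import argparse
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRAModulePath:
    name: str
    path: str
    base_model_name: Optional[str] = None  # explicit lineage, new in this RFC


class LoRAParserAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        modules = []
        for item in values:
            if item.lstrip().startswith("{"):
                # New JSON form: {"name": ..., "path": ..., "base_model_name": ...}
                modules.append(LoRAModulePath(**json.loads(item)))
            else:
                # Legacy "name=path" form, no lineage information.
                name, path = item.split("=", 1)
                modules.append(LoRAModulePath(name=name, path=path))
        setattr(namespace, self.dest, modules)


parser = argparse.ArgumentParser()
parser.add_argument("--lora-modules", nargs="+", action=LoRAParserAction)
```

With the base model captured at parse time, `show_available_models` can then report each adapter's `parent` rather than leaving the lineage implicit.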
4. LoRA Observability Enhancement
Improving observability by adding metrics specific to LoRA will help in better monitoring and management. Proposed metrics include:
- Loading and unloading times for LoRA adapters.
- Memory and compute resource usage by LoRA adapters.
- Performance impact on base models when using LoRA adapters.
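The first of these metrics could be collected with something as simple as the stdlib-only sketch below; a real implementation would export the values through vLLM's existing Prometheus metrics rather than this hypothetical `LoRAMetrics` class.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LoRAMetrics:
    """Toy collector for LoRA load timings and active-adapter count."""

    def __init__(self):
        self.load_seconds = defaultdict(list)  # adapter name -> load durations
        self.active_adapters = 0

    @contextmanager
    def time_load(self, name: str):
        # Wrap the adapter-loading call site to record its duration.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.load_seconds[name].append(time.perf_counter() - start)
            self.active_adapters += 1
```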
5. Control Plane Support (Service Discovery, Load Balancing, Scheduling) for LoRAs
Since the vLLM community focuses primarily on the inference engine, the cluster-level features will be covered in a separate design I am working on in Kubernetes WG-Serving. I will link back to this issue shortly.
PR List
- [Doc] Fix the lora adapter path in server startup script #6230
- [Core] Support dynamically loading Lora adapter from HuggingFace #6234
- [Core] Support load and unload LoRA in api server #6566
- [Core] Support Lora lineage and base model metadata management #6315
Feedback Period.
No response
CC List.
Note: Please help tag the right people who have worked in this area.
Any Other Things.
No response