[Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors

### 🚀 The feature, motivation and pitch

PPO and a number of other LLM fine-tuning techniques require autoregressive generation as part of the training process. When using vLLM to speed up the autoregressive generation part of the training loop, is there an efficient way to update the weights of the LLM? Specifically, in the case of LoRA fine-tuning, is there a way to efficiently swap out the adapters without having to save them to the filesystem?

### Alternatives

## Efficient LoRA adapter update

Possible workaround without any code change: save adapters to an in-memory file-system (e.g., `/dev/shm`) and point to that directory in each LoRARequest. This workaround:
- Avoids disk read/write bottleneck and SSD wear.
- Still incurs the overhead of safetensors serialization and deserialization.

Proposed change: modify LoRARequest to allow adapters to be specified as a dictionary of tensors.
- Modify class definition of LoRARequest   
  - mark `lora_local_path: str` as optional
  - add new optional `lora_tensors: dict[str, torch.Tensor]` attribute.
- Modify WorkerLoRAManager `_load_lora` implementation ([vllm/lora/worker_manager.py](https://github.com/vllm-project/vllm/blob/2cd6b4f3625466eb5849bcfd7a6fb316735adab8/vllm/lora/worker_manager.py#L137)) 
  - verify that the given LoRARequest specifies exactly one of `lora_local_path` and `lora_tensors`.
  - optionally, move the logic for checking `unexpected_modules` into a separate method.
  - if `lora_tensors` is provided in the LoRARequest:
    - check for `unexpected_modules` in the given dict of tensors.
    - invoke `from_lora_tensors` instead of `from_local_checkpoint`.
    
## Alternative approach: non-LoRA parameter update

- [OpenRLHF](https://github.com/OpenLLMAI/OpenRLHF) replaces vLLM model parameters with in-memory tensors by overriding `hf_model_weights_iterator` and invoking `load_weights` for each tensor in the dict. ([source](https://github.com/OpenLLMAI/OpenRLHF/blob/9f1707/openrlhf/trainer/ray/ppo_actor.py#L123-L145), [patch](https://github.com/OpenLLMAI/OpenRLHF/blob/9f1707/openrlhf/trainer/ray/vllm_worker_wrap.py#L43)) 

### Additional context

LLM fine-tuning objectives such as PPO require autoregressive text generation during training, with the requirement that a reasonably up-to-date copy of the model is used for generation. 

As of the v0.4.0 vLLM release, when instantiating a vLLM `LoRARequest`, the LoRA adapters are specified through the `lora_local_path: str` attribute. ([source](https://github.com/vllm-project/vllm/blob/2cd6b4f3625466eb5849bcfd7a6fb316735adab8/vllm/lora/request.py#L20C5-L20C25)) In the LoRA PPO example above, if the vLLM instance is on the same machine as the `peft` training loop, sending a new copy of the adapter weights to vLLM would require the following steps:

- Invoke `peft.PeftModel.save_pretrained` to save the adapter tensor state dict (as `folder_name/adapter_model.safetensors`) to a local path on disk. Behind the scene, this method would:
  - Invoke `peft.utils.get_peft_model_state_dict` to obtain the tensor dict, and then 
  - Invoke `safetensors.torch.save_file` to serialize the lora tensors dict to filesystem. (**serialization** overhead)
- Instantiate a vLLM `LoRARequest` and set `lora_local_path` attribute to the updated value.
- Send this `LoRARequest` to the vLLM Engine. Behind the scene, vLLM would:
  - Invoke `LoRAModel.from_local_checkpoint` ([source](https://github.com/vllm-project/vllm/blob/2cd6b4f3625466eb5849bcfd7a6fb316735adab8/vllm/lora/models.py#L191))
  - Verify that all `target_modules` listed in the peft config are supported. 
  - Load lora tensors dict from filesystem into CPU memory. (**deserialization** overhead)
  - If additional embedding tensors are provided, load these into CPU memory also.
  - Invoke `LoRAModel.from_lora_tensors` ([source](https://github.com/vllm-project/vllm/blob/2cd6b4/vllm/lora/models.py#L131)) to instantiate the LoRAModel.

If the proposed alternative is adopted, the new workflow be like: 

- Invoke `peft.utils.get_peft_model_state_dict` on the LoRA model to obtain the lora tensors dict (same as the one written to disk in the workaround above.)
- Instantiate a vLLM LoRARequest and include a pointer to this lora tensors dict.
- Send this `LoRARequest` to the vLLM Engine. Behind the scene, vLLM would:
  - Invoke `LoRAModel.from_lora_tensors` ([source](https://github.com/vllm-project/vllm/blob/2cd6b4/vllm/lora/models.py#L131)) to instantiate the updated LoRAModel.


## Related Issues

The idea of adding new LoRA adapters without restarting vLLM is related to #3308 with some differences:
- LoRA adapters in this feature request are in memory on the same machine as the one running the vLLM server, whereas 3308 proposes loading new adapters from disk.
- This feature request primarily addresses the vLLM Python API, whereas 3308 addresses the OpenAI-compatible HTTP API.

If the changes proposed in this feature request are merged, these features could eventually be added to the vLLM OpenAI-compatible HTTP API to e.g., allow trusted remote users to add LoRA adapters to a vLLM server without first writing the adapters to a filesystem on the server.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors #4068

🚀 The feature, motivation and pitch

Alternatives

Efficient LoRA adapter update

Alternative approach: non-LoRA parameter update

Additional context

Related Issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors #4068

Description

🚀 The feature, motivation and pitch

Alternatives

Efficient LoRA adapter update

Alternative approach: non-LoRA parameter update

Additional context

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions