Problem statement:
In a production system there should be an API to add/remove fine-tuned weights dynamically; the inference caller should not have to specify the LoRA location with each call.
The current Multi-LoRA support loads adapters during inference calls, which does not check whether the fine-tuned weights are already loaded and ready for inference.
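For context, here is a minimal sketch of the current per-call pattern with vLLM's offline LLM API; the model name, adapter name, and path below are placeholders. Every generate call must carry the full LoRARequest, including the adapter's location on disk.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder model

# The adapter name, id, and local path travel with every single call.
outputs = llm.generate(
    ["Write a SQL query that ..."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest(
        lora_name="sql-adapter",            # placeholder name
        lora_int_id=1,
        lora_local_path="/models/lora/"))   # placeholder path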
Proposal:
Introduce two endpoints, /load and /unload, to manage fine-tuned weights in vLLM.
POST /load
-> add fine-tuned weights to the set of served models.
POST /unload
-> remove fine-tuned weights from the models list.
This lets the vLLM server track the set of fine-tuned weights it is serving.
As a result, inference requests no longer need to carry fine-tuned weight names and locations; example client calls are sketched below.
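For illustration, client calls to the proposed endpoints might look like the following sketch; the host, port, and lora_path value are assumptions, not part of the proposal.

import requests

# Register fine-tuned weights once, up front. After this, inference
# calls no longer need to mention the adapter at all.
requests.post("http://localhost:8000/load",
              json={"lora_path": "/models/lora/sql-adapter"})

# Remove the weights again when they are no longer needed.
requests.post("http://localhost:8000/unload")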
Sample code:
from fastapi import FastAPI, Request, Response
from vllm.lora.request import LoRARequest

app = FastAPI()

# Currently registered adapter (None means no adapter is loaded) and
# the next integer id to assign to a LoRARequest.
lora_request = None
index = 1


@app.post("/load")
async def load(request: Request) -> Response:
    """Register fine-tuned weights from a local path so that later
    inference requests can use them without naming them."""
    global lora_request, index
    request_dict = await request.json()
    lora_local_path = request_dict.pop("lora_path", "/models/lora/")
    lora_request = LoRARequest(
        lora_name=lora_local_path,  # the path doubles as the adapter name
        lora_int_id=index,
        lora_local_path=lora_local_path)
    index = index + 1
    return Response(status_code=201)


@app.post("/unload")
async def unload(request: Request) -> Response:
    """Drop the currently registered fine-tuned weights."""
    global lora_request, index
    lora_request = None
    if index > 1:
        index = index - 1
    return Response(status_code=201)
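To close the loop, a /generate handler could then apply the stored adapter itself, so inference requests carry only the prompt and sampling options. The following is a hypothetical sketch, assuming an AsyncLLMEngine named engine is built at startup with enable_lora=True and that the installed vLLM version's generate method accepts a lora_request argument.

import uuid

from fastapi.responses import JSONResponse
from vllm import SamplingParams


@app.post("/generate")
async def generate(request: Request) -> Response:
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    # The globally registered adapter is applied automatically; the
    # caller never names a LoRA path. `engine` is assumed to be an
    # AsyncLLMEngine created at server startup.
    final_output = None
    async for output in engine.generate(
            prompt,
            SamplingParams(**request_dict),
            request_id=str(uuid.uuid4()),
            lora_request=lora_request):
        final_output = output
    return JSONResponse({"text": [o.text for o in final_output.outputs]})

One caveat of the global-state design sketched here: only one adapter is active at a time, and concurrent /load and /unload calls race against in-flight generations.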