
LoRA Adapter Hot Swap Implementation Problem #10374

Closed
@michaellin99999

Description


I have been following the discussions in the following threads:

Pull Request #8332
Pull Request #8857
I believe that the ideal implementation of "hot swap" should address the following scenario:

When processing a request, llama.cpp should be able to dynamically determine and apply the correct LoRA adapter based on the specific requirements of the request. While I understand that the current implementation involves a scaling mechanism, this approach introduces significant issues.

For example, when llama.cpp is running as a server handling multiple simultaneous requests with different LoRA adapters, the scaling method creates a problematic dependency. If Request 1 comes in requiring LoRA Adapter 1, the scaling is adjusted to prioritize Adapter 1. However, if Request 2 arrives shortly afterward, requiring LoRA Adapter 2, the scaling is adjusted again, effectively disabling Adapter 1 in favor of Adapter 2. This adjustment disrupts Request 1 if it is still in the middle of processing.

This issue becomes even more pronounced in streaming scenarios where many concurrent requests are in flight, as is often the case in production systems.
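
To make the failure mode concrete, here is a minimal sketch of the race (Python with `requests`; the `/lora-adapters` and `/completion` endpoints and their payload shapes reflect my reading of the current server API and may differ in detail):

```python
import threading
import requests

SERVER = "http://localhost:8080"  # assumed llama-server address

def generate_with_adapter(adapter_id: int, prompt: str) -> str:
    # Step 1: set the *global* adapter scales so only `adapter_id` is active.
    # This is the scaling mechanism described above; the setting is shared by
    # every request the server is currently processing.
    scales = [{"id": i, "scale": 1.0 if i == adapter_id else 0.0} for i in range(2)]
    requests.post(f"{SERVER}/lora-adapters", json=scales)

    # Step 2: run the completion. If another thread calls this function with a
    # different adapter id while this generation is still in progress, the
    # scales flip mid-request and the rest of the output is produced with the
    # wrong adapter.
    r = requests.post(f"{SERVER}/completion",
                      json={"prompt": prompt, "n_predict": 256})
    return r.json()["content"]

# Request 1 (adapter 0) and Request 2 (adapter 1) arriving almost at the same time:
t1 = threading.Thread(target=generate_with_adapter, args=(0, "Answer as character A: ..."))
t2 = threading.Thread(target=generate_with_adapter, args=(1, "Answer as character B: ..."))
t1.start(); t2.start()
t1.join(); t2.join()
```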

Why must LoRA adapters rely on global scaling adjustments? Why can't they be applied independently per request? In both threads (#8332 and #8857), other users emphasize that the entire purpose of hot swap functionality is per-request adapter switching. Yet the authors repeatedly suggest merging adapters into the base model beforehand, citing computational expense, and have effectively dismissed the users proposing this change.
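
What the other users and I are asking for is roughly the following: the adapter choice travels with the request instead of being global server state. The `lora` field shown here is purely illustrative and is not, to my knowledge, an existing llama.cpp parameter:

```python
import requests

SERVER = "http://localhost:8080"  # assumed llama-server address

# Hypothetical per-request adapter selection: each request carries the adapter
# (or adapter/scale pairs) it needs, the server applies it only to that
# request's slot, and concurrent requests are unaffected.
r1 = requests.post(f"{SERVER}/completion", json={
    "prompt": "Answer as character A: ...",
    "n_predict": 256,
    "lora": [{"id": 0, "scale": 1.0}],  # illustrative field, not an existing parameter
})
r2 = requests.post(f"{SERVER}/completion", json={
    "prompt": "Answer as character B: ...",
    "n_predict": 256,
    "lora": [{"id": 1, "scale": 1.0}],
})
print(r1.json()["content"])
print(r2.json()["content"])
```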

However, the whole point of hot swap is precisely to avoid merging, which is impractical in many real-world applications. Whether for runtime environments, pre-deployment preparation, or edge devices, merging is often not feasible, especially when content is updated dynamically or the feature set keeps growing.

For example, in a system where NPCs need to roleplay various characters that can be expanded or updated, hot swapping LoRA adapters on a per-request basis is essential.

I also note that this hot-swap functionality is already implemented in frameworks like ollama and vLLM. Why, then, has it not been implemented in llama.cpp? (Or perhaps I've missed something and the feature already exists; if so, I'd appreciate guidance on how to use it.) At the moment, I do not see this capability.
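
For comparison, this is roughly how per-request selection works in vLLM's OpenAI-compatible server, as far as I understand it: adapters are registered at startup and each request picks one by name via the `model` field (the adapter names and paths below are placeholders):

```python
# Server side (shell), registering two adapters at startup:
#   vllm serve <base-model> --enable-lora \
#       --lora-modules npc-a=/path/to/adapter_a npc-b=/path/to/adapter_b
#
# Client side: each request selects its adapter by name, and concurrent
# requests using different adapters do not interfere with one another.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.completions.create(
    model="npc-a",                       # adapter name registered above
    prompt="Answer as character A: ...",
    max_tokens=256,
)
print(resp.choices[0].text)
```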
