
Support LoRA hotswapping and multiple LoRAs at a time #1817

Draft
richdougherty wants to merge 12 commits into main

Conversation

@richdougherty commented Oct 30, 2024

This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in ggerganov/llama.cpp#8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)

The list of changes from upstream in ggerganov/llama.cpp#8332 is:

  • Refactor lora API
  • Allow hot-swapping lora
  • Added struct llama_lora_adapter to keep track of loaded lora

This PR is just a draft to show what I'm working on and get some feedback on the API, approach, etc. I do plan on tidying it up, squashing commits, and going through all the different bits of code to check they all work. If there's anything you'd like me to do, please let me know!

For now I have something like this working:

import llama_cpp

# Basing off some of the models tested here:
# https://github.com/predibase/lora_bakeoff
model_file_path = '.../mistral-7b-v0.1.Q4_K_S.gguf'
adapter_file_paths = [
    '.../magicoder-lora-mistral-7b-v0.1.gguf',
    '.../conllpp-lora-mistral-7b-v0.1.gguf',
]

# Load the base model with every adapter registered but scaled to 0.0 (inactive)
llm = llama_cpp.Llama(
    model_path=model_file_path,
    lora_adapters={path: 0.0 for path in adapter_file_paths},
)
for adapter_file_path in adapter_file_paths:
    # Clear all adapters
    for lora_path in adapter_file_paths:
        llm.set_lora_adapter_scale(lora_path, 0.0)
    # Activate only the current adapter
    llm.set_lora_adapter_scale(adapter_file_path, 1.0)

    # 'task' holds the completion arguments for each benchmark task
    # (prompt, max_tokens, etc.), defined elsewhere
    completion = llm.create_completion(
        seed=42,
        temperature=0,
        **task
    )
    print(completion['choices'][0]['text'])

Tasks:

  • Basic low-level support - new LlamaLoraAdapter class, methods in LlamaContext (see the sketch after this list)
  • Updated to new APIs
  • Support loading multiple LoRAs and runtime hot-swapping in Llama - new lora_adapters param and set_lora_adapter_scaling method
  • Updated command line args to match upstream - support multiple --lora, remove --lora-base, add --lora-scaled
  • Test prefix caching with swapped LoRAs - prefix only applies with the same LoRA configuration
  • Test cache with swapped LoRAs - possible code added, not tested
  • Test disk cache
  • Test state saving
  • Test/update server configuration
  • Test low level chat API (existing code doesn't work so can't test properly but executes past LoRA setting)
  • General clean up and consistency
  • Support for selecting LoRAs via server endpoints (maybe later PR)
  • Squash commits
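
For anyone curious about the low-level layer, here is a rough sketch of how the pieces could fit together. It assumes the new bindings mirror the upstream C names from ggerganov/llama.cpp#8332 (llama_lora_adapter_init, llama_lora_adapter_set, llama_lora_adapter_remove, llama_lora_adapter_free); the LlamaLoraAdapter class wraps this, and the exact Python-side names are not final.

import llama_cpp

llama_cpp.llama_backend_init()

# Load the base model and create a context as usual.
model_params = llama_cpp.llama_model_default_params()
model = llama_cpp.llama_load_model_from_file(
    b'.../mistral-7b-v0.1.Q4_K_S.gguf', model_params)
ctx_params = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)

# An adapter is loaded once against the model...
adapter = llama_cpp.llama_lora_adapter_init(
    model, b'.../magicoder-lora-mistral-7b-v0.1.gguf')

# ...then attached to / detached from a context with a scale at runtime.
llama_cpp.llama_lora_adapter_set(ctx, adapter, 1.0)   # activate
llama_cpp.llama_lora_adapter_remove(ctx, adapter)     # deactivate

llama_cpp.llama_lora_adapter_free(adapter)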

@richdougherty (Author) commented Nov 2, 2024

Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
          "model_alias": "mistral",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "verbose": true
        },
        {
          "model_alias": "mistral-magicoder",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        },
        {
          "model_alias": "mistral-conllpp",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        }
    ]
}

Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.

Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size =    13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
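
As a quick illustration of the client side, switching between aliases might look like this. It is only a sketch using the openai Python package; the base URL, API key, and prompt are placeholders for whatever the server above is actually serving:

from openai import OpenAI

# Point the client at the llama-cpp-python server started with the config above.
# The API key is a placeholder; no real key is configured on the server here.
client = OpenAI(base_url='http://localhost:8080/v1', api_key='sk-no-key-required')

# Requesting a different model alias triggers the LoRA hot-swap on the shared base model.
for model_alias in ['mistral', 'mistral-magicoder', 'mistral-conllpp']:
    completion = client.completions.create(
        model=model_alias,
        prompt='def fibonacci(n):',
        max_tokens=64,
        temperature=0,
    )
    print(model_alias, completion.choices[0].text)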
