An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting an XLora*
architecture, and LoRA support by selecting the Lora*
architecture. For both X-LoRA and LoRA, an ordering file (see this section for preparing the ordering file) must be provided. The ordering file describes the ordering of layers and which adapters to use (and what order to use them in for X-LoRA).
When using an adapter model with a quantized base model, if the ordering file specifies unsupported layers you will receive an error.
Llama architecture:
- model.layers.{layer_idx}.self_attn.q_proj
- model.layers.{layer_idx}.self_attn.k_proj
- model.layers.{layer_idx}.self_attn.v_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.up_proj
- model.layers.{layer_idx}.mlp.down_proj
- model.layers.{layer_idx}.mlp.gate_proj
- lm_head
Phi 3 architecture:
- model.layers.{layer_idx}.self_attn.qkv_proj
- model.layers.{layer_idx}.self_attn.o_proj
- model.layers.{layer_idx}.mlp.gate_up_proj
- model.layers.{layer_idx}.mlp.down_proj
- lm_head
Preparing the X-LoRA/LoRA Ordering File
The X-LoRA/LoRA ordering file is necessary to prepare before inference with an X-LoRA model. However, it is easy with a provided script
!
An ordering JSON file for X-LoRA contains 2 major parts.
- The adapter names
order
- The order matters!
- Should be an array of strings which are the adapter names corresponding to the order the adapters were specified during training. For example, if the adapters were specified as a dictionary:
- The layer ordering
layers
- Automatically generated and should not be manipulated as it controls the application of scalings.
adapters = {
"math": ...,
"reasoning": ...,
"biology": ...
}
The specified order would be ["math", "reasoning", "biology"]
.
We provide an ordering file which contains the ordering for the X-LoRA model associated with the paper and the Huggingface repository: https://huggingface.co/lamm-mit/x-lora.
An ordering JSON file for LoRA contains 2 major parts:
- The adapter names
order
(optional):- The order does not matter
- Come controls which adapters will be initially activated
- If this key is not specified, then no adapters will be activated initially
- Preload adapter section
preload_adapters
(optional): see this section- Order does not matter
- Specifies the adapter name and the model ID to find them, which may be a local path.
There are 2 scripts to prepare the ordering file and which work for both X-LoRA and LoRA. The ordering file is specific to each architecture and set of target modules. Therefore, if either are changed, it is necessary to create a new ordering file using the first option. If only the adapter order or adapters changed, then it the second option should be used.
-
From scratch: No ordering file for the architecture and target modules
A script
create_ordering.py
is provided which prompts the user for the model ID, target modules, and adapter names. The user is prompted for an output file location, relative to the working directory. -
Create a new ordering file from an existing ordering file for an architecture and target modules
A script
set_names.py
is provided which prompts the user for the adapter names and the old ordering file. The user is prompted for an output file location, relative to the working directory.
Mistral.rs supports running quantized models with X-LoRA or LoRA. The X-LoRA or LoRA adapter layers will not be quantized, only the base model. P
In the X-LoRA case, please note that using a high quantization level (eg., 4-bit) can distort the signal and prevent the classifier from acting properly. Therefore, it is better to use slightly lower levels such as 8-bit.
The X-LoRA implementation supports non-granular scalings. This caches the scalings after k
completion tokens are generated and they will be used for the remaining passes avoiding the scaling pass. The number of tokens to generate before caching is defined by setting tgt_non_granular_index
. Setting tgt_non_granular_index
will restrict the maximum running sequences to 1.
Please see this page for more details and examples.
We support dynamic adapter activation for LoRA models, allowing you to activate a set of adapters at runtime. There is a Python, Rust and HTTP API:
To use this feature, you should add a preload_adapters
key to your ordering file:
{
"order": ["..."],
"layers": {"...": "123"},
"base_model_id": "...",
+ "preload_adapters": [{"name": "...", "adapter_model_id": "..."}] # New field here
}
This allows mistral.rs to preload the adapter and enable runtime activation.
We also provide a script to add this key to your existing order file: load_add_preload_adapters.py
.