Description
Name and Version
local system:
version: 5630 (4c763c8)
bug report:
version: 5631 (1f7d50b)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
./llama-server -m ./gemma-3-12b-it-Q4_K_M.gguf --threads 12 --ctx_size 50000 --n_gpu_layers 50 --offline --rope_freq_scale 0.9
Problem description & steps to reproduce
#include <iostream>

#include "llama.h"

int main() {
    // context parameters matching the llama-server command line above
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx           = 50000;
    ctx_params.n_threads       = 12;
    ctx_params.rope_freq_scale = 0.9f;
    //ctx_params.swa_full = false; // workaround: uncommenting this makes context creation succeed (see below)

    // model parameters: offload 50 layers to the GPU
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 50;

    llama_model* model = llama_model_load_from_file("gemma-3-12b-it-Q4_K_M.gguf", model_params);
    if (!model) {
        std::cout << "Failed to initialize llama model.";
        return -1;
    }

    llama_context* context = llama_init_from_model(model, ctx_params);
    if (!context) {
        std::cout << "Failed to initialize llama context.";
        llama_model_free(model);
        return -2;
    }

    std::cout << "context initialized";

    llama_free(context);
    llama_model_free(model);
    return 0;
}
fails with
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15630.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 16389242880
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
I can run the same model with the same parameters using llama-server from the same llama.cpp build with a much lower memory footprint, so I expected similar behavior when using libllama.so from C++ with most parameters left at default. Instead, with GPU acceleration the maximum usable context size is about 10000 instead of 50000, and forcing 0 GPU layers with a context size of 50000 still results in very large memory usage. Setting
ctx_params.swa_full = false;
brings memory usage in line with llama-server's, and model load and context creation then succeed. The model load messages of the code above and of llama-server are vastly different (see attached files). Since I suspected my local llama.cpp build might be at fault, I also tried to build from the llama.cpp git repo, with some checks for prerequisites, using this CMakeLists.txt; its outputs are attached as well. The locally installed llama.cpp and the source-built llama.cpp show identical behavior.
attached files:
- terminal_output_swa_full_true_50k.txt: output of the code above with swa_full = true and 50k context size
- terminal_output_swa_full_true_10k.txt: output of the code above with swa_full = true and 10k context size
- terminal_output_swa_full_false_50k.txt: output of the code above with swa_full = false and 50k context size
- terminal_output_llama_server_50k.txt: output of llama-server with the same model and parameters (see the command line above)
- cmake_output.txt: output produced by the CMakeLists.txt
- build_output.txt: resulting compilation output
models tested:
- gemma-3-12b-it-Q4_K_M.gguf
- gemma-3-12b-it-UD-Q6_K_XL.gguf
- link to download location
specs:
- Ryzen 5600X
- Radeon 9070
- 32GB RAM
I don't know how to debug or resolve this further. I was just trying to follow a llama.cpp tutorial and wanted to leave a comment with a workaround for others whose log inspection leads them to this thread. I opened a new issue as @ddh0 suggested. Please advise or point out errors as you see fit.