
Misc. bug: "llama_context_params::swa_full = true" causes very large RAM/VRAM usage #14123


Description

@k4ss4n

Name and Version

local system: version 5630 (4c763c8)
bug report: version 5631 (1f7d50b)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

./llama-server -m ./gemma-3-12b-it-Q4_K_M.gguf --threads 12 --ctx_size 50000 --n_gpu_layers 50 --offline --rope_freq_scale 0.9

Problem description & steps to reproduce

#include <iostream>
#include "llama.h"

int main(){
    // mirror the llama-server invocation above; everything else stays at its default
    llama_context_params ctx_params = llama_context_default_params();

    ctx_params.n_ctx = 50000;
    ctx_params.n_threads = 12;
    ctx_params.rope_freq_scale = 0.9;

    // workaround: uncommenting this line makes context creation succeed
    //ctx_params.swa_full = false;

    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 50;

    llama_model* model = llama_model_load_from_file("gemma-3-12b-it-Q4_K_M.gguf", model_params);

    if (!model) {
        std::cout << "Failed to initialize llama model.\n";
        return -1;
    }

    // fails here when swa_full is left at its default
    llama_context* context = llama_init_from_model(model, ctx_params);

    if (!context) {
        std::cout << "Failed to initialize llama context.\n";
        llama_model_free(model);
        return -2;
    }

    std::cout << "context initialized\n";

    llama_free(context);
    llama_model_free(model);

    return 0;
}

fails with

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15630.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 16389242880
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache

I can run the same model with the same parameters using llama-server from the same llama.cpp build with a much lower memory footprint, so I expected similar behavior when using libllama.so from C++ with most parameters left at their defaults. Instead, with GPU acceleration the maximum context size I can allocate is about 10000 instead of 50000, and forcing 0 GPU layers with a context size of 50000 leads to very large memory usage. Setting

ctx_params.swa_full = false;

causes memory usage to align with llama-server's, and both model load and context creation succeed. The model load messages from the above code and from llama-server are vastly different (see attached files). Since I suspected my local llama.cpp build might be at fault, I also tried building from the llama.cpp git repo with this CMakeLists.txt (which adds some prerequisite checks); those outputs are attached as well. The local system llama.cpp and the source-built llama.cpp behave identically.
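
To double-check which default a given libllama build actually uses, a minimal probe like the following (it only reads llama_context_default_params(); no model is loaded, and nothing beyond the fields already used above is assumed) should be enough:

#include <cstdio>
#include "llama.h"

int main() {
    // print the library defaults only; no model or context is created
    llama_context_params ctx_params = llama_context_default_params();

    std::printf("swa_full default: %s\n", ctx_params.swa_full ? "true" : "false");
    std::printf("n_ctx default:    %u\n", ctx_params.n_ctx);

    return 0;
}

If this prints true for swa_full, explicitly setting ctx_params.swa_full = false; before llama_init_from_model() is the workaround described above; for me it brought memory usage in line with llama-server.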

attached files:

models tested:

specs:

  • Ryzen 5600X
  • Radeon 9070
  • 32GB RAM

I don't know how to debug or resolve this further. I was just trying to follow a llama.cpp tutorial and wanted to leave a comment with a workaround for others whose logs lead them to this thread. I opened this as a new issue as @ddh0 suggested. Please advise or point out errors as you see fit.
