Description
Name and Version
local system:
version: 5630 (4c763c8)
bug report:
version: 5631 (1f7d50b)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
./llama-server -m ./gemma-3-12b-it-Q4_K_M.gguf --threads 12 --ctx_size 50000 --n_gpu_layers 50 --offline --rope_freq_scale 0.9
Problem description & steps to reproduce
#include <iostream>

#include "llama.h"

int main() {
    // context parameters matching the llama-server command line above
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx           = 50000;
    ctx_params.n_threads       = 12;
    ctx_params.rope_freq_scale = 0.9f;
    //ctx_params.swa_full = false; // workaround: uncommenting this makes context creation succeed (see below)

    // model parameters: offload 50 layers to the GPU
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 50;

    llama_model* model = llama_model_load_from_file("gemma-3-12b-it-Q4_K_M.gguf", model_params);
    if (!model) {
        std::cout << "Failed to initialize llama model.";
        return -1;
    }

    llama_context* context = llama_init_from_model(model, ctx_params);
    if (!context) {
        std::cout << "Failed to initialize llama context.";
        llama_model_free(model);
        return -2;
    }

    std::cout << "context initialized";

    llama_free(context);
    llama_model_free(model);
    return 0;
}
fails with
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15630.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 16389242880
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
I can run the same model with the same parameters using llama-server from the same llama.cpp build with a much lower memory footprint, so I expected similar behavior when using libllama.so from C++ with most parameters left at default. Instead, with GPU acceleration the maximum usable context size is about 10000 instead of 50000, and forcing 0 GPU layers with a context size of 50000 still results in very large memory usage. Setting
ctx_params.swa_full = false;
brings memory usage in line with llama-server's, and model load and context creation then succeed. The model load messages of the code above and of llama-server are vastly different (see attached files). Since I suspected my local llama.cpp build might be at fault, I also tried to build from the llama.cpp git repo, with some checks for prerequisites, using this CMakeLists.txt; its outputs are attached as well. The locally installed llama.cpp and the source-built llama.cpp show identical behavior.
attached files:
- terminal_output_swa_full_true_50k.txt: output of the code above with swa_full = true and 50k context size
- terminal_output_swa_full_true_10k.txt: output of the code above with swa_full = true and 10k context size
- terminal_output_swa_full_false_50k.txt: output of the code above with swa_full = false and 50k context size
- terminal_output_llama_server_50k.txt: output of llama-server with the same model and parameters (see the command line above)
- cmake_output.txt: output produced by the CMakeLists.txt
- build_output.txt: resulting compilation output
models tested:
- gemma-3-12b-it-Q4_K_M.gguf
- gemma-3-12b-it-UD-Q6_K_XL.gguf
- link to download location
specs:
- Ryzen 5600X
- Radeon 9070
- 32GB RAM
I don't know how to debug or resolve this further. I was just trying to follow a llama.cpp tutorial and wanted to leave a comment with a workaround for others whose log inspection leads them to this thread. I opened a new issue as @ddh0 suggested. Please advise or point out errors as you see fit.