Name and Version
llama-server --version
version: 8661 (b7ad48e)
Operating systems
Linux
GGML backends
CUDA
Hardware
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Models
ggml-org/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-Q4_K_M.gguf
Problem description & steps to reproduce
I run:
script to run llama.cpp: gemma-4-26B-A4B-it-q4_k_m.sh
#!/bin/bash
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
export CUDA_VISIBLE_DEVICES=2
LLAMA_SERVER_BIN="/storage/llm/llama.cpp/build/bin/llama-server"
MODEL_PATH="/storage/llm/models/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-Q4_K_M.gguf"
MMPROJ_PATH="/storage/llm/models/gemma-4-26B-A4B-it-GGUF/mmproj-gemma-4-26B-A4B-it-f16.gguf"
exec "$LLAMA_SERVER_BIN"
-m "$MODEL_PATH"
--mmproj "$MMPROJ_PATH"
--alias gemma-4-26b
--host 0.0.0.0
--port 8001
-np 1
-ngl 99
-fa on
-c 32768
-ctk q8_0
-ctv q8_0
-b 2048
--no-mmap
--no-warmup
It is added in litellm - output from gui
{
"input_cost_per_token": 0,
"output_cost_per_token": 0,
"api_base": "http://127.0.0.1:8001/v1",
"custom_llm_provider": "openai",
"use_in_pass_through": false,
"use_litellm_proxy": false,
"merge_reasoning_content_in_choices": false,
"tags": [],
"model": "gemma-4-26b",
"guardrails": [],
"vector_store_ids": []
}
model used trough litellm in Open WebUI ‧ v0.8.12
First Bad Commit
No response
Relevant log output
Logs
llama.cpp_gemma-4-26B-A4B-it-q4_k_m.txt
Name and Version
llama-server --version
version: 8661 (b7ad48e)
Operating systems
Linux
GGML backends
CUDA
Hardware
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Models
ggml-org/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-Q4_K_M.gguf
Problem description & steps to reproduce
I run:
script to run llama.cpp: gemma-4-26B-A4B-it-q4_k_m.sh
#!/bin/bash
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
export CUDA_VISIBLE_DEVICES=2
LLAMA_SERVER_BIN="/storage/llm/llama.cpp/build/bin/llama-server"
MODEL_PATH="/storage/llm/models/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-Q4_K_M.gguf"
MMPROJ_PATH="/storage/llm/models/gemma-4-26B-A4B-it-GGUF/mmproj-gemma-4-26B-A4B-it-f16.gguf"
exec "$LLAMA_SERVER_BIN"
-m "$MODEL_PATH"
--mmproj "$MMPROJ_PATH"
--alias gemma-4-26b
--host 0.0.0.0
--port 8001
-np 1
-ngl 99
-fa on
-c 32768
-ctk q8_0
-ctv q8_0
-b 2048
--no-mmap
--no-warmup
It is added in litellm - output from gui
{
"input_cost_per_token": 0,
"output_cost_per_token": 0,
"api_base": "http://127.0.0.1:8001/v1",
"custom_llm_provider": "openai",
"use_in_pass_through": false,
"use_litellm_proxy": false,
"merge_reasoning_content_in_choices": false,
"tags": [],
"model": "gemma-4-26b",
"guardrails": [],
"vector_store_ids": []
}
model used trough litellm in Open WebUI ‧ v0.8.12
First Bad Commit
No response
Relevant log output
Logs
llama.cpp_gemma-4-26B-A4B-it-q4_k_m.txt