
Conversation

avion23 commented Jan 1, 2026

The bindings were five months out of date, which prevented newer model architectures from loading.

Updates bindings to llama.cpp commit be47fb92 (2026-01-01).

Removed

  • 14 llama_kv_self_* functions (use llama_memory_* API)
  • llama_sampler_init_softmax()

Added

Enums:

  • LLAMA_ROPE_TYPE_IMROPE
  • llama_flash_attn_type
  • llama_params_fit_status
  • llama_model_meta_key

Struct fields:

  • llama_model_params: no_host, no_alloc
  • llama_context_params: flash_attn_type (replaced flash_attn bool)

Functions:

  • llama_max_tensor_buft_overrides
  • llama_n_ctx_seq
  • llama_model_n_embd_inp
  • llama_model_is_hybrid
  • llama_flash_attn_type_name
  • llama_model_meta_key_str
  • llama_adapter_meta_* (5 functions)
  • llama_log_get, llama_log_set
  • llama_memory_breakdown_print

Breaking Changes

flash_attn parameter:

# Old
params.flash_attn = True
# New
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED
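
A minimal migration sketch using the low-level ctypes bindings; it assumes llama_context_default_params and the LLAMA_FLASH_ATTN_TYPE_* constants are exported from the llama_cpp module as named in this PR:

# Sketch, not a drop-in: names follow this PR's bindings and may differ slightly.
import llama_cpp

params = llama_cpp.llama_context_default_params()
# Old: params.flash_attn = True
params.flash_attn_type = llama_cpp.LLAMA_FLASH_ATTN_TYPE_ENABLED
# LLAMA_FLASH_ATTN_TYPE_AUTO lets llama.cpp decide; LLAMA_FLASH_ATTN_TYPE_DISABLED turns it off.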

KV cache API:

# Old
llama_kv_self_clear(ctx)
# New
llama_memory_clear(mem, data=True)
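
The memory handle comes from the context. A hedged sketch, assuming llama_get_memory(ctx) is bound the same way it appears in upstream llama.h:

# Sketch: llama_get_memory(ctx) returns the llama_memory_t handle for an existing context.
import llama_cpp

mem = llama_cpp.llama_get_memory(ctx)    # replaces passing ctx to llama_kv_self_* calls
llama_cpp.llama_memory_clear(mem, True)  # True clears the data buffers as well as the metadata
# Other llama_kv_self_* calls map the same way, e.g.
# llama_kv_self_seq_rm(ctx, ...) -> llama_memory_seq_rm(mem, ...)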

Other

  • Added ggml_log_callback typedef (see the sketch after this list)
  • Fixed LLAVA/mtmd build (set LLAMA_INSTALL_VERSION before subdirectory include)
  • Version 0.3.16 → 0.4.0
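
A sketch of routing llama.cpp logs into Python via the new bindings; it assumes the typedef is exported as llama_cpp.ggml_log_callback and that llama_log_set keeps its upstream signature (callback, user_data):

# Assumed export name per this PR's ggml_log_callback typedef; the upstream signature is
# void (*)(enum ggml_log_level level, const char * text, void * user_data).
import llama_cpp

def _on_log(level, text, user_data):
    print(text.decode("utf-8"), end="")

_log_cb = llama_cpp.ggml_log_callback(_on_log)  # keep a reference so it is not garbage-collected
llama_cpp.llama_log_set(_log_cb, None)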

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B

avion23 marked this pull request as draft on January 1, 2026, 19:40
- Update llama.cpp submodule (2025-08-14 → 2026-01-01)
- Remove deprecated KV cache functions (use llama_memory_* instead)
- Remove llama_sampler_init_softmax (deprecated)
- Add LLAMA_ROPE_TYPE_IMROPE constant
- Add llama_flash_attn_type enum (AUTO/DISABLED/ENABLED)
- Add llama_params_fit_status enum
- Add llama_model_meta_key enum for sampling metadata
- Add llama_model_params fields: no_host, no_alloc
- Replace llama_context_params.flash_attn bool with flash_attn_type enum
- Add 15 new API functions:
  - llama_max_tensor_buft_overrides
  - llama_n_ctx_seq
  - llama_model_n_embd_inp
  - llama_model_is_hybrid
  - llama_flash_attn_type_name
  - llama_model_meta_key_str
  - llama_adapter_meta_* functions (5)
  - llama_log_get/set
  - llama_memory_breakdown_print
- Add ggml_log_callback typedef
- Disable LLAVA build (CMake incompatibility in upstream mtmd)
- Bump version 0.3.16 → 0.4.0

Breaking changes:
- flash_attn bool removed, use flash_attn_type enum
- KV cache functions removed, use llama_memory_* API

Tested with Nemotron-3-Nano-30B hybrid model.
avion23 force-pushed the update-llama-cpp-2026-01 branch from 502532a to 23c10e8 on January 1, 2026, 19:50
avion23 marked this pull request as ready for review on January 1, 2026, 19:52
avion23 (Author) commented Jan 1, 2026

Tested on macOS with:

CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages
