
Conversation

avion23 commented Jan 1, 2026

The bindings were five months out of date, which prevented newer model architectures from loading.

Updates bindings to llama.cpp commit be47fb92 (2026-01-01).

Removed

  • 14 llama_kv_self_* functions (use llama_memory_* API)
  • llama_sampler_init_softmax()

Added

Enums:

  • LLAMA_ROPE_TYPE_IMROPE
  • llama_flash_attn_type
  • llama_params_fit_status
  • llama_model_meta_key

Struct fields:

  • llama_model_params: no_host, no_alloc
  • llama_context_params: flash_attn_type (replaced flash_attn bool)

Functions:

  • llama_max_tensor_buft_overrides
  • llama_n_ctx_seq
  • llama_model_n_embd_inp
  • llama_model_is_hybrid
  • llama_flash_attn_type_name
  • llama_model_meta_key_str
  • llama_adapter_meta_* (5 functions)
  • llama_log_get, llama_log_set
  • llama_memory_breakdown_print

Breaking Changes

flash_attn parameter:

# Old
params.flash_attn = True
# New
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED
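
A minimal migration sketch using the low-level ctypes bindings; it assumes llama_context_default_params and the LLAMA_FLASH_ATTN_TYPE_* constants are exported from the llama_cpp module as named in this PR:

# Sketch, not a drop-in: names follow this PR's bindings and may differ slightly.
import llama_cpp

params = llama_cpp.llama_context_default_params()
# Old: params.flash_attn = True
params.flash_attn_type = llama_cpp.LLAMA_FLASH_ATTN_TYPE_ENABLED
# LLAMA_FLASH_ATTN_TYPE_AUTO lets llama.cpp decide; LLAMA_FLASH_ATTN_TYPE_DISABLED turns it off.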

KV cache API:

# Old
llama_kv_self_clear(ctx)
# New
llama_memory_clear(mem, data=True)
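
The memory handle comes from the context. A hedged sketch, assuming llama_get_memory(ctx) is bound the same way it appears in upstream llama.h:

# Sketch: llama_get_memory(ctx) returns the llama_memory_t handle for an existing context.
import llama_cpp

mem = llama_cpp.llama_get_memory(ctx)    # replaces passing ctx to llama_kv_self_* calls
llama_cpp.llama_memory_clear(mem, True)  # True clears the data buffers as well as the metadata
# Other llama_kv_self_* calls map the same way, e.g.
# llama_kv_self_seq_rm(ctx, ...) -> llama_memory_seq_rm(mem, ...)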

Other

  • Added ggml_log_callback typedef (see the sketch after this list)
  • Fixed LLAVA/mtmd build (set LLAMA_INSTALL_VERSION before subdirectory include)
  • Version 0.3.16 → 0.4.0
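
A sketch of routing llama.cpp logs into Python via the new bindings; it assumes the typedef is exported as llama_cpp.ggml_log_callback and that llama_log_set keeps its upstream signature (callback, user_data):

# Assumed export name per this PR's ggml_log_callback typedef; the upstream signature is
# void (*)(enum ggml_log_level level, const char * text, void * user_data).
import llama_cpp

def _on_log(level, text, user_data):
    print(text.decode("utf-8"), end="")

_log_cb = llama_cpp.ggml_log_callback(_on_log)  # keep a reference so it is not garbage-collected
llama_cpp.llama_log_set(_log_cb, None)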

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B

avion23 marked this pull request as draft on January 1, 2026, 19:40
- Update llama.cpp submodule (2025-08-14 → 2026-01-01)
- Remove deprecated KV cache functions (use llama_memory_* instead)
- Remove llama_sampler_init_softmax (deprecated)
- Add LLAMA_ROPE_TYPE_IMROPE constant
- Add llama_flash_attn_type enum (AUTO/DISABLED/ENABLED)
- Add llama_params_fit_status enum
- Add llama_model_meta_key enum for sampling metadata
- Add llama_model_params fields: no_host, no_alloc
- Replace llama_context_params.flash_attn bool with flash_attn_type enum
- Add 15 new API functions:
  - llama_max_tensor_buft_overrides
  - llama_n_ctx_seq
  - llama_model_n_embd_inp
  - llama_model_is_hybrid
  - llama_flash_attn_type_name
  - llama_model_meta_key_str
  - llama_adapter_meta_* functions (5)
  - llama_log_get/set
  - llama_memory_breakdown_print
- Add ggml_log_callback typedef
- Disable LLAVA build (CMake incompatibility in upstream mtmd)
- Bump version 0.3.16 → 0.4.0

Breaking changes:
- flash_attn bool removed, use flash_attn_type enum
- KV cache functions removed, use llama_memory_* API

Tested with Nemotron-3-Nano-30B hybrid model.
avion23 force-pushed the update-llama-cpp-2026-01 branch from 502532a to 23c10e8 on January 1, 2026, 19:50
avion23 marked this pull request as ready for review on January 1, 2026, 19:52
avion23 (Author) commented Jan 1, 2026

Tested on macOS with:

CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages
