bernardladenthin · May 24, 2026
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9284**
+Current llama.cpp pinned version: **b9297**
 
 ## Upgrading CUDA Version
 
@@ -399,6 +399,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9279–b9284 | `tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt` | Each `*-impl` target switched from `add_library(... STATIC ...)` to default library type (becomes SHARED when `BUILD_SHARED_LIBS=ON`); added `WINDOWS_EXPORT_ALL_SYMBOLS ON` and conditional `install(TARGETS ... LIBRARY)` under `LLAMA_TOOLS_INSTALL`. Project doesn't enable `LLAMA_BUILD_TOOLS`, so none of these targets are configured — no impact |
 | ~b9279–b9284 | `src/llama-vocab.cpp` + `conversion/base.py` | HybridDNA tokenizer fix: k-mers are now stored in `token_to_id` with a reserved `\xee\x80\x80` (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. `CCCCCC`); the suffix is stripped from `id_to_token` text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required |
 | ~b9279–b9284 | `ggml/src/ggml-cuda/common.cuh` | PDL-launch gating now uses `ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER` instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required |
+| ~b9284–b9297 | upstream `CMakeLists.txt` | `LLAMA_BUILD_APP` default reverted from `ON` back to `${LLAMA_STANDALONE}` (i.e. OFF for FetchContent consumers). Project's `set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)` shim is now redundant but harmless; kept as a defensive pin against future default flips |
+| ~b9284–b9297 | `common/chat.h` + `tools/server/server-task.cpp` | New additive `common_chat_parser_params::is_continuation` field (default `false`); `params_from_json_cmpl` now parses the `continue_final_message` request field via `common_chat_continuation_parse()` and sets `is_continuation` when the result is non-`NONE`. `task_result_state` ctor guard tightened: the empty-prefill `chat_msg = common_chat_parse("", true, ...)` initialization is now gated on `is_continuation && !echo` (was just `!echo`) — i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java `InferenceParameters.setContinueFinalMessage(boolean\|ContinuationMode)` already writes `continue_final_message` to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it |
+| ~b9284–b9297 | `src/llama-model.{h,cpp}` + `src/models/qwen35.cpp` + `src/models/qwen35moe.cpp` | NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: `llama_layer_nextn` gains four scale fields (`eh_proj_s`, `eh_proj_in_s`, `shared_head_head_s`, `shared_head_head_in_s`); `load_tensors()` loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into `build_lora_mm()`. Internal model-loading + graph-building changes, no project changes required |
+| ~b9284–b9297 | `ggml/src/ggml-backend.cpp` | Bug fix in `ggml_backend_tensor_get_2d_async`: fast-path condition checked `iface.set_tensor_2d_async == NULL` (typo) instead of `iface.get_tensor_2d_async == NULL`; multi-copy gets now correctly fall back when the backend lacks `get_tensor_2d_async`. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required |
+| ~b9284–b9297 | `ggml/src/ggml-opencl/` (`ggml-opencl.cpp` + 17 kernel files) | Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the `ne01` bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when `ne01 % 64 != 0`. Fixed by: (1) rounding `global_size[0]` up to the next multiple of 64 in `ggml_cl_mul_mat_id`, (2) moving the per-thread `ne01` early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from `ne01 % 64 == 0` to `ne01 % 32 == 0` in `use_adreno_moe_kernels`. Internal OpenCL backend, affects the `opencl-android-aarch64` classifier build only — no project source changes |
+| ~b9284–b9297 | `ggml/src/ggml-sycl/` (`ggml-sycl.cpp`, `dmmv.cpp`, `gated_delta_net.cpp`, `common.hpp`) | (1) BF16 added to `ggml_sycl_supports_dmmv()` and `can_use_dequantize_mul_mat_vec()`; new `convert_mul_mat_vec_bf16_sycl` path. (2) Level Zero auto-detect moved into `ggml_sycl_init()` — `info.ext_oneapi_level_zero` flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for `GGML_SYCL_ENABLE_LEVEL_ZERO` env. (3) `mmid_counting_sort_rows()` replaces the per-expert atomic scan in `ggml_sycl_mul_mat_id` — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with `keep_rs_t` template parameter and per-token snapshot writes when `K > 1`, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required |
+| ~b9284–b9297 | `ggml/src/ggml-vulkan/CMakeLists.txt` | `find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
+| ~b9284–b9297 | `ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`) | ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
+| ~b9284–b9297 | `tools/perplexity/perplexity.cpp` | `log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
 
 ## Build Commands
 
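Reviewer note on the SYCL row in the table above: it describes grouping MoE rows into expert-contiguous slices with a counting sort instead of scanning once per expert. The following is a minimal Java sketch of that general technique only, not the upstream `mmid_counting_sort_rows()` code. The class name, the `Grouped` holder, and the assumption that the input is one expert id per row are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration: group row indices by expert id with a counting sort. */
public class ExpertRowGrouping {

    /** Grouped row indices plus the start offset of each expert's slice. */
    public static final class Grouped {
        public final int[] rows;     // row indices, contiguous per expert
        public final int[] offsets;  // rows[offsets[e] .. offsets[e + 1]) belong to expert e

        Grouped(int[] rows, int[] offsets) {
            this.rows = rows;
            this.offsets = offsets;
        }
    }

    /** Assumes every entry of expertOfRow is in [0, nExperts). */
    public static Grouped group(int[] expertOfRow, int nExperts) {
        int[] offsets = new int[nExperts + 1];

        // Pass 1: count rows per expert, stored one slot to the right.
        for (int e : expertOfRow) {
            offsets[e + 1]++;
        }
        // Prefix sum turns the counts into slice start offsets.
        for (int e = 0; e < nExperts; e++) {
            offsets[e + 1] += offsets[e];
        }

        // Pass 2: scatter each row into its expert's slice, preserving row order.
        int[] cursor = Arrays.copyOf(offsets, nExperts);
        int[] rows = new int[expertOfRow.length];
        for (int r = 0; r < expertOfRow.length; r++) {
            rows[cursor[expertOfRow[r]]++] = r;
        }
        return new Grouped(rows, offsets);
    }

    public static void main(String[] args) {
        Grouped g = group(new int[] {2, 0, 2, 1, 0}, 3);
        System.out.println(Arrays.toString(g.rows));    // [1, 4, 3, 0, 2]
        System.out.println(Arrays.toString(g.offsets)); // [0, 2, 3, 5]
    }
}
```

The sketch runs in time linear in rows plus experts, whereas a scan per expert costs rows times experts. How the upstream code lays out its buffers or hands the slices to the device is not shown here.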

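Reviewer note on two arithmetic fixes listed in the changelog table (rounding `global_size[0]` up to a multiple of 64 in the OpenCL row, and widening `n_ctx * nv` in the perplexity row): the sketch below shows the same two patterns in Java, as an analogy only. The class and method names are hypothetical, and it assumes both operands are 32-bit integers. In Java an `int` product wraps silently, whereas in C++ signed overflow is undefined behaviour, so the symptoms upstream may differ.

```java
/** Hypothetical illustration of two integer-arithmetic patterns. */
public class SizeArithmetic {

    /** Rounds n up to the next multiple of m; assumes n >= 0, m > 0, no overflow. */
    public static int roundUp(int n, int m) {
        return ((n + m - 1) / m) * m;
    }

    /** 32-bit product: the result wraps when it exceeds the int range. */
    public static int narrowProduct(int nCtx, int nv) {
        return nCtx * nv;
    }

    /** Widening one operand first makes the multiplication happen in 64 bits. */
    public static long wideProduct(int nCtx, int nv) {
        return (long) nCtx * nv;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(100, 64));              // 128
        System.out.println(roundUp(128, 64));              // 128
        System.out.println(narrowProduct(65536, 65536));   // 0 (wrapped)
        System.out.println(wideProduct(65536, 65536));     // 4294967296
    }
}
```

The cast has to be applied to an operand, not to the finished product: `(long) (nCtx * nv)` would widen a value that has already wrapped.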
@@ -110,7 +110,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9284
+	GIT_TAG        b9297
 )
 FetchContent_MakeAvailable(llama.cpp)
 

@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9284](https://img.shields.io/badge/llama.cpp-%23b9284-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9284)  
+[![llama.cpp b9297](https://img.shields.io/badge/llama.cpp-%23b9297-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9297)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)