11 changes: 10 additions & 1 deletion CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9284**
+Current llama.cpp pinned version: **b9297**

## Upgrading CUDA Version

@@ -399,6 +399,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
| ~b9279–b9284 | `tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt` | Each `*-impl` target switched from `add_library(... STATIC ...)` to default library type (becomes SHARED when `BUILD_SHARED_LIBS=ON`); added `WINDOWS_EXPORT_ALL_SYMBOLS ON` and conditional `install(TARGETS ... LIBRARY)` under `LLAMA_TOOLS_INSTALL`. Project doesn't enable `LLAMA_BUILD_TOOLS`, so none of these targets are configured — no impact |
| ~b9279–b9284 | `src/llama-vocab.cpp` + `conversion/base.py` | HybridDNA tokenizer fix: k-mers are now stored in `token_to_id` with a reserved `\xee\x80\x80` (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. `CCCCCC`); the suffix is stripped from `id_to_token` text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required |
| ~b9279–b9284 | `ggml/src/ggml-cuda/common.cuh` | PDL-launch gating now uses `ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER` instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required |
| ~b9284–b9297 | upstream `CMakeLists.txt` | `LLAMA_BUILD_APP` default reverted from `ON` back to `${LLAMA_STANDALONE}` (i.e. OFF for FetchContent consumers). Project's `set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)` shim is now redundant but harmless; kept as defensive pin against future flips |
| ~b9284–b9297 | `common/chat.h` + `tools/server/server-task.cpp` | New additive `common_chat_parser_params::is_continuation` field (default `false`); `params_from_json_cmpl` now parses the `continue_final_message` request field via `common_chat_continuation_parse()` and sets `is_continuation` when the result is non-`NONE`. `task_result_state` ctor guard tightened: the empty-prefill `chat_msg = common_chat_parse("", true, ...)` initialization is now gated on `is_continuation && !echo` (was just `!echo`) — i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java `InferenceParameters.setContinueFinalMessage(boolean\|ContinuationMode)` already writes `continue_final_message` to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it |
| ~b9284–b9297 | `src/llama-model.{h,cpp}` + `src/models/qwen35.cpp` + `src/models/qwen35moe.cpp` | NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: `llama_layer_nextn` gains four scale fields (`eh_proj_s`, `eh_proj_in_s`, `shared_head_head_s`, `shared_head_head_in_s`); `load_tensors()` loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into `build_lora_mm()`. Internal model-loading + graph-building changes, no project changes required |
| ~b9284–b9297 | `ggml/src/ggml-backend.cpp` | Bug fix in `ggml_backend_tensor_get_2d_async`: fast-path condition checked `iface.set_tensor_2d_async == NULL` (typo) instead of `iface.get_tensor_2d_async == NULL`; multi-copy gets now correctly fall back when the backend lacks `get_tensor_2d_async`. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required |
| ~b9284–b9297 | `ggml/src/ggml-opencl/` (`ggml-opencl.cpp` + 17 kernel files) | Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the `ne01` bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when `ne01 % 64 != 0`. Fixed by: (1) rounding `global_size[0]` up to the next multiple of 64 in `ggml_cl_mul_mat_id`, (2) moving the per-thread `ne01` early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from `ne01 % 64 == 0` to `ne01 % 32 == 0` in `use_adreno_moe_kernels`. Internal OpenCL backend, affects the `opencl-android-aarch64` classifier build only — no project source changes |
| ~b9284–b9297 | `ggml/src/ggml-sycl/` (`ggml-sycl.cpp`, `dmmv.cpp`, `gated_delta_net.cpp`, `common.hpp`) | (1) BF16 added to `ggml_sycl_supports_dmmv()` and `can_use_dequantize_mul_mat_vec()`; new `convert_mul_mat_vec_bf16_sycl` path. (2) Level Zero auto-detect moved into `ggml_sycl_init()` — `info.ext_oneapi_level_zero` flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for `GGML_SYCL_ENABLE_LEVEL_ZERO` env. (3) `mmid_counting_sort_rows()` replaces the per-expert atomic scan in `ggml_sycl_mul_mat_id` — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with `keep_rs_t` template parameter and per-token snapshot writes when `K > 1`, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required |
| ~b9284–b9297 | `ggml/src/ggml-vulkan/CMakeLists.txt` | `find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
| ~b9284–b9297 | `ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`) | ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
| ~b9284–b9297 | `tools/perplexity/perplexity.cpp` | `log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
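
The `perplexity.cpp` row above describes widening a product to `size_t` before multiplying. A minimal sketch of that pattern follows, assuming a 64-bit `size_t`; the helper names `element_count` and `overflows_int32` are hypothetical and do not appear in llama.cpp. Only the `size_t(n_ctx) * nv` idea comes from the upstream change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of llama.cpp): number of elements needed for
// an n_ctx x nv buffer. Each operand is widened to size_t BEFORE the
// multiplication, so the product is computed in size_t rather than in int.
std::size_t element_count(int n_ctx, int nv) {
    assert(n_ctx >= 0 && nv >= 0);  // negative sizes would wrap when cast
    return static_cast<std::size_t>(n_ctx) * static_cast<std::size_t>(nv);
}

// Hypothetical helper: reports whether n_ctx * nv would exceed the range of a
// 32-bit signed int. The check itself is done in 64-bit arithmetic so it
// never triggers signed overflow (which is undefined behaviour in C++).
bool overflows_int32(int n_ctx, int nv) {
    const std::int64_t wide = static_cast<std::int64_t>(n_ctx) * nv;
    return wide > std::numeric_limits<std::int32_t>::max();
}
```

For example, `element_count(65536, 65536)` yields 4294967296 on a 64-bit platform, a value that a plain `int` multiplication could not represent. `overflows_int32(4096, 32000)` is false, because 131072000 still fits in 32 bits.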

## Build Commands

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -110,7 +110,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9284
+GIT_TAG b9297
)
FetchContent_MakeAvailable(llama.cpp)

2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
**Build:**
![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)
![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)
-[![llama.cpp b9284](https://img.shields.io/badge/llama.cpp-%23b9284-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9284)
+[![llama.cpp b9297](https://img.shields.io/badge/llama.cpp-%23b9297-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9297)
[![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)
[![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)
