
Upgrade to llama.cpp b9022: vision, reasoning, and speculative decoding fixes #102

Merged
bernardladenthin merged 10 commits into master from claude/determined-volta-T8AoQ
May 5, 2026


@bernardladenthin

Summary

Upgrade the llama.cpp binding to version b9022, adding support for vision models (multimodal projection), reasoning/thinking models (DeepSeek-R1, QwQ), and fixing speculative decoding parameter names to match upstream API changes.

Key Changes

Vision Model Support

  • Added setMmproj(String) and setMmprojUrl(String) to ModelParameters for specifying multimodal projection files
  • Added enableMmprojAuto() and enableMmprojOffload() flags for automatic mmproj detection and GPU offloading
  • Added corresponding MMPROJ_AUTO and MMPROJ_OFFLOAD flags to ModelFlag enum
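
As a rough illustration of how the new setters might translate into llama.cpp arguments, here is a minimal stand-in builder. `MmprojArgsSketch` is hypothetical, and the literal flag strings (`--mmproj`, `--mmproj-url`, `--mmproj-auto`, `--mmproj-offload`) are assumptions inferred from the method and enum names listed above, not copied from the binding's source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for ModelParameters; it only models the mmproj options. */
final class MmprojArgsSketch {
    private final List<String> args = new ArrayList<>();

    /** Local multimodal projection file (flag name assumed). */
    MmprojArgsSketch setMmproj(String path) {
        args.add("--mmproj");
        args.add(path);
        return this;
    }

    /** Remote multimodal projection file (flag name assumed). */
    MmprojArgsSketch setMmprojUrl(String url) {
        args.add("--mmproj-url");
        args.add(url);
        return this;
    }

    /** Boolean flag, mirrors ModelFlag.MMPROJ_AUTO (flag name assumed). */
    MmprojArgsSketch enableMmprojAuto() {
        args.add("--mmproj-auto");
        return this;
    }

    /** Boolean flag, mirrors ModelFlag.MMPROJ_OFFLOAD (flag name assumed). */
    MmprojArgsSketch enableMmprojOffload() {
        args.add("--mmproj-offload");
        return this;
    }

    /** Returns the collected arguments in call order. */
    List<String> toArgs() {
        return Collections.unmodifiableList(new ArrayList<>(args));
    }
}
```

Each setter returns `this`, which matches the builder convention the PR says all new methods follow.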

Reasoning/Thinking Model Support

  • New ReasoningFormat enum with values: NONE, AUTO, DEEPSEEK, DEEPSEEK_LEGACY
  • Added setReasoningFormat(ReasoningFormat) to ModelParameters (model-level default)
  • Added setReasoningFormat(ReasoningFormat) to InferenceParameters (per-request override)
  • Added setReasoningBudget(int) to ModelParameters for default token budget
  • Added setReasoningBudgetTokens(int) to InferenceParameters for per-request budget
  • Added extractChoiceReasoningContent() methods to ChatResponseParser to extract thinking tokens from responses

Speculative Decoding Parameter Fixes

  • Renamed speculative decoding parameters to match upstream llama.cpp b9022 API:
    • --draft-max → --spec-draft-n-max
    • --draft-min → --spec-draft-n-min
    • --draft-p-min → --spec-draft-p-min
    • --ctx-size-draft → --spec-draft-ctx-size
    • --device-draft → --spec-draft-device
    • --gpu-layers-draft → --spec-draft-ngl
    • --model-draft → --spec-draft-model
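
For callers that assemble raw argument lists themselves, the seven renames in this section can be applied mechanically. The sketch below is one possible way to do that; `SpecDraftFlagMigration` and `migrate` are hypothetical names for illustration and are not part of the binding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: rewrites legacy speculative-decoding flags to the b9022 names. */
final class SpecDraftFlagMigration {
    /** Old flag name to new flag name, exactly as listed in this PR. */
    static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("--draft-max", "--spec-draft-n-max");
        RENAMES.put("--draft-min", "--spec-draft-n-min");
        RENAMES.put("--draft-p-min", "--spec-draft-p-min");
        RENAMES.put("--ctx-size-draft", "--spec-draft-ctx-size");
        RENAMES.put("--device-draft", "--spec-draft-device");
        RENAMES.put("--gpu-layers-draft", "--spec-draft-ngl");
        RENAMES.put("--model-draft", "--spec-draft-model");
    }

    /**
     * Returns a copy of args in which every token that exactly equals a
     * legacy flag is replaced by its new name; all other tokens are kept.
     */
    static List<String> migrate(List<String> args) {
        List<String> out = new ArrayList<>(args.size());
        for (String arg : args) {
            out.add(RENAMES.getOrDefault(arg, arg));
        }
        return out;
    }
}
```

Matching is by exact token, so flag values such as file paths pass through unchanged unless they happen to equal a legacy flag name.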

Cache Idle Slots Flag Correction

  • Fixed setClearIdle() to use correct flag names:
    • --clear-idle → --cache-idle-slots
    • --no-clear-idle → --no-cache-idle-slots

Sampling Enhancements

  • Added setTopNSigma(float) to InferenceParameters for top-n-sigma sampling threshold
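
For readers unfamiliar with the technique: top-n-sigma sampling, as it is commonly described, keeps only tokens whose logit lies within n standard deviations of the largest logit. The sketch below is written from that description rather than from the llama.cpp sampler source. The class name is hypothetical, and both the use of the population standard deviation and the treatment of a non-positive n as "disabled" are assumptions.

```java
import java.util.Arrays;

/** Hypothetical illustration of a top-n-sigma filter over raw logits. */
final class TopNSigmaSketch {
    /**
     * Returns a mask where true means the token survives the filter:
     * logit >= max(logits) - n * stddev(logits).
     */
    static boolean[] keepMask(float[] logits, float n) {
        boolean[] keep = new boolean[logits.length];
        if (logits.length == 0) {
            return keep;
        }
        if (n <= 0f) {
            // Assumption: a non-positive threshold disables the filter.
            Arrays.fill(keep, true);
            return keep;
        }
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (float logit : logits) {
            max = Math.max(max, logit);
            sum += logit;
        }
        double mean = sum / logits.length;
        double squared = 0.0;
        for (float logit : logits) {
            double d = logit - mean;
            squared += d * d;
        }
        // Population standard deviation (assumed).
        double sigma = Math.sqrt(squared / logits.length);
        double threshold = max - n * sigma;
        for (int i = 0; i < logits.length; i++) {
            keep[i] = logits[i] >= threshold;
        }
        return keep;
    }
}
```

Because the threshold can never exceed the maximum logit, the most likely token always survives.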

Server Test Updates

  • Updated C++ server tests to reflect b9022 behavior: task_params::to_json() now only serializes speculative.type (not n_max, n_min, p_min)
  • Added tests for reasoning_budget_tokens parameter parsing

Implementation Details

  • All new methods follow existing builder pattern conventions
  • ReasoningFormat implements CliArg interface for consistent parameter handling
  • Reasoning format values are JSON-quoted in inference parameters but use plain strings in model parameters
  • Comprehensive test coverage added for all new functionality
  • Updated CLAUDE.md with breaking changes documentation for b8953→b9022 upgrade path
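
The difference between the quoted and plain forms of a reasoning format value can be sketched as follows. `ReasoningFormatSketch` and its two accessor names are hypothetical; only the four constants and their string values (none, auto, deepseek, deepseek-legacy) come from this PR.

```java
/** Hypothetical mirror of the ReasoningFormat enum described in this PR. */
enum ReasoningFormatSketch {
    NONE("none"),
    AUTO("auto"),
    DEEPSEEK("deepseek"),
    DEEPSEEK_LEGACY("deepseek-legacy");

    private final String value;

    ReasoningFormatSketch(String value) {
        this.value = value;
    }

    /** Plain string, as a model-level (command-line style) parameter would use it. */
    String asCliValue() {
        return value;
    }

    /** JSON string literal, as a per-request inference parameter would use it. */
    String asJsonValue() {
        // The four values contain no characters that need JSON escaping.
        return "\"" + value + "\"";
    }
}
```

Keeping both renderings on the enum means callers never hand-quote the value themselves.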

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk

claude added 10 commits May 4, 2026 20:47
Breaking changes in b8962 that affect this project:
- task_params::to_json() drops speculative.n_max/n_min/p_min from output;
  only speculative.type remains. Update test_server.cpp accordingly.

Breaking changes in b8962 that don't affect project code directly:
- struct cpu_params renamed to common_cpu_params (and related functions)
- common_params_speculative restructured with nested sub-structs (.draft.*,
  .ngram_cache.*, .ngram_mod.*, etc.)
- common_arg::is_sparam split into is_sampling + is_spec

New in b8962:
- common_speculative_n_max() / common_speculative_n_min() public API
- CANN backend: fused SwiGLU/GeGLU, softplus, set, cumsum, diag, fill,
  tri, solve_tri ops; improved L2 norm, cross entropy, get/set_rows
- Vulkan: timestamp query sync fix
- WebGPU: Q1_0 quantization support; SSM scan x/B/C overlap handling

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
Breaking changes in b8982 that don't affect project code directly:
- common_sampler_accept: 3rd param renamed accept_grammar → is_generated;
  semantics broadened so false also skips reasoning budget update
- common_reasoning_budget_init: two overloads merged; prefill_tokens param
  removed; callers feed prefill via llama_sampler_accept() loop after init
- ggml_cuda_op_ssm_conv: new optional bias_add_node param; SSM_CONV+ADD+SILU
  CUDA fusion now supported
- speculative.cpp: p_min confidence check moved before result push (fix:
  low-confidence draft tokens now discarded entirely, not appended then ignored)
- server-context.cpp: n_draft_total accounting moved to generation site (fix)

New in b8982:
- Reasoning budget re-arms on subsequent <think> tags (multi-block support)
- CUDA: flash attention for DKQ=320/DV=256 (Mistral Small 4, GQA=32)
- CUDA: fused SSM_CONV + channel-wise bias ADD + SiLU kernel
- CUDA: NVFP4 native Blackwell MMQ path (unified with MXFP4 via template)
- CUDA: quantize_mmq_fp4_cuda replaces quantize_mmq_mxfp4_cuda (covers both)
- ARM: SVE Q8_0 4x8 GEMM kernel for 256-bit SVE with MATMUL_INT8
- PPC: big-endian / AIX tinyBLAS fallback path
- Vulkan: Q4_K scale extraction rewritten via packed uint32 reads (bug fix)
- WebGPU: flash-attn NONE path guard; subgroup-matrix path gated on capability
- ggml: version patch bumped to 0.10.1; backend-meta AllReduce delay fix;
  RISCV SpacemiT xsmtvdotii extension support
- common/log: singleton intentionally leaked to avoid Windows DLL teardown hang

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
No project C++ changes required. Key upstream changes:
- CUDA: fixed swapped get/set_tensor_2d_async function pointers
- Vulkan: added dpitch param to buffer write 2d, implements set/get_tensor_2d
- speculative.cpp: checkpoint helpers renamed (draft_ prefix removed), ckpt_size removed
- arg.cpp: CLI typo --spec--draft-p-split → --spec-draft-p-split
- mmap: Windows >2 GB file fix using _ftelli64/_fseeki64
- httplib: bumped to v0.43.2 (Windows FILE_SHARE_WRITE, DNS cancel, mbedTLS fixes)
- server-context: LLAMA_TRACE env variable for slot acceptance tracing
All 413 C++ tests pass.

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
No project C++ changes required. All 413 C++ unit tests pass.

b8994→b9004 upstream changes (no project impact):
- Vulkan FA: separate k_type/v_type params in coopmat2 pipeline; CREATE_FA_CM2_MIXED macro; new spec constants 12-15 (FaTypeK/FaTypeV/FaBlockBytesK/FaBlockBytesV); DECODEFUNC/NEEDS_INIT_IQ_SHMEM macros removed
- WebGPU: vectorized mul_mat condition fix (removed dst->ne[1] % 4 == 0 guard)
- Hexagon HTP: FA exp2 half-precision option; unary-op non-contiguous tensor fix
- webUI: major Svelte/TypeScript component reorganization (no C++ impact)

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
No project C++ changes required. All 413 C++ unit tests pass.

b9004→b9016 upstream changes (no project impact):
- llama-io.h: read_i interface refactored (read/read_to → read/read_tensor);
  llama_io_write/read_buffer batch backend tensor ops in destructors
- server-context.cpp: static server_get_checkpoint renamed to
  server_prompt_checkpoint_update (in-place ref param)
- arg.cpp: speculative decoding CLI args renamed to --spec-draft-* prefix;
  env vars renamed LLAMA_ARG_DRAFT_* → LLAMA_ARG_SPEC_DRAFT_*
- ggml-cuda: PCI bus ID via cudaDeviceGetPCIBusId (buffer 16→32 bytes)
- ggml-opencl: Adreno MoE MXFP4 GPU-side router reorder; new ns kernels
- ggml-vulkan: GGML_VK_MAX_NODES macro removed
- ggml-webgpu: row_norm gains GGML_OP_NORM support + type parameterization
- llama-model: rope_yarn_log_mul get_key required flag fixed (false not 0.0f)
- common/chat: extract common_chat_templates_generation_prompt helper

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
- New ReasoningFormat enum (none/auto/deepseek/deepseek-legacy) mapping to
  the reasoning_format JSON field accepted by the server
- InferenceParameters.setReasoningFormat(ReasoningFormat) — controls how
  thinking tokens from models like DeepSeek-R1 and QwQ are extracted
- InferenceParameters.setReasoningBudgetTokens(int) — caps the number of
  reasoning tokens emitted before the model is forced to its response (-1 = unlimited)
- 4 new C++ tests for reasoning_budget_tokens parsing in params_from_json_cmpl
  (default -1, positive value, zero, explicit -1); total now 417/417 passing

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
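
The budget semantics described in the commit above (-1 means unlimited, otherwise a cap on reasoning tokens) can be sketched as a single predicate. This is an illustration only: `ReasoningBudgetSketch` is a hypothetical name, and reading a budget of zero as "no reasoning tokens allowed" is an assumption rather than something the commit message states.

```java
/** Hypothetical illustration of how a reasoning token budget could be checked. */
final class ReasoningBudgetSketch {
    /** Sentinel for "no cap", matching the -1 default mentioned in the commit. */
    static final int UNLIMITED = -1;

    /**
     * Returns true once the number of reasoning tokens emitted so far has
     * reached the budget, i.e. the model should move on to its response.
     */
    static boolean mustEndReasoning(int budgetTokens, int reasoningTokensSoFar) {
        if (budgetTokens < 0) {
            return false; // negative budget: never cut reasoning short
        }
        // Assumption: a budget of 0 ends reasoning before any token is emitted.
        return reasoningTokensSoFar >= budgetTokens;
    }
}
```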
…asoningBudgetTokens

Tests all four ReasoningFormat enum values (none/auto/deepseek/deepseek-legacy)
and the three budget token cases (positive, zero, -1/unlimited), matching the
pattern of every other setter in InferenceParameters.

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
…lots bug

Bug fix:
- ModelFlag.CLEAR_IDLE/NO_CLEAR_IDLE mapped to non-existent --clear-idle /
  --no-clear-idle; corrected to --cache-idle-slots / --no-cache-idle-slots
  (the actual llama.cpp CLI flags since b8841)

New ModelParameters:
- setMmproj(String), setMmprojUrl(String), enableMmprojAuto(),
  enableMmprojOffload() — vision model projection file for LLaVA / Gemma3 /
  Qwen2-VL; previously impossible to configure from Java
- setReasoningFormat(ReasoningFormat) — model-level default reasoning format
- setReasoningBudget(int) — model-level default reasoning token budget
- setSleepIdleSeconds(int) — auto-shutdown after N seconds of idle time
- ModelFlag.MMPROJ_AUTO / MMPROJ_OFFLOAD (31 flags total)

New InferenceParameters:
- setTopNSigma(float) — per-request sigma sampling threshold

New ChatResponseParser:
- extractChoiceReasoningContent(String/JsonNode) — reads
  choices[0].message.reasoning_content so callers can access thinking-model
  reasoning output without parsing raw JSON themselves

Tests: 435 Java tests passing (27 new); 417/417 C++ tests passing

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
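
The lookup path named in the commit above, choices[0].message.reasoning_content, can be illustrated without a JSON library by walking already-parsed maps and lists. The class below is a hypothetical stand-in: the real method takes a String or JsonNode, and its behaviour when the field is missing is not stated in this PR, so returning an empty Optional here is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in that walks a parsed chat completion response. */
final class ReasoningContentSketch {
    /** Reads choices[0].message.reasoning_content, or empty if any step is missing. */
    static Optional<String> extractChoiceReasoningContent(Map<String, ?> response) {
        Object choices = response.get("choices");
        if (!(choices instanceof List<?>)) {
            return Optional.empty();
        }
        List<?> list = (List<?>) choices;
        if (list.isEmpty() || !(list.get(0) instanceof Map<?, ?>)) {
            return Optional.empty();
        }
        Object message = ((Map<?, ?>) list.get(0)).get("message");
        if (!(message instanceof Map<?, ?>)) {
            return Optional.empty();
        }
        Object reasoning = ((Map<?, ?>) message).get("reasoning_content");
        return reasoning instanceof String
                ? Optional.of((String) reasoning)
                : Optional.empty();
    }
}
```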
llama.cpp b9016 removed --draft-max and --draft-min: the handler now
unconditionally throws std::invalid_argument at parse time. Calling
setDraftMax() or setDraftMin() (already covered by existing tests but
not exercised in CI without a draft model) caused models to fail to
load with no useful error.

Fix:
- setDraftMax → --spec-draft-n-max  (was --draft-max, removed)
- setDraftMin → --spec-draft-n-min  (was --draft-min, removed)

Also updated still-aliased flags to the canonical --spec-draft-*
names for forward compatibility:
- setDraftPMin → --spec-draft-p-min
- setCtxSizeDraft → --spec-draft-ctx-size
- setDeviceDraft → --spec-draft-device
- setGpuLayersDraft → --spec-draft-ngl
- setModelDraft → --spec-draft-model

Tests updated to expect the new flag names; setDraftMax/setDraftMin
tests now also assert the broken old flag is absent.

https://claude.ai/code/session_018Xi5jWrcJ257WyCx6C2Cpk
@bernardladenthin bernardladenthin merged commit 4773447 into master May 5, 2026
16 checks passed
@bernardladenthin bernardladenthin deleted the claude/determined-volta-T8AoQ branch May 5, 2026 07:54
bernardladenthin pushed a commit that referenced this pull request May 22, 2026
Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from
github.com/kherud/java-llama.cpp and append a Verification plan section
with: (a) a table of new info extracted from each issue body, (b) four
concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the
corresponding action (feature, docs, CI matrix), (d) a recommended PR
sequencing.

Notable finding: #98's original repro did not call enableEmbedding()
at all — the binding never forwarded --embedding to the upstream
server-context, so the result_output assertion fired because the
embedding pipeline was never initialised. enableEmbedding() now
exists in ModelParameters (line 1040), so the fix is essentially
code-confirmed; an integration test against nomic-embed-text is
optional confirmation.
bernardladenthin added a commit that referenced this pull request May 22, 2026
)

* Enrich open-issues baseline with current-fork status

Appends a Status in fork subsection to each of the 37 upstream issues with
a verdict, file:line evidence, and next steps; adds a Status overview
table summarising verdicts across all issues.

* Add deep-dive analysis for likely/partially fixed issues

Appends a per-issue Deep-dive analysis block to each of the 9
LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive
verdict guide categorising which issues are confirmable from code
inspection, which need one targeted JUnit test, and which genuinely
require platform-specific runtime reproduction.

Updates the Status overview table for #121 (FIXED for 64-bit Android)
and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).

* Add verification plan with original-issue research and test sketches

Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from
github.com/kherud/java-llama.cpp and append a Verification plan section
with: (a) a table of new info extracted from each issue body, (b) four
concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the
corresponding action (feature, docs, CI matrix), (d) a recommended PR
sequencing.

Notable finding: #98's original repro did not call enableEmbedding()
at all — the binding never forwarded --embedding to the upstream
server-context, so the result_output assertion fired because the
embedding pipeline was never initialised. enableEmbedding() now
exists in ModelParameters (line 1040), so the fix is essentially
code-confirmed; an integration test against nomic-embed-text is
optional confirmation.

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin pushed a commit that referenced this pull request May 22, 2026
Updates docs/history/49be664_open_issues.md to reflect that the four
JUnit regression tests called for in the verification plan have been
added on this branch:

- Deep-dive verdict guide now lists each test name and self-skip
  behaviour next to its issue bullet
- Per-issue Status blocks for #80, #95, #98, #102 annotated as
  "LIKELY FIXED -> FIXED on CI green" with the covering test
- Status overview table rows for the same four issues updated
- "What the original issues actually contain" feasibility table marks
  all four as DONE with the commit reference
- "Concrete test plan" gains a status callout noting the as-shipped
  implementation matches the sketches
- "Recommended sequencing" step 1 marked DONE and enumerates what
  shipped; remaining steps (#86 docs, #103/#34 typed image API, Android
  emulator CI) carried forward as the next deliverables

No code or behaviour change, documentation only.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW
bernardladenthin added a commit that referenced this pull request May 22, 2026
* test: add JUnit regressions for kherud open issues #80, #95, #98, #102

Adds four small JUnit tests proposed in the verification plan section of
docs/history/49be664_open_issues.md to upgrade the corresponding upstream
issues from LIKELY FIXED to FIXED:

- MemoryManagementTest#testOpenCloseLoopDoesNotLeak (#102) - 20-iteration
  open/close loop; on Linux asserts VmRSS delta < 200 MB. Degenerates to
  a no-crash smoke test on non-Linux hosts where /proc/self/status is
  absent.
- MemoryManagementTest#testOpenCloseWithoutGeneration (#80) - 20 open +
  immediate close without any generation, exercises the half-initialised
  worker race closed by the double server.terminate() in jllama.cpp.
- LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt (#95) - asserts
  the iterator terminates within nPredict+1 steps on a deliberately
  repetitive prompt.
- LlamaEmbeddingsTest#testNomicEmbedLoads (#98) - gated on system
  property net.ladenthin.llama.nomic.path; reproduces the reporter's
  batch/ubatch config plus the fix (enableEmbedding()), and asserts a
  768-dim vector for nomic-embed-text-v1.5.
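
The VmRSS measurement behind the first test in the list above can be sketched as below. This is not the code from MemoryManagementTest; `VmRssSketch` and its method names are hypothetical, and the only facts taken from the commit message are that the value comes from /proc/self/status and that the file is absent on non-Linux hosts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical helper that reads the resident set size of the current JVM. */
final class VmRssSketch {
    /** Extracts the VmRSS value in kB from the lines of a /proc/<pid>/status file. */
    static OptionalLong parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Typical form: "VmRSS:\t  204800 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                try {
                    return OptionalLong.of(Long.parseLong(parts[0]));
                } catch (NumberFormatException e) {
                    return OptionalLong.empty();
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Returns the current VmRSS in kB, or empty where /proc/self/status is unavailable. */
    static OptionalLong currentVmRssKb() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return OptionalLong.empty();
        }
        try {
            return parseVmRssKb(Files.readAllLines(status));
        } catch (IOException e) {
            return OptionalLong.empty();
        }
    }
}
```

A test built on this would sample `currentVmRssKb()` before and after the open/close loop, compare the difference against 200 MB (200 * 1024 kB), and fall back to a plain no-crash check when the value is empty.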

Wires up the optional nomic GGUF download in the linux-x86_64 Java test
job in .github/workflows/publish.yml. Other test jobs cleanly self-skip
via Assume because the system property is unset.

Documents the local native-build workflow in CLAUDE.md - per-host output
paths, mvn-cmake handoff, optional model handling, and the
restricted-network caveat for environments that block huggingface.co.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW

* docs: record #80/#95/#98/#102 regression tests added in 713d426

Updates docs/history/49be664_open_issues.md to reflect that the four
JUnit regression tests called for in the verification plan have been
added on this branch:

- Deep-dive verdict guide now lists each test name and self-skip
  behaviour next to its issue bullet
- Per-issue Status blocks for #80, #95, #98, #102 annotated as
  "LIKELY FIXED -> FIXED on CI green" with the covering test
- Status overview table rows for the same four issues updated
- "What the original issues actually contain" feasibility table marks
  all four as DONE with the commit reference
- "Concrete test plan" gains a status callout noting the as-shipped
  implementation matches the sketches
- "Recommended sequencing" step 1 marked DONE and enumerates what
  shipped; remaining steps (#86 docs, #103/#34 typed image API, Android
  emulator CI) carried forward as the next deliverables

No code or behaviour change, documentation only.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin added a commit that referenced this pull request May 22, 2026
* docs: mark #80/#95/#98/#102 as FIXED now that PR #185 is merged

PR #185 (commit cba693c) merged the four regression tests sketched in the
49be664 open-issues verification plan. Update the per-issue blocks, the
status overview table, the top-level deep-dive verdict guide, and the
recommended-sequencing section to reflect that #80, #95, #98 and #102
are now FIXED (no longer "LIKELY FIXED → FIXED on CI green").

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

* docs: add README "Choosing the right classifier" section

Closes the documentation gap for issue #86 (does the CUDA jar fall back to
CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published).

The new section enumerates the three published classifiers (default CPU,
cuda13-linux-x86-64, opencl-android-aarch64), their backends, target
platforms, and runtime requirements. It explicitly states that the CUDA
JAR is CUDA-only at runtime — it dlopens libcudart.so.13/libcublas.so.13
and has no automatic CPU fallback — and that Android armeabi-v7a is not
shipped as a released artifact.

Updates docs/history/49be664_open_issues.md to mark #86 as
FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation
now documented.

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

---------

Co-authored-by: Claude <noreply@anthropic.com>