Upgrade llama.cpp from b9094 to b9102 #121

Merged: bernardladenthin merged 1 commit into main from claude/loving-heisenberg-HIVmc on May 11, 2026

@bernardladenthin (Owner)

Summary

This PR upgrades the pinned llama.cpp version from b9094 to b9102, incorporating upstream improvements to CUDA AllReduce pipelines, SYCL im2col support, sampling fixes, and security updates.

Key Changes

  • CUDA AllReduce Pipeline Refactoring: New ggml_cuda_ar_pipeline struct and APIs supporting 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+), with configurable chunked kernel vs copy-engine paths and environment variable tuning
  • SYCL im2col Support: Added ggml_sycl_im2col_3d function enabling GGML_OP_IM2COL_3D on Intel GPUs with tile-based thread decomposition
  • Sampling Bug Fix: common_sampler_sample now calls set_logits before the backend-sampling check and scans all tokens in cur_p.data, so post-sampling probabilities are correct when backend sampling is enabled
  • Model Loading Refactor: Moved n_vocab loading from llama_model_base::load_hparams() to per-model load_arch_hparams() implementations
  • Security Fix: Updated cpp-httplib to 0.43.4, which hardens chunk-size parsing to prevent overflow and rejects invalid chunk extensions
  • GGML Version Bump: Patch version incremented from 0.11.0 to 0.11.1

Notable Details

  • CUDA AllReduce supports platform-specific defaults (Linux→NCCL, Windows→internal) with GGML_CUDA_ALLREDUCE environment variable override
  • Backend sampling now works correctly with post_sampling_probs and filters 0.0-probability tokens from results
  • HIP/MUSA builds return nullptr stubs for AllReduce pipeline
  • All changes are internal to llama.cpp; no JNI layer modifications required
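
As a rough illustration of the second bullet above (dropping 0.0-probability tokens from the reported results), here is a minimal Python sketch. It is not the upstream C++ code: the function name `post_sampling_probs`, the `(token_id, probability)` pair layout, and the descending sort are all assumptions made for the example.

```python
def post_sampling_probs(candidates, n_probs):
    """Return up to n_probs (token_id, probability) pairs, highest first.

    `candidates` stands in for the sampler's candidate list after sampling;
    the pair layout is an assumption of this sketch, not the upstream type.
    """
    # Tokens removed by the sampler chain end up with probability 0.0;
    # they carry no information, so they are filtered out of the result.
    nonzero = [(tok, p) for tok, p in candidates if p > 0.0]
    # Sort by probability, largest first, then truncate to the requested count.
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return nonzero[:n_probs]
```

With four candidates of which two have probability 0.0, a request for three entries returns only the two surviving tokens, which is the behavior the bullet describes.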

https://claude.ai/code/session_01QopdxqEvbkhiaaBRqBzgzc

Key changes in b9102:
- Internal CUDA AllReduce pipeline (no NCCL required, works on Windows/PCIe)
- SYCL IM2COL_3D support for Intel GPU backend
- Bug fix: backend sampling now correctly tracks cur_p.selected for n_probs
- Bug fix: post_sampling_probs now works with backend sampling
- n_vocab loading moved to per-model load_arch_hparams() (internal refactor)
- httplib 0.43.4: chunk-size security fix (manual hex parsing vs strtoul)
- ggml version patch 0.11.0 → 0.11.1
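
The httplib entry above contrasts manual hex parsing with strtoul. C's strtoul accepts leading whitespace, a sign, and a `0x` prefix, and saturates on overflow instead of failing, which is a poor fit for untrusted chunk headers. The Python sketch below shows the general idea of a strict parser; it is not the cpp-httplib implementation. The name `parse_chunk_size`, the 64-bit bound, and the simplified rule that any trailing text must begin with `;` are assumptions of this example.

```python
MAX_CHUNK_SIZE = 2**64 - 1  # assumed bound, mimicking a 64-bit size type
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_chunk_size(line: str) -> int:
    """Parse the size field of an HTTP/1.1 chunk header line (CRLF removed)."""
    value = 0
    pos = 0
    while pos < len(line) and line[pos] in HEX_DIGITS:
        digit = int(line[pos], 16)
        # Check before multiplying so the running value never exceeds the bound.
        if value > (MAX_CHUNK_SIZE - digit) // 16:
            raise ValueError("chunk size overflows")
        value = value * 16 + digit
        pos += 1
    if pos == 0:
        # Rejects empty input, signs, and leading whitespace.
        raise ValueError("chunk size must start with a hex digit")
    # Simplified extension check: anything after the size must start with ';'.
    rest = line[pos:].lstrip(" \t")
    if rest and not rest.startswith(";"):
        raise ValueError("invalid chunk extension")
    return value
```

Inputs such as `"1a"` and `"ff;name=value"` parse to 26 and 255, while `"-1"`, `"0x10"`, and a 17-digit value raise `ValueError`.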

No project-level JNI or Java changes required.

claude (Bot) commented May 11, 2026

🔍 PR Review: llama.cpp Upgrade b9094 → b9102

❌ CRITICAL ISSUE: CLAUDE.md Version Not Updated

The PR shows a diff updating CLAUDE.md, but the file was not actually modified:

Current state:

  • Line 9 in CLAUDE.md still shows: Current llama.cpp pinned version: **b9094**
  • Changelog entries for the b9094–b9102 range are missing

Expected:

  • Line 9 should be: Current llama.cpp pinned version: **b9102**
  • 7 new changelog entries should be added (CUDA AllReduce pipeline, SYCL im2col, sampling bug fix, server-context changes, model loading refactor, cpp-httplib security fix, ggml version bump)

✅ CORRECT: Version Updates in Other Files

  • CMakeLists.txt (line 100): Correctly pinned to b9102
  • README.md (line 2): Correctly updated badge to b9102

📋 Required Changes

Please update CLAUDE.md:

  1. Change line 9 version from b9094 to b9102
  2. Add the 7 new changelog table entries documented in the PR description

This is straightforward to fix — the detailed changelog text already appears in the PR description.
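
A mismatch like this could be caught automatically. The sketch below is one possible check, not part of this repository: the function names are hypothetical, and the two regular expressions are assumptions based only on the `Current llama.cpp pinned version: **b…**` and `GIT_TAG b…` lines quoted in this review.

```python
import re

# Patterns assumed from the two lines quoted in the review comments.
CLAUDE_MD_PATTERN = re.compile(r"Current llama\.cpp pinned version: \*\*(b\d+)\*\*")
CMAKE_PATTERN = re.compile(r"GIT_TAG\s+(b\d+)")


def pinned_versions(claude_md_text, cmake_text):
    """Extract the first pinned tag found in each file's text (None if absent)."""
    claude = CLAUDE_MD_PATTERN.search(claude_md_text)
    cmake = CMAKE_PATTERN.search(cmake_text)
    return {
        "CLAUDE.md": claude.group(1) if claude else None,
        "CMakeLists.txt": cmake.group(1) if cmake else None,
    }


def versions_consistent(claude_md_text, cmake_text):
    """True only if both files name a tag and the tags are identical."""
    found = pinned_versions(claude_md_text, cmake_text)
    return None not in found.values() and len(set(found.values())) == 1
```

Only the first `GIT_TAG b…` match is used, so a CMakeLists.txt that pins several dependencies with similar tags would need a narrower pattern.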

Comment thread CLAUDE.md
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9094**
+Current llama.cpp pinned version: **b9102**

⚠️ This line still shows b9094 but should be updated to b9102 to match the upgrade in CMakeLists.txt and README.md.

Suggested change
-Current llama.cpp pinned version: **b9102**
+Current llama.cpp pinned version: **b9102**

Comment thread CLAUDE.md
@@ -240,6 +240,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
| ~b9071–b9094 | `tools/server/server-models.h` + `server.cpp` | Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required |
| ~b9071–b9094 | `common/reasoning-budget.cpp` | Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required |
| ~b9071–b9094 | `tools/server/webui/` | Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required |

📌 Missing changelog entries for the b9094–b9102 range. According to the PR description, the following changelog rows should be added here:

  1. CUDA AllReduce pipeline refactoring (ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh)
  2. CUDA multi-GPU changes (ggml/src/ggml-cuda/ggml-cuda.cu)
  3. SYCL im2col support (ggml/src/ggml-sycl/ggml-sycl.cpp + im2col.cpp + im2col.hpp)
  4. GGML version patch bump (0.11.0 → 0.11.1)
  5. Sampling bug fix (common/sampling.cpp)
  6. Server context changes (tools/server/server-context.cpp)
  7. Model loading refactor (src/llama-model.cpp)
  8. cpp-httplib security fix (vendor/cpp-httplib/)

These details are in the PR description and should be preserved in CLAUDE.md for future reference.

Comment thread CMakeLists.txt
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9094
+GIT_TAG b9102

✅ Correct: Version pinned to b9102

bernardladenthin merged commit 2d87277 into main on May 11, 2026
21 checks passed
bernardladenthin deleted the claude/loving-heisenberg-HIVmc branch on May 11, 2026 at 10:59
bernardladenthin pushed a commit that referenced this pull request May 22, 2026
Appends a per-issue Deep-dive analysis block to each of the 9
LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive
verdict guide categorising which issues are confirmable from code
inspection, which need one targeted JUnit test, and which genuinely
require platform-specific runtime reproduction.

Updates the Status overview table for #121 (FIXED for 64-bit Android)
and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).
bernardladenthin pushed a commit that referenced this pull request May 22, 2026
Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from
github.com/kherud/java-llama.cpp and append a Verification plan section
with: (a) a table of new info extracted from each issue body, (b) four
concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the
corresponding action (feature, docs, CI matrix), (d) a recommended PR
sequencing.

Notable finding: #98's original repro did not call enableEmbedding()
at all — the binding never forwarded --embedding to the upstream
server-context, so the result_output assertion fired because the
embedding pipeline was never initialised. enableEmbedding() now
exists in ModelParameters (line 1040), so the fix is essentially
code-confirmed; an integration test against nomic-embed-text is
optional confirmation.
bernardladenthin added a commit that referenced this pull request May 22, 2026
)

* Enrich open-issues baseline with current-fork status

Appends a Status in fork subsection to each of the 37 upstream issues with
a verdict, file:line evidence, and next steps; adds a Status overview
table summarising verdicts across all issues.

* Add deep-dive analysis for likely/partially fixed issues

Appends a per-issue Deep-dive analysis block to each of the 9
LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive
verdict guide categorising which issues are confirmable from code
inspection, which need one targeted JUnit test, and which genuinely
require platform-specific runtime reproduction.

Updates the Status overview table for #121 (FIXED for 64-bit Android)
and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).

* Add verification plan with original-issue research and test sketches

Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from
github.com/kherud/java-llama.cpp and append a Verification plan section
with: (a) a table of new info extracted from each issue body, (b) four
concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the
corresponding action (feature, docs, CI matrix), (d) a recommended PR
sequencing.

Notable finding: #98's original repro did not call enableEmbedding()
at all — the binding never forwarded --embedding to the upstream
server-context, so the result_output assertion fired because the
embedding pipeline was never initialised. enableEmbedding() now
exists in ModelParameters (line 1040), so the fix is essentially
code-confirmed; an integration test against nomic-embed-text is
optional confirmation.

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin pushed a commit that referenced this pull request May 22, 2026
Closes the documentation gap for issue #86 (does the CUDA jar fall back to
CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published).

The new section enumerates the three published classifiers (default CPU,
cuda13-linux-x86-64, opencl-android-aarch64), their backends, target
platforms, and runtime requirements. It explicitly states that the CUDA
JAR is CUDA-only at runtime — it dlopens libcudart.so.13/libcublas.so.13
and has no automatic CPU fallback — and that Android armeabi-v7a is not
shipped as a released artifact.

Updates docs/history/49be664_open_issues.md to mark #86 as
FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation
now documented.

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43
bernardladenthin added a commit that referenced this pull request May 22, 2026
* docs: mark #80/#95/#98/#102 as FIXED now that PR #185 is merged

PR #185 (commit cba693c) merged the four regression tests sketched in the
49be664 open-issues verification plan. Update the per-issue blocks, the
status overview table, the top-level deep-dive verdict guide, and the
recommended-sequencing section to reflect that #80, #95, #98 and #102
are now FIXED (no longer "LIKELY FIXED → FIXED on CI green").

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

* docs: add README "Choosing the right classifier" section

Closes the documentation gap for issue #86 (does the CUDA jar fall back to
CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published).

The new section enumerates the three published classifiers (default CPU,
cuda13-linux-x86-64, opencl-android-aarch64), their backends, target
platforms, and runtime requirements. It explicitly states that the CUDA
JAR is CUDA-only at runtime — it dlopens libcudart.so.13/libcublas.so.13
and has no automatic CPU fallback — and that Android armeabi-v7a is not
shipped as a released artifact.

Updates docs/history/49be664_open_issues.md to mark #86 as
FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation
now documented.

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

---------

Co-authored-by: Claude <noreply@anthropic.com>