Upgrade llama.cpp from b9094 to b9102 #121
Merged
Key changes in b9102:
- Internal CUDA AllReduce pipeline (no NCCL required, works on Windows/PCIe)
- SYCL IM2COL_3D support for Intel GPU backend
- Bug fix: backend sampling now correctly tracks cur_p.selected for n_probs
- Bug fix: post_sampling_probs now works with backend sampling
- n_vocab loading moved to per-model load_arch_hparams() (internal refactor)
- httplib 0.43.4: chunk-size security fix (manual hex parsing vs strtoul)
- ggml version patch 0.11.0 → 0.11.1

No project-level JNI or Java changes required.

https://claude.ai/code/session_01QopdxqEvbkhiaaBRqBzgzc
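To illustrate why the httplib item swaps `strtoul` for manual hex parsing: `strtoul` with base 16 silently accepts leading whitespace, a sign, and a `0x` prefix, and it saturates on overflow rather than failing outright. The sketch below shows one strict way to parse a chunk-size field. It is not cpp-httplib's actual code; the function name `parse_chunk_size` and its signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper (not the cpp-httplib implementation).
// Accepts only plain hex digits; rejects empty input, signs, whitespace,
// a "0x" prefix, and any value that does not fit in 64 bits.
bool parse_chunk_size(const std::string& s, std::uint64_t& out) {
    if (s.empty()) return false;
    std::uint64_t value = 0;
    for (char c : s) {
        unsigned digit;
        if (c >= '0' && c <= '9')      digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10;
        else return false;  // any non-hex character is an error
        // Shifting left by 4 would overflow if the top nibble is already in use.
        if (value > (std::numeric_limits<std::uint64_t>::max() >> 4)) return false;
        value = (value << 4) | digit;
    }
    out = value;
    return true;
}
```

With this shape, `parse_chunk_size("1a", n)` succeeds with `n == 26`, while `"-1"`, `"0x10"`, and a 17-digit hex string are all rejected instead of being coerced into a number.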
🔍 PR Review: llama.cpp Upgrade b9094 → b9102
❌ CRITICAL ISSUE: CLAUDE.md Version Not Updated
The PR shows a diff updating CLAUDE.md, but the file was not actually modified:
Current state:
Expected:
✅ CORRECT: Version Updates in Other Files
📋 Required Changes
Please update CLAUDE.md:
This is straightforward to fix; the detailed changelog text already appears in the PR description.
  Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

- Current llama.cpp pinned version: **b9094**
+ Current llama.cpp pinned version: **b9102**
The pinned version here is still b9094 but should be updated to b9102 to match the upgrade in CMakeLists.txt and README.md.
Suggested change
| Current llama.cpp pinned version: **b9102** | |
| Current llama.cpp pinned version: **b9102** |
| @@ -240,6 +240,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren | |||
| | ~b9071–b9094 | `tools/server/server-models.h` + `server.cpp` | Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required | | |||
| | ~b9071–b9094 | `common/reasoning-budget.cpp` | Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required | | |||
| | ~b9071–b9094 | `tools/server/webui/` | Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required | | |||
📌 Missing changelog entries for the b9094–b9102 range. According to the PR description, 7 new changelog rows should be added here:
- CUDA AllReduce pipeline refactoring (`ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh`)
- CUDA multi-GPU changes (`ggml/src/ggml-cuda/ggml-cuda.cu`)
- SYCL im2col support (`ggml/src/ggml-sycl/ggml-sycl.cpp` + `im2col.cpp` + `im2col.hpp`)
- GGML version patch bump (0.11.0 → 0.11.1)
- Sampling bug fix (`common/sampling.cpp`)
- Server context changes (`tools/server/server-context.cpp`)
- Model loading refactor (`src/llama-model.cpp`)
- cpp-httplib security fix (`vendor/cpp-httplib/`)

These details are in the PR description and should be preserved in CLAUDE.md for future reference.
  llama.cpp
  GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
- GIT_TAG b9094
+ GIT_TAG b9102
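The review comments in this PR turn on whether the pinned tag in CMakeLists.txt and the one documented in CLAUDE.md agree. As a rough sketch of how that could be checked mechanically, the helper below extracts the first `b<digits>` tag after a marker string and compares the two. The function names and the marker strings (`GIT_TAG`, `pinned version:`) are assumptions based on the lines quoted in this PR, not an existing tool in the project.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the first "b<digits>" tag that follows
// `marker` in `text`, or an empty string if the marker or tag is missing.
std::string find_pinned_tag(const std::string& text, const std::string& marker) {
    const std::size_t pos = text.find(marker);
    if (pos == std::string::npos) return "";
    static const std::regex tag_re("b[0-9]+");
    const std::string rest = text.substr(pos + marker.size());
    std::smatch m;
    if (std::regex_search(rest, m, tag_re)) return m.str();
    return "";
}

// True only when both files carry a tag and the tags are identical.
bool pins_agree(const std::string& cmake_text, const std::string& claude_md_text) {
    const std::string cmake_tag = find_pinned_tag(cmake_text, "GIT_TAG");
    const std::string doc_tag   = find_pinned_tag(claude_md_text, "pinned version:");
    return !cmake_tag.empty() && cmake_tag == doc_tag;
}
```

Fed the two post-upgrade lines quoted above, `pins_agree` returns true; with `GIT_TAG b9102` against a CLAUDE.md line that still says `b9094`, it returns false, which is the mismatch the reviewer was flagging.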
bernardladenthin pushed a commit that referenced this pull request on May 22, 2026
Appends a per-issue Deep-dive analysis block to each of the 9 LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive verdict guide categorising which issues are confirmable from code inspection, which need one targeted JUnit test, and which genuinely require platform-specific runtime reproduction. Updates the Status overview table for #121 (FIXED for 64-bit Android) and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).
bernardladenthin pushed a commit that referenced this pull request on May 22, 2026
Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from github.com/kherud/java-llama.cpp and append a Verification plan section with: (a) a table of new info extracted from each issue body, (b) four concrete JUnit test sketches that would close out #80, #95, #98, #102, (c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the corresponding action (feature, docs, CI matrix), (d) a recommended PR sequencing. Notable finding: #98's original repro did not call enableEmbedding() at all — the binding never forwarded --embedding to the upstream server-context, so the result_output assertion fired because the embedding pipeline was never initialised. enableEmbedding() now exists in ModelParameters (line 1040), so the fix is essentially code-confirmed; an integration test against nomic-embed-text is optional confirmation.
bernardladenthin added a commit that referenced this pull request on May 22, 2026
)

* Enrich open-issues baseline with current-fork status

Appends a Status in fork subsection to each of the 37 upstream issues with a verdict, file:line evidence, and next steps; adds a Status overview table summarising verdicts across all issues.

* Add deep-dive analysis for likely/partially fixed issues

Appends a per-issue Deep-dive analysis block to each of the 9 LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive verdict guide categorising which issues are confirmable from code inspection, which need one targeted JUnit test, and which genuinely require platform-specific runtime reproduction. Updates the Status overview table for #121 (FIXED for 64-bit Android) and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).

* Add verification plan with original-issue research and test sketches

Fetched verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from github.com/kherud/java-llama.cpp and append a Verification plan section with: (a) a table of new info extracted from each issue body, (b) four concrete JUnit test sketches that would close out #80, #95, #98, #102, (c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the corresponding action (feature, docs, CI matrix), (d) a recommended PR sequencing. Notable finding: #98's original repro did not call enableEmbedding() at all; the binding never forwarded --embedding to the upstream server-context, so the result_output assertion fired because the embedding pipeline was never initialised. enableEmbedding() now exists in ModelParameters (line 1040), so the fix is essentially code-confirmed; an integration test against nomic-embed-text is optional confirmation.

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin pushed a commit that referenced this pull request on May 22, 2026
Closes the documentation gap for issue #86 (does the CUDA jar fall back to CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published). The new section enumerates the three published classifiers (default CPU, cuda13-linux-x86-64, opencl-android-aarch64), their backends, target platforms, and runtime requirements. It explicitly states that the CUDA JAR is CUDA-only at runtime — it dlopens libcudart.so.13/libcublas.so.13 and has no automatic CPU fallback — and that Android armeabi-v7a is not shipped as a released artifact. Updates docs/history/49be664_open_issues.md to mark #86 as FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation now documented. https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43
bernardladenthin added a commit that referenced this pull request on May 22, 2026
* docs: mark #80/#95/#98/#102 as FIXED now that PR #185 is merged

PR #185 (commit cba693c) merged the four regression tests sketched in the 49be664 open-issues verification plan. Update the per-issue blocks, the status overview table, the top-level deep-dive verdict guide, and the recommended-sequencing section to reflect that #80, #95, #98 and #102 are now FIXED (no longer "LIKELY FIXED → FIXED on CI green").

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

* docs: add README "Choosing the right classifier" section

Closes the documentation gap for issue #86 (does the CUDA jar fall back to CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published). The new section enumerates the three published classifiers (default CPU, cuda13-linux-x86-64, opencl-android-aarch64), their backends, target platforms, and runtime requirements. It explicitly states that the CUDA JAR is CUDA-only at runtime (it dlopens libcudart.so.13/libcublas.so.13 and has no automatic CPU fallback) and that Android armeabi-v7a is not shipped as a released artifact. Updates docs/history/49be664_open_issues.md to mark #86 as FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation now documented.

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

---------

Co-authored-by: Claude <noreply@anthropic.com>
Summary
This PR upgrades the pinned llama.cpp version from b9094 to b9102, incorporating upstream improvements to CUDA AllReduce pipelines, SYCL im2col support, sampling fixes, and security updates.

Key Changes
- `ggml_cuda_ar_pipeline` struct and APIs supporting 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+), with configurable chunked kernel vs copy-engine paths and environment variable tuning
- `ggml_sycl_im2col_3d` function enabling `GGML_OP_IM2COL_3D` on Intel GPUs with tile-based thread decomposition
- `common_sampler_sample` now calls `set_logits` before the backend-sampling check and scans all tokens in `cur_p.data`, enabling correct post-sampling probabilities with backend sampling
- `n_vocab` loading moved from `llama_model_base::load_hparams()` to per-model `load_arch_hparams()` implementations

Notable Details
- `GGML_CUDA_ALLREDUCE` environment variable override
- `post_sampling_probs` and filters 0.0-probability tokens from results

https://claude.ai/code/session_01QopdxqEvbkhiaaBRqBzgzc
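The sampling items above describe two behaviours: scanning all candidate tokens to locate the selected one, and dropping 0.0-probability tokens from the reported results. The following is a minimal sketch of that idea only. `TokenProb`, `find_selected`, and `top_probs` are hypothetical stand-ins, and none of this is taken from `common/sampling.cpp` or the server code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a candidate token and its post-sampling probability.
struct TokenProb {
    int   id;
    float p;
};

// Scan every candidate (not just a sorted prefix) for the selected token.
// Returns its index, or -1 if the token is not among the candidates.
int find_selected(const std::vector<TokenProb>& cands, int selected_id) {
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].id == selected_id) return static_cast<int>(i);
    }
    return -1;
}

// Return at most n_probs candidates, highest probability first,
// after removing tokens whose probability is 0.0.
std::vector<TokenProb> top_probs(std::vector<TokenProb> cands, std::size_t n_probs) {
    cands.erase(std::remove_if(cands.begin(), cands.end(),
                               [](const TokenProb& t) { return t.p <= 0.0f; }),
                cands.end());
    std::stable_sort(cands.begin(), cands.end(),
                     [](const TokenProb& a, const TokenProb& b) { return a.p > b.p; });
    if (cands.size() > n_probs) cands.resize(n_probs);
    return cands;
}
```

The point of the full scan is that the selected token need not sit at a fixed position in the candidate array, so a lookup that assumes a position can report the wrong entry; filtering zeros afterwards keeps the reported list limited to tokens that could actually have been sampled.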