CI updates #1390

Merged: martindevans merged 9 commits into SciSharp:master from m0nsky:fix/ci-update-binaries-failures on May 24, 2026

@m0nsky m0nsky commented May 24, 2026

This PR does a couple of things:

  • Disable the web UI build (-DLLAMA_BUILD_UI=OFF). It triggered an npm install/build that was causing OOM failures.
  • Disable the app build (-DLLAMA_BUILD_APP=OFF). On Android it links against llama-server-impl and llama-cli-impl, which we weren't building, so the link step failed.
  • Disable the examples and server builds (-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_SERVER=OFF), which we weren't using.
  • Properly set the job count for Windows (env:NUMBER_OF_PROCESSORS), Linux (nproc) and macOS (sysctl -n hw.logicalcpu), resulting in faster build times. Previously this expanded to an empty string on macOS, and we weren't even doing this for cublas!
  • Suppress upstream CUDA warnings, which were producing 381k-line log files on the Windows CUDA runner and making failures hard to debug.
  • Set cmake targets to build only the shared libraries that LLamaSharp uses (ggml, ggml-base, ggml-cpu/cuda/vulkan, llama + mtmd).

This results in the complete binary update workflow going from 2h 55m -> 1h 23m

| Job | Before | After | Saved | Change |
| --- | --- | --- | --- | --- |
| Linux (noavx) | 4m 32s | 2m 12s | 2m 20s | -51% |
| Linux (avx) | 4m 42s | 2m 21s | 2m 21s | -50% |
| Linux (avx2) | 4m 49s | 2m 12s | 2m 37s | -54% |
| Linux (avx512) | 4m 44s | 1m 53s | 2m 51s | -60% |
| Linux (aarch64) | 4m 41s | 1m 39s | 3m 2s | -65% |
| musl (noavx) | 7m 34s | 3m 18s | 4m 16s | -56% |
| musl (avx) | 7m 14s | 3m 12s | 4m 2s | -56% |
| musl (avx2) | 7m 33s | 3m 27s | 4m 6s | -54% |
| musl (avx512) | 7m 26s | 3m 3s | 4m 23s | -59% |
| Windows (noavx) | 7m 13s | 3m 26s | 3m 47s | -52% |
| Windows (avx) | 6m 37s | 3m 14s | 3m 23s | -51% |
| Windows (avx2) | 6m 14s | 3m 13s | 3m 1s | -48% |
| Windows (avx512) | 6m 11s | 3m 22s | 2m 49s | -46% |
| Windows ARM64 | 4m 22s | 2m 40s | 1m 42s | -39% |
| Vulkan (Linux) | 8m 30s | 5m 53s | 2m 37s | -31% |
| Vulkan (Windows) | 11m 26s | 8m 36s | 2m 50s | -25% |
| cublas (Linux) | 2h 7m 24s | 59m 33s | 1h 7m 51s | -53% |
| cublas (Windows) | 2h 53m 46s | 1h 22m 25s | 1h 31m 21s | -53% |
| macOS (arm64) | 25m 38s | 2m 28s | 23m 10s | -90% |
| macOS (x64) | 31m 55s | 2m 47s | 29m 8s | -91% |
| macOS (x64-rosetta2) | 22m 56s | 2m 14s | 20m 42s | -90% |
| Android (arm64-v8a) | 4m 34s | 2m 32s | 2m 2s | -45% |
| Android (x86_64) | 4m 40s | 2m 43s | 1m 57s | -42% |
| Gather Binaries | 1m 20s | 1m 22s | | |
| Total | 2h 55m 13s | 1h 23m 56s | 1h 31m 17s | -52% |
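
The Saved and Change columns follow directly from the Before and After durations. A minimal sketch of that arithmetic, assuming durations written as in the table (for example `2h 55m 13s`); the helper names `to_seconds` and `percent_change` are illustrative and not part of the PR:

```shell
# Convert a duration such as "2h 55m 13s" to seconds.
to_seconds() {
  total=0
  for part in $1; do            # intentionally unquoted: split on spaces
    case "$part" in
      *h) total=$((total + ${part%h} * 3600)) ;;
      *m) total=$((total + ${part%m} * 60)) ;;
      *s) total=$((total + ${part%s})) ;;
    esac
  done
  echo "$total"
}

# Relative change from before to after, rounded to a whole percent.
percent_change() {
  before=$(to_seconds "$1")
  after=$(to_seconds "$2")
  awk -v b="$before" -v a="$after" 'BEGIN { printf "%.0f%%\n", (a - b) * 100 / b }'
}

percent_change "2h 55m 13s" "1h 23m 56s"   # Total row: prints -52%
```
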

Completed build run:
https://github.com/m0nsky/LLamaSharp/actions/runs/26358923261

m0nsky added 9 commits May 24, 2026 09:26
… syntax

llama.cpp now builds an embedded Web UI (npm install + build) by default,
which combined with unlimited parallel compilation exhausts the ~7GB RAM
on GitHub ubuntu-22.04 runners. Disable it with -DLLAMA_BUILD_UI=OFF
since LLamaSharp only needs the shared libraries.

Also fix -j ${env:NUMBER_OF_PROCESSORS} (PowerShell syntax) to -j $(nproc)
in bash steps. The old syntax silently expanded to empty, causing cmake
to use unlimited parallelism.

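
The silent expansion can be reproduced outside CI. A minimal sketch, assuming `bash` is on the PATH and that no shell variable named `env` is set:

```shell
# Bash parses ${env:NUMBER_OF_PROCESSORS} as a substring expansion of a
# variable named "env" (with NUMBER_OF_PROCESSORS as the offset), not as a
# PowerShell environment lookup. With "env" unset the result is empty.
# The single quotes keep the outer shell from touching the expression.
expanded=$(bash -c 'printf "%s" "${env:NUMBER_OF_PROCESSORS}"')
printf 'expanded to: [%s]\n' "$expanded"
```

No error is raised unless the script runs with `set -u`, which is why the problem went unnoticed.
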
llama.cpp introduced a unified binary (llama-app) that links against
llama-server-impl and llama-cli-impl. When LLAMA_BUILD_SERVER=OFF
(as set for Android), these libraries aren't built, causing a linker
error. Disable llama-app globally since LLamaSharp only needs the
shared libraries, not the CLI tools.

The macOS step was using PowerShell syntax ${env:NUMBER_OF_PROCESSORS}
which silently expands to empty in bash, resulting in unlimited
parallelism. Use $(sysctl -n hw.logicalcpu) which is the correct
macOS equivalent of nproc.

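
For reference, the Linux and macOS job-count commands named here can be folded into one helper. This is a sketch only: `detect_jobs` is a hypothetical function that the workflow does not define, it assumes a runner without `nproc` is macOS, and the `cmake --build` line is an illustrative invocation rather than the workflow's actual command.

```shell
detect_jobs() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                                          # Linux (GNU coreutils)
  else
    sysctl -n hw.logicalcpu 2>/dev/null || echo 1  # macOS, else fall back to 1
  fi
}

# Print the build command instead of running it.
echo "cmake --build . --config Release -j $(detect_jobs)"
```
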
The cublas build steps had no -j flag, defaulting to single-threaded
compilation. Add -j with the correct platform syntax to parallelize
CUDA kernel compilation.

The nvcc compiler emits thousands of warnings from upstream llama.cpp
CUDA code (e.g. float overflow #221-D, unused variables #177-D),
repeated for each of the 8 target architectures. On Windows this
produces 381k+ lines of log output, truncating the actual build output.
Suppress with -DCMAKE_CUDA_FLAGS=-w since we don't maintain this code.

LLamaSharp only needs the shared libraries (ggml, llama, mtmd), not the
CLI tools, server, or example binaries. Disable examples and server
globally via COMMON_DEFINE, and remove the now-redundant per-platform
LLAMA_BUILD_SERVER=OFF from Android defines.

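
A sketch of how such a shared define list might be passed to a configure step. The four `-D` options are the ones named in this PR; the real COMMON_DEFINE value in the workflow may contain other options that are not shown in this conversation, and the `cmake -S . -B build` invocation is an assumption.

```shell
# Options taken from the PR description; anything else the workflow sets is omitted.
COMMON_DEFINE="-DLLAMA_BUILD_UI=OFF -DLLAMA_BUILD_APP=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_SERVER=OFF"

# Print the configure command instead of running it, so the sketch works
# without a llama.cpp checkout. $COMMON_DEFINE is left unquoted on purpose
# so that each option becomes its own argument.
echo cmake -S . -B build $COMMON_DEFINE
```
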
Add SUPPRESS_WARNINGS_MSVC (/w) and SUPPRESS_WARNINGS_GNU (-w) env vars
and apply them to all cmake configure steps. These are upstream llama.cpp
warnings we don't maintain, and they are particularly noisy on Windows, where
MSVC template instantiation warnings produce hundreds of thousands of log lines.

Instead of building all llama.cpp targets (CLI tools, benchmarks, server,
examples), use cmake --target to build only the shared libraries that
LLamaSharp actually uses: ggml, ggml-base, ggml-cpu/cuda/vulkan, llama,
and mtmd. This skips ~40 unnecessary targets and their dependencies.
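
A rough sketch of what a target-restricted build command could look like. Passing several names to `cmake --build --target` requires CMake 3.15 or newer. The helper `build_targets` is hypothetical, and the assumption that GPU jobs build their backend library in addition to ggml-cpu is not stated in the PR.

```shell
build_targets() {
  targets="ggml ggml-base ggml-cpu llama mtmd"
  case "$1" in
    cuda|vulkan) targets="$targets ggml-$1" ;;   # add the GPU backend library
  esac
  echo "$targets"
}

# Print rather than run, since this sketch has no llama.cpp build tree.
# The command substitution is unquoted so each target is a separate argument.
echo cmake --build . --config Release --target $(build_targets vulkan)
```
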

@martindevans (Member)

Huge speedups! Thanks for this, it'll make future binary updates a lot less painful. I've tested this locally with the binaries from your test run and it worked perfectly.

@martindevans martindevans merged commit 5c5b706 into SciSharp:master May 24, 2026
8 checks passed
@m0nsky m0nsky deleted the fix/ci-update-binaries-failures branch May 24, 2026 19:29