sync : llama.cpp #1311

ggerganov · 2025-07-24T17:29:12Z

No description provided.

…316)

* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan
* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col)
* test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests
* Performance fixes: minimized branch divergence, uses collectives to eliminate redundant calculation, macros removed.
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance measurement.
* Apple/Win32 compile errors fixed
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
* Collectives disabled by default.
* Intel support is disabled as the performance is poor.
* Conv2d enabled for Intel with disabled collectives, disabled for Apple
* test-backend-ops modifications are reverted
* Trailing spaces and missing override fixed.
* Triggering pipeline relaunch.
* Code formatted with .clang-format.
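For readers unfamiliar with computing a convolution "directly with gemm (no need for im2col)", the idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the Vulkan shader from the commit: the function names (`conv2d_direct`, `conv2d_implicit_gemm`), the `[Cin][H][W]` / `[Cout][Cin][KH][KW]` layouts, and the single-image, no-dilation setup are all hypothetical choices made for the example. The point is that the im2col matrix B is never stored; each element is looked up in the input on the fly.

```python
def conv2d_direct(x, w, stride=1, pad=0):
    """Reference: nested-loop 2D convolution (cross-correlation).
    x is [Cin][H][W], w is [Cout][Cin][KH][KW]; layouts are assumptions."""
    Cin, H, W = len(x), len(x[0]), len(x[0][0])
    Cout, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH = (H + 2 * pad - KH) // stride + 1
    OW = (W + 2 * pad - KW) // stride + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(Cout)]
    for co in range(Cout):
        for oh in range(OH):
            for ow in range(OW):
                acc = 0.0
                for ci in range(Cin):
                    for ky in range(KH):
                        for kx in range(KW):
                            ih = oh * stride + ky - pad
                            iw = ow * stride + kx - pad
                            if 0 <= ih < H and 0 <= iw < W:
                                acc += w[co][ci][ky][kx] * x[ci][ih][iw]
                out[co][oh][ow] = acc
    return out


def conv2d_implicit_gemm(x, w, stride=1, pad=0):
    """Same result expressed as C[M x N] = A[M x K] * B[K x N] with
    M = Cout, K = Cin*KH*KW, N = OH*OW. B (the im2col matrix) is never
    materialized: b_elem() reads the input tensor directly."""
    Cin, H, W = len(x), len(x[0]), len(x[0][0])
    Cout, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH = (H + 2 * pad - KH) // stride + 1
    OW = (W + 2 * pad - KW) // stride + 1
    K, N = Cin * KH * KW, OH * OW

    def a_elem(m, k):
        # Row m of A is filter m, flattened over (ci, ky, kx).
        ci, rem = divmod(k, KH * KW)
        ky, kx = divmod(rem, KW)
        return w[m][ci][ky][kx]

    def b_elem(k, n):
        # Element (k, n) of the virtual im2col matrix, zero in the padding.
        ci, rem = divmod(k, KH * KW)
        ky, kx = divmod(rem, KW)
        oh, ow = divmod(n, OW)
        ih = oh * stride + ky - pad
        iw = ow * stride + kx - pad
        return x[ci][ih][iw] if 0 <= ih < H and 0 <= iw < W else 0.0

    out = [[[0.0] * OW for _ in range(OH)] for _ in range(Cout)]
    for m in range(Cout):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += a_elem(m, k) * b_elem(k, n)
            out[m][n // OW][n % OW] = acc
    return out
```

A real GPU kernel would tile the M, N and K loops across workgroups and shared memory; the sketch keeps only the index mapping, which is the part that replaces im2col.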

0cc4m and others added 23 commits July 24, 2025 20:28

Vulkan: Fix fprintf format-security warning (llama/14770)

1d1866d

vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707)

c5262c2

vulkan/cuda: Fix im2col when KW!=KH (llama/14789)

2743522

The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
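The reasoning behind this fix can be checked with a small sketch. It assumes only what the commit message states, namely that the thread index is laid out as `ow + ky*OW + kx*OW*KH`. The helper names `encode_tid` and `decode_tid` are hypothetical and are not taken from the Vulkan or CUDA kernels. With that layout, `kx` has stride `OW*KH`, so the divisor used to recover it must be `OW*KH`. A divisor built from `KW` only happens to work when `KW == KH`.

```python
def encode_tid(ow, ky, kx, OW, KH):
    """Pack (ow, ky, kx) into a flat index using the layout from the commit message."""
    return ow + ky * OW + kx * OW * KH


def decode_tid(tid, OW, ksize):
    """Recover (ow, ky, kx) from a flat index.
    ksize must equal the stride of kx in the layout, i.e. OW * KH."""
    kx = tid // ksize
    ky = (tid - kx * ksize) // OW
    ow = tid % OW
    return ow, ky, kx


def roundtrip_ok(OW, KH, KW, ksize):
    """True if every (ow, ky, kx) survives encode followed by decode."""
    return all(
        decode_tid(encode_tid(ow, ky, kx, OW, KH), OW, ksize) == (ow, ky, kx)
        for kx in range(KW)
        for ky in range(KH)
        for ow in range(OW)
    )
```

For a non-square kernel such as `KH = 3`, `KW = 2` with `OW = 5`, `roundtrip_ok(5, 3, 2, 5 * 3)` holds, while a divisor of `5 * 2` decodes some indices to the wrong kernel position.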

kleidiai: add support for get_rows (llama/14676)

aa65fde

* kleidiai: add support for get_rows
* apply fixes based on code review
* apply more fixes based on code review

sycl: Fix im2col (llama/14797)

de18e9a

opencl: add conv2d kernel (llama/14403)

58f4832

* add conv2d kernel
* fix trailing whitespace
* whitespace fixe
* handle f16 input and f16 kernel, more opt
* resolve conflicts
* use enqueue_ndrange_kernel

opencl: fix im2col when KW!=KH (llama/14803)

64088bb

cuda: remove linking to cublasLt (llama/14790)

5485663

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

opencl: remove unreachable return (llama/14806)

5b97e8e

cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)

1dad821

* implement bf16 cpy ops and enable bf16 cont
* deduplicate copy functions
* deduplicate checks

vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817)

6bc9c97

CUDA: add fused rms norm (llama/14800)

b06d9cb

CANN: weight format to NZ for Ascend310P3 (llama/14407)

0ed1969

* weight format to nz for 310p
* remove quant weight format to nz
* clean code
* fix
* make the conditions for converting weights to NZ format consistent
* clean code

ggml: fix loongarch quantize_row_q8_1 error (llama/14827)

a44689f

tests : add non-cont K,V FA tests

d495158

ggml-ci

CUDA: fix quantized KV cache + multiple sequences (llama/14822)

e27e2cd

* CUDA: fix quantized KV cache + multiple sequences
* Update src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

CUDA: fix compilation with GGML_CUDA_F16 (llama/14837)

21c3ebd

CUDA: fix overflow in FA, tune performance (llama/14840)

5d1cc39

sycl: fix undefined variable in work group size check (llama/14843)

1d54f61

metal : fix fusion across different encoders (llama/14849)

e282986

* metal : fix fusion across different encoders ggml-ci
* cont : add assertion ggml-ci

sycl: fixed semantics of block offset calculation (llama/14814)

8dcd3dc

sync : llama.cpp

ee456e8

ggml-ci

danbev approved these changes Jul 24, 2025

ggerganov mentioned this pull request Jul 24, 2025

Extend test case filtering (#1308) #1309

Closed

ggerganov merged commit ac84267 into master Jul 24, 2025
15 checks passed

ggerganov deleted the sync-llama.cpp-25-07-24 branch July 24, 2025 17:57


18 participants
