latest from upstream #4

ddh0 · 2025-10-14T18:14:13Z

No description provided.

* Add buffer label and enable dawn-specific toggles to turn off some checks
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Remove some comments
* Implement overlap binary operators
* Revert "Implement overlap binary operators"

This reverts commit ed710b3.

* Disable support for non-contiguous binary_op tensors and leave note for future support

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>

anavp-nvidia and others added 9 commits October 14, 2025 11:53

cuda : remove legacy copy-op pointer indirection code (#16485)

5b6913c

* remove legacy copy-op pointer indirection code
* further removal of copy-op indirection code
* renamed check_node_graph_compatibility_and_refresh_copy_ops function

CUDA: add fp kernel for larger batch size MoE (#16512)

48e2fa9

* CUDA: kernel for larger batch sizes for MoE
* WIP
* WIP
* WIP
* WIP
* WIP
* WIP
* fixup
* tests
* Move mmq_ids_helper to mmid
* cleanup
* Remove redundant checks

CUDA: use fastdiv + ggml_cuda_mad for mmvf (#16557)

1ee9d0b

* CUDA: use fastdiv + ggml_cuda_mad for mmvf
* use bf16 directly + fix formatting
* Add exception for HIP code
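For readers unfamiliar with the term, "fastdiv" generally refers to replacing integer division by a divisor that stays fixed across many operations with a precomputed multiply and shift. The following is a minimal host-side C++ sketch of that general technique, not the kernel code from this commit. The names `FastDiv`, `make_fastdiv`, and `fastdiv_u32` are hypothetical, and the sketch assumes unsigned 32-bit operands and a non-zero divisor.

```cpp
#include <cstdint>

// Precomputed constants for dividing by one fixed divisor d (hypothetical type).
struct FastDiv {
    uint32_t multiplier;  // floor(2^32 * (2^shift - d) / d) + 1
    uint32_t shift;       // ceil(log2(d))
};

// Compute the constants once, outside the hot loop. Assumes d != 0.
inline FastDiv make_fastdiv(uint32_t d) {
    uint32_t shift = 0;
    while (shift < 32 && (uint64_t{1} << shift) < d) {
        ++shift;
    }
    // (2^shift - d) < 2^31, so the product below fits in 64 bits.
    const uint64_t numerator = (uint64_t{1} << 32) * ((uint64_t{1} << shift) - d);
    return FastDiv{static_cast<uint32_t>(numerator / d + 1), shift};
}

// Compute n / d using a multiply, an add, and a shift.
inline uint32_t fastdiv_u32(uint32_t n, FastDiv fd) {
    // High 32 bits of the 64-bit product n * multiplier.
    const uint64_t hi = (static_cast<uint64_t>(n) * fd.multiplier) >> 32;
    // The sum is kept in 64 bits so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> fd.shift);
}
```

The constants would be computed once on the host and passed to the kernel. In CUDA device code the high-half product is typically obtained with the `__umulhi` intrinsic rather than a 64-bit multiply.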

CUDA: enable FA for FP32 KV cache (#16546)

9c7185d

vulkan: Improve build time for MSVC (#16545)

7ea15bb

Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.

vulkan: Support FA with K/V in F32 (#16543)

4258e0c

CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (#…

120bf70

…16577)

vulkan: Add ACC_TYPE_VEC2 implementation (#16203)

ffa0590

Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>

metal : avoid using Metal's gpuAddress property (#16576)

fa882fd

* metal : avoid using Metal's gpuAddress property
* metal : fix rope kernels buffer check

ddh0 merged commit e8831e0 into ddh0:glm45v Oct 14, 2025
101 of 135 checks passed

github-actions bot added the Apple Metal, Nvidia GPU, Vulkan, testing, ggml, and OpenCL labels on Oct 14, 2025


7 participants
