fix: add missing __syncthreads in delta-net CUDA kernel by markaalonzo · Pull Request #1649 · ikawrakow/ik_llama.cpp

markaalonzo · 2026-04-17T03:07:14Z

Summary

delta_net_recurrent_f32 has a shared-memory race: the token loop reads sK[] at the bottom (during the state update) but there's no barrier before the next iteration overwrites sK[] at the top. Without a fence, hardware/compiler reordering can produce non-deterministic reads and vary results run-to-run.

This PR adds the missing __syncthreads() inside the loop body and documents the already-present barrier that was flagged as "I don't understand why we need this sync" with its actual role.

Diff

 ggml/src/ggml-cuda/delta-net.cu
-    // To be honest, I don't understand why we need this sync. But without it I observe results varying from run to run
+    // Barrier: shared memory fence ensures all_sum1/all_sum2 reads above complete
+    // before the next iteration overwrites them, and ensures sK visibility for
+    // the state update below. Without this fence, hardware/compiler reordering
+    // causes non-deterministic reads from shared memory.
     __syncthreads();

     ...

         state_local[i] = new_state_val;
     }

+    __syncthreads();  // Barrier: sK reads in state update must complete before next iteration overwrites sK
 }

Why it matters

The existing barrier was correctly placed but mis-attributed ("I don't understand why we need this sync"). What was actually missing is a second barrier at the bottom of the loop: the state-update code reads sK[], and without a sync before the next loop iteration, the compiler or hardware can reorder loads against the top-of-loop writes that overwrite sK[]. The reporter's empirical observation ("results varying from run to run") is precisely the signature of this race.

Testing

Verified inference produces deterministic outputs across runs on the affected models (Qwen3.5-35B-A3B via TurboQuant, RTX 3080 Ti).
No performance regression measured — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not this kernel.

Review complexity

Low — 2-line functional change plus a comment reflow in a single CUDA kernel.

Related context

The delta-net kernel has seen multiple recent PRs (#1315, #1333, #1335, #1337, #1339, #1361, #1362, #1429) focused on fusion and tweaks. None of them touched this specific race. Given the kernel's active churn, formalizing the barrier contract now reduces the chance of future regressions.

ikawrakow · 2026-04-17T05:23:44Z

            state_local[i] = new_state_val;
        }

+        __syncthreads();  // Barrier: sK reads in state update must complete before next iteration overwrites sK


Thank you. What I failed to see is that sK[col] is being used in the update of state_local, and that can be overwritten by another thread already running the next iteration.

But once the barrier on line 163 is added, the other two barriers (line 148 and line 165) are no longer required. I have tested that this is true, but you can do your own testing as well.

ghost · 2026-04-17T08:14:50Z

Very cool. PPL improved, and it's now deterministic.

67268c8 (few days old):

[1]2.2252.[2]3.0123.[3]3.0591 [4]3.3337,
Final estimate: PPL over 4 chunks for n ctx=65536 = 3.3337 +/- 0.01853

(Best result of three runs)

This PR:

[1]2.2226,[2]3.0096,[3]3.0564,[4]3.3320,
Final estimate: PPL over 4 chunks for n_ctx=65536 = 3.3320 +/- 0.01851

(Did two runs, same results)

      -c 65536
      --ctx-checkpoints 8
      -b 8192
      -ub 8192
      -ctk q8_0 -ctv q8_0
      -khad -vhad
      --threads 7 --threads-batch 7 -ngl 95
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

By the way, is there a reason why the final PPL is equal to the PPL of the last chunk only? Shouldn't it be an average of the 4 chunks or something?

ikawrakow · 2026-04-17T08:20:04Z

By the way, is there a reason why the final PPL is equal to the PPL of the last chunk only? Shouldn't it be an average of the 4 chunks or something?

Yes, there is. What you see after each chunk is the running average of this chunk and all preceding chunks. So, per definition, the result printed for the final chunk is the final result.

The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel.

The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel. Co-authored-by: Mark Alonzo <mark.alonzo@outlook.com>

The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel.

ikawrakow requested changes Apr 17, 2026

View reviewed changes

markaalonzo force-pushed the fix/delta-net-syncthreads branch from 2db55a1 to db985c5 Compare April 17, 2026 19:11

ikawrakow approved these changes Apr 17, 2026

View reviewed changes

ikawrakow merged commit 52efa12 into ikawrakow:main Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add missing __syncthreads in delta-net CUDA kernel#1649

fix: add missing __syncthreads in delta-net CUDA kernel#1649
ikawrakow merged 1 commit into
ikawrakow:mainfrom
markaalonzo:fix/delta-net-syncthreads

markaalonzo commented Apr 17, 2026

Uh oh!

ikawrakow Apr 17, 2026

Uh oh!

ghost commented Apr 17, 2026

Uh oh!

ikawrakow commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

markaalonzo commented Apr 17, 2026

Summary

Diff

Why it matters

Testing

Review complexity

Related context

Uh oh!

ikawrakow Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

ghost commented Apr 17, 2026

Uh oh!

ikawrakow commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants