Skip to content

fix: add missing __syncthreads in delta-net CUDA kernel#1649

Merged
ikawrakow merged 1 commit into
ikawrakow:mainfrom
markaalonzo:fix/delta-net-syncthreads
Apr 17, 2026
Merged

fix: add missing __syncthreads in delta-net CUDA kernel#1649
ikawrakow merged 1 commit into
ikawrakow:mainfrom
markaalonzo:fix/delta-net-syncthreads

Conversation

@markaalonzo
Copy link
Copy Markdown
Contributor

Summary

delta_net_recurrent_f32 has a shared-memory race: the token loop reads sK[] at the bottom (during the state update) but there's no barrier before the next iteration overwrites sK[] at the top. Without a fence, hardware/compiler reordering can produce non-deterministic reads and vary results run-to-run.

This PR adds the missing __syncthreads() inside the loop body and documents the already-present barrier that was flagged as "I don't understand why we need this sync" with its actual role.

Diff

 ggml/src/ggml-cuda/delta-net.cu
-    // To be honest, I don't understand why we need this sync. But without it I observe results varying from run to run
+    // Barrier: shared memory fence ensures all_sum1/all_sum2 reads above complete
+    // before the next iteration overwrites them, and ensures sK visibility for
+    // the state update below. Without this fence, hardware/compiler reordering
+    // causes non-deterministic reads from shared memory.
     __syncthreads();

     ...

         state_local[i] = new_state_val;
     }

+    __syncthreads();  // Barrier: sK reads in state update must complete before next iteration overwrites sK
 }

Why it matters

The existing barrier was correctly placed but mis-attributed ("I don't understand why we need this sync"). What was actually missing is a second barrier at the bottom of the loop: the state-update code reads sK[], and without a sync before the next loop iteration, the compiler or hardware can reorder loads against the top-of-loop writes that overwrite sK[]. The reporter's empirical observation ("results varying from run to run") is precisely the signature of this race.

Testing

  • Verified inference produces deterministic outputs across runs on the affected models (Qwen3.5-35B-A3B via TurboQuant, RTX 3080 Ti).
  • No performance regression measured — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not this kernel.

Review complexity

Low — 2-line functional change plus a comment reflow in a single CUDA kernel.

Related context

The delta-net kernel has seen multiple recent PRs (#1315, #1333, #1335, #1337, #1339, #1361, #1362, #1429) focused on fusion and tweaks. None of them touched this specific race. Given the kernel's active churn, formalizing the barrier contract now reduces the chance of future regressions.

Comment thread ggml/src/ggml-cuda/delta-net.cu Outdated
state_local[i] = new_state_val;
}

__syncthreads(); // Barrier: sK reads in state update must complete before next iteration overwrites sK
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. What I failed to see is that sK[col] is being used in the update of state_local, and that can be overwritten by another thread already running the next iteration.

But once the barrier on line 163 is added, the other two barriers (line 148 and line 165) are no longer required. I have tested that this is true, but you can do your own testing as well.

@ghost
Copy link
Copy Markdown

ghost commented Apr 17, 2026

Very cool. PPL improved, and it's now deterministic.

67268c8 (few days old):

[1]2.2252.[2]3.0123.[3]3.0591 [4]3.3337,
Final estimate: PPL over 4 chunks for n ctx=65536 = 3.3337 +/- 0.01853

(Best result of three runs)

This PR:

[1]2.2226,[2]3.0096,[3]3.0564,[4]3.3320,
Final estimate: PPL over 4 chunks for n_ctx=65536 = 3.3320 +/- 0.01851

(Did two runs, same results)

      -c 65536
      --ctx-checkpoints 8
      -b 8192
      -ub 8192
      -ctk q8_0 -ctv q8_0
      -khad -vhad
      --threads 7 --threads-batch 7 -ngl 95
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

By the way, is there a reason why the final PPL is equal to the PPL of the last chunk only? Shouldn't it be an average of the 4 chunks or something?

@ikawrakow
Copy link
Copy Markdown
Owner

By the way, is there a reason why the final PPL is equal to the PPL of the last chunk only? Shouldn't it be an average of the 4 chunks or something?

Yes, there is. What you see after each chunk is the running average of this chunk and all preceding chunks. So, per definition, the result printed for the final chunk is the final result.

The token loop reads sK[] in the state update (bottom of loop) but has
no barrier before the next iteration overwrites sK[] (top of loop).
Without an explicit memory fence, hardware/compiler reordering can
cause non-deterministic reads from shared memory.

Per review: with this barrier in place, the prior __syncthreads() after
the cross-warp reduction and the one immediately after loop exit are
both redundant. The new barrier is a full block-level fence that also
orders all_sum1/all_sum2 reads vs. the next iteration's writes, and
every thread reaches it before leaving the loop. Both redundant
barriers removed.

No performance impact — GPU utilization is 31-33% during inference,
bottlenecked by CPU MoE expert computation, not the CUDA kernel.
@markaalonzo markaalonzo force-pushed the fix/delta-net-syncthreads branch from 2db55a1 to db985c5 Compare April 17, 2026 19:11
@ikawrakow ikawrakow merged commit 52efa12 into ikawrakow:main Apr 17, 2026
markaalonzo added a commit to markaalonzo/ik_llama.cpp that referenced this pull request Apr 17, 2026
The token loop reads sK[] in the state update (bottom of loop) but has
no barrier before the next iteration overwrites sK[] (top of loop).
Without an explicit memory fence, hardware/compiler reordering can
cause non-deterministic reads from shared memory.

Per review: with this barrier in place, the prior __syncthreads() after
the cross-warp reduction and the one immediately after loop exit are
both redundant. The new barrier is a full block-level fence that also
orders all_sum1/all_sum2 reads vs. the next iteration's writes, and
every thread reaches it before leaving the loop. Both redundant
barriers removed.

No performance impact — GPU utilization is 31-33% during inference,
bottlenecked by CPU MoE expert computation, not the CUDA kernel.

Co-authored-by: Mark Alonzo <mark.alonzo@outlook.com>
SamuelOliveirads pushed a commit to SamuelOliveirads/ik_llama.cpp that referenced this pull request Apr 27, 2026
The token loop reads sK[] in the state update (bottom of loop) but has
no barrier before the next iteration overwrites sK[] (top of loop).
Without an explicit memory fence, hardware/compiler reordering can
cause non-deterministic reads from shared memory.

Per review: with this barrier in place, the prior __syncthreads() after
the cross-warp reduction and the one immediately after loop exit are
both redundant. The new barrier is a full block-level fence that also
orders all_sum1/all_sum2 reads vs. the next iteration's writes, and
every thread reaches it before leaving the loop. Both redundant
barriers removed.

No performance impact — GPU utilization is 31-33% during inference,
bottlenecked by CPU MoE expert computation, not the CUDA kernel.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants