fix: add missing __syncthreads in delta-net CUDA kernel#1649
Conversation
| state_local[i] = new_state_val; | ||
| } | ||
|
|
||
| __syncthreads(); // Barrier: sK reads in state update must complete before next iteration overwrites sK |
There was a problem hiding this comment.
Thank you. What I failed to see is that sK[col] is being used in the update of state_local, and that can be overwritten by another thread already running the next iteration.
But once the barrier on line 163 is added, the other two barriers (line 148 and line 165) are no longer required. I have tested that this is true, but you can do your own testing as well.
|
Very cool. PPL improved, and it's now deterministic. 67268c8 (few days old): (Best result of three runs) This PR: (Did two runs, same results) By the way, is there a reason why the final PPL is equal to the PPL of the last chunk only? Shouldn't it be an average of the 4 chunks or something? |
Yes, there is. What you see after each chunk is the running average of this chunk and all preceding chunks. So, per definition, the result printed for the final chunk is the final result. |
The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel.
2db55a1 to
db985c5
Compare
The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel. Co-authored-by: Mark Alonzo <mark.alonzo@outlook.com>
The token loop reads sK[] in the state update (bottom of loop) but has no barrier before the next iteration overwrites sK[] (top of loop). Without an explicit memory fence, hardware/compiler reordering can cause non-deterministic reads from shared memory. Per review: with this barrier in place, the prior __syncthreads() after the cross-warp reduction and the one immediately after loop exit are both redundant. The new barrier is a full block-level fence that also orders all_sum1/all_sum2 reads vs. the next iteration's writes, and every thread reaches it before leaving the loop. Both redundant barriers removed. No performance impact — GPU utilization is 31-33% during inference, bottlenecked by CPU MoE expert computation, not the CUDA kernel.
Summary
delta_net_recurrent_f32has a shared-memory race: the token loop readssK[]at the bottom (during the state update) but there's no barrier before the next iteration overwritessK[]at the top. Without a fence, hardware/compiler reordering can produce non-deterministic reads and vary results run-to-run.This PR adds the missing
__syncthreads()inside the loop body and documents the already-present barrier that was flagged as "I don't understand why we need this sync" with its actual role.Diff
Why it matters
The existing barrier was correctly placed but mis-attributed ("I don't understand why we need this sync"). What was actually missing is a second barrier at the bottom of the loop: the state-update code reads
sK[], and without a sync before the next loop iteration, the compiler or hardware can reorder loads against the top-of-loop writes that overwritesK[]. The reporter's empirical observation ("results varying from run to run") is precisely the signature of this race.Testing
Review complexity
Low — 2-line functional change plus a comment reflow in a single CUDA kernel.
Related context
The delta-net kernel has seen multiple recent PRs (#1315, #1333, #1335, #1337, #1339, #1361, #1362, #1429) focused on fusion and tweaks. None of them touched this specific race. Given the kernel's active churn, formalizing the barrier contract now reduces the chance of future regressions.