The idea for the GPU implementation on HVM4 is to use deterministic sharding by index and address region. There is no work stealing, and each pass uses a frontier array. We use region ownership for writes, meaning a block only writes inside its own region, and we defer any DUP that requires writing outside the region.
Regions
We partition the heap into fixed contiguous regions by address:
region_id = loc / REGION_SIZE
Where each region is assigned to exactly one block. Blocks can read any region, but only write inside their own. This avoids cross-block atomics on shared DUP slots and allows safe deferral for DUPs.
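As a sketch, the ownership rule can be expressed as follows (REGION_SIZE and the function names are illustrative; the real region size is a tuning parameter):

```python
REGION_SIZE = 8  # illustrative; the real size is a tuning parameter

def region_id(loc):
    # Map a heap address to the region (and thus the block) that owns it.
    return loc // REGION_SIZE

def may_write(block_id, loc):
    # The "owner writes" rule: a block writes only inside its own region.
    return region_id(loc) == block_id

# Any block can read loc 10, but only block 1 may write it (10 // 8 == 1).
```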
Sharding
We use a frontier array per pass, and each pass processes the frontier and builds next:
frontier = [root_loc]
while frontier not empty:
    clear next_frontier (per region)
    launch kernel(frontier, next_frontier)
    frontier = concat(next_frontier[all regions])
Ordering is fixed, and the kernel is embarrassingly parallel over the frontier array.
Example:
@main = #Pair{#Pair{1, 2}, #Pair{3, 4}}
Conceptual heap:
[0] := Pair (loc1, loc2)
[1] := Pair (loc3, loc4)
[2] := Pair (loc5, loc6)
[3] := 1
[4] := 2
[5] := 3
[6] := 4
Frontier passes:
Pass 0:
frontier = [0]
kernel processes loc 0, enqueues children 1, 2.
next frontier = [1, 2]

Pass 1:
frontier = [1, 2]
kernel processes loc 1, enqueues 3, 4
kernel processes loc 2, enqueues 5, 6
next frontier = [3, 4, 5, 6]

Pass 2:
frontier = [3, 4, 5, 6]
kernel processes all, enqueues nothing, and stops.

This is a BFS-like expansion.
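The passes above can be simulated directly. A minimal sketch, with the conceptual heap encoded as a child table (leaves have no children):

```python
# Heap from the example: location -> child locations.
heap = {
    0: (1, 2),   # Pair (loc1, loc2)
    1: (3, 4),   # Pair (loc3, loc4)
    2: (5, 6),   # Pair (loc5, loc6)
    3: (), 4: (), 5: (), 6: (),  # NUM leaves 1..4
}

def run_passes(root):
    # CPU-side loop: each iteration is one kernel pass over the frontier.
    frontier = [root]
    passes = []
    while frontier:
        passes.append(frontier)
        next_frontier = []
        for loc in frontier:  # each entry is one independent kernel task
            next_frontier.extend(heap[loc])
        frontier = next_frontier
    return passes

# run_passes(0) yields the three frontiers from the example:
# [[0], [1, 2], [3, 4, 5, 6]]
```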
Each thread reduces one task using a local WNF step; when it needs subterms, it enqueues them to the next frontier.
Handling DUPs (Option 1)
We use a non-blocking try_take on the expression slot.
If this thread owns the slot's region, it may attempt to take and write SUB. Otherwise, it must defer to the owner if SUB is not already there.
This introduces a synchronization point across blocks for deferral, so it's not a zero communication option.
try_take must be atomic and non-blocking. A non-owner never writes; it defers if SUB is not present. If a non-owner sees SUB, it can read it and continue reducing.
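A sketch of the owner-check-then-take logic, with hypothetical encodings (SUB as a sentinel value, the heap as a dict; on the GPU, try_take would be a single atomic exchange):

```python
SUB = "SUB"          # hypothetical sentinel marking a taken slot
REGION_SIZE = 8

heap = {1: "SUP(1,2)"}  # the shared DUP expression slot

def try_take(loc):
    # Non-blocking take: on a GPU this read+write is one atomic exchange.
    val = heap.get(loc)
    if val == SUB:
        return None      # already taken; read the substitution instead
    heap[loc] = SUB      # the winner writes SUB
    return val

def handle_dp(block_id, expr_loc, defer_queue):
    owner = expr_loc // REGION_SIZE
    if heap.get(expr_loc) == SUB:
        return "read SUB, continue"        # any block may read
    if owner == block_id:
        taken = try_take(expr_loc)         # owner-only write
        return f"took {taken}"
    defer_queue.append((owner, expr_loc))  # non-owner never writes: defer
    return "deferred"
```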
Handling DUPs (Option 2)
If we want zero inter-block messaging, we can drop the deferral and allow any thread to resolve DUP slots using global atomics. This removes region ownership for DUP slots: any thread that reaches DP0/DP1 may attempt try_take on the shared expression slot.
The winner writes SUB, losers read SUB and continue. This introduces higher contention on the hot DUP slots and the global atomic.
It trades determinism and locality for zero messaging.
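A minimal sketch of the winner/loser race under this option, with a hypothetical slot encoding; on the GPU the read-and-swap below would be a single atomic exchange or CAS:

```python
SUB = "SUB"
slots = {1: ("SUP", 1, 2)}  # shared DUP expression slot (hypothetical encoding)

def resolve_dup(loc):
    # Any thread may race on the shared slot; exactly one wins.
    val = slots[loc]
    if val == SUB:
        return ("loser", None)  # losers read SUB and continue
    slots[loc] = SUB            # the winner takes the expression
    return ("winner", val)

# Two threads reaching DP0/DP1 on the same slot: the first call wins
# and takes the SUP; the second sees SUB and continues as a loser.
```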
Example: handling DUP with deferral
Term:
@main = ! x &L= &L{1,2}; #Pair{ x0, #Pair{3, x1} }
Heap:
[0] := DUP (expr: 1, body: 2) : Block 0
[1] := SUP (1, 2) : Block 0
[2] := Pair (loc 8, loc 16) : Block 0
[8] := DP0 (expr: 1) : Block 1
[16] := Pair (loc 17, loc 18) : Block 2
[17] := NUM 3 : Block 2
[18] := DP1 (expr: 1) : Block 2
Assuming 3 blocks, 4 threads per block, region_size = 8.
Pass 0 (frontier = [0]):
- B0/t0 handles DUP, WNF returns body loc 2.
- Enqueue children of loc 2 (8, 16).

Pass 1:
frontier_region_1 = [8]
frontier_region_2 = [16]

- B1/t0 handles loc 8 (DP0): expr loc 1 is owned by B0 and SUB is not present => defer loc 8 to region 0.
- B2/t0 handles loc 16 (Pair): enqueue loc 17 (NUM) and loc 18 (DP1).

Pass 2:
frontier_region_0 = [8] // deferred DP0
frontier_region_2 = [18]

- B0/t0 handles loc 8: it owns loc 1, try_take succeeds, it writes SUB, and DP0 reduces to 1.
- B2/t0 handles loc 18: it reads expr slot loc 1, SUB is there, and DP1 reduces to 2.
If B2 reaches loc 18 before B0 resolves loc 8, it would defer that loc again to B0, which would resolve it in the next pass.
There is no cross-block spinning.
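The deferral flow can be simulated pass by pass. This sketch models the fallback ordering noted above, where both DP nodes end up deferred to B0's region before it resolves them (names and encodings are hypothetical):

```python
REGION_SIZE = 8
SUB = "SUB"
cells = {1: "SUP"}        # only the shared expression slot (loc 1) matters here
EXPR_OF = {8: 1, 18: 1}   # DP0 (loc 8) and DP1 (loc 18) both point at expr loc 1

def step(frontier_by_region):
    # One dispatch: entries in region r are handled by block r.
    next_by_region = {}
    for block, locs in frontier_by_region.items():
        for loc in locs:
            expr = EXPR_OF[loc]
            owner = expr // REGION_SIZE
            if cells[expr] == SUB:
                continue                  # read SUB; the node reduces, done
            if owner == block:
                cells[expr] = SUB         # owner takes and writes SUB
            else:
                next_by_region.setdefault(owner, []).append(loc)  # defer
    return next_by_region

# Pass with DP0 in region 1 and DP1 in region 2: both defer to region 0.
# The next pass resolves both in B0's region, and the frontier empties.
```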
FFI and primitives
Inside %gnf, if a primitive is encountered, we abort GPU execution for simplicity.
GNF and NF semantics
%gnf:
When evaluating a program, the CPU evaluates up to %gnf, then suspends until the GPU finishes.
The GPU runs wnf/snf on the given term with exclusive access to the heap. When it completes, the CPU resumes with the updated heap.
Nested %gnf calls are flattened (if already on the GPU, they are treated as no-ops).
%nf:
Runs on CPU only. If called while the CPU is already running SNF with a thread pool, I think ideally we want to reuse the existing pool, because spawning a new one would mess with memory.
Metal specifics
Metal does not support GPU-side kernel launches or grid-wide barriers, so the GPU runs one pass per dispatch and the CPU orchestrates the global loop:
frontier = [root_loc]
while frontier not empty:
    clear next_frontier (per region)
    dispatch step_frontier(frontier, next_frontier)
    wait for completion
    frontier = concat(next_frontier[all regions])
To keep region deferral deterministic in a single dispatch, each frontier entry can be packed with its owner region:
entry = (owner_region << OWNER_SHIFT) | loc
The kernel decodes owner_region from the entry and uses it when deciding whether a DP0/DP1 can try_take (owner only) or must defer. This avoids per-region dispatch loops while preserving the “owner writes” rule.
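A sketch of the packing, with an illustrative OWNER_SHIFT (the real split depends on the address width of the heap):

```python
OWNER_SHIFT = 48                 # illustrative: low 48 bits for loc, high bits for region
LOC_MASK = (1 << OWNER_SHIFT) - 1

def pack(owner_region, loc):
    # Tag a frontier entry with the region that owns it.
    return (owner_region << OWNER_SHIFT) | loc

def unpack(entry):
    # The kernel decodes the owner to decide: try_take (owner) or defer.
    return entry >> OWNER_SHIFT, entry & LOC_MASK
```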
Atomics:
- Use 64-bit atomics (atomic_ulong) on M4-class GPUs for heap cells and DUP slots.
Buffers and memory:
- Prototype with MTLStorageModeShared for simplicity.
- For performance, move heap/frontier buffers to MTLStorageModePrivate and use explicit staging.