Port mapreduce optimization from CUDA #627
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

```diff
diff --git a/perf/array.jl b/perf/array.jl
index 21cd4677..4df2789c 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -68,7 +68,7 @@ gpu_vec_ints = reshape(gpu_mat_ints, length(gpu_mat_ints))
 # 'evals=1' added to prevent hang when running benchmarks of CI
 # TODO: Investigate cause and properly fix.
-group["construct"] = @benchmarkable MtlArray{Int,1}(undef, 1) evals=1
+group["construct"] = @benchmarkable MtlArray{Int, 1}(undef, 1) evals = 1
 group["broadcast"] = @benchmarkable Metal.@sync $gpu_mat .= 0f0
```
Metal Benchmarks
| Benchmark suite | Current: e01d410 | Previous: 4b05d30 | Ratio |
|---|---|---|---|
| latency/precompile | 9944234125 ns | 9958861667 ns | 1.00 |
| latency/ttfp | 4031131250 ns | 4033351208 ns | 1.00 |
| latency/import | 1296414896 ns | 1297505000 ns | 1.00 |
| integration/metaldevrt | 903042 ns | 902875 ns | 1.00 |
| integration/byval/slices=1 | 1635041.5 ns | 1648541 ns | 0.99 |
| integration/byval/slices=3 | 19608167 ns | 22029125 ns | 0.89 |
| integration/byval/reference | 1620250 ns | 1623791 ns | 1.00 |
| integration/byval/slices=2 | 2772625.5 ns | 2786541 ns | 1.00 |
| kernel/indexing | 468500 ns | 520645.5 ns | 0.90 |
| kernel/indexing_checked | 507333 ns | 518542 ns | 0.98 |
| kernel/launch | 8666.5 ns | 7917 ns | 1.09 |
| array/construct | 5833 ns | 6347.333333333333 ns | 0.92 |
| array/broadcast | 536125 ns | 543229 ns | 0.99 |
| array/random/randn/Float32 | 903979 ns | 950624.5 ns | 0.95 |
| array/random/randn!/Float32 | 598458 ns | 600750 ns | 1.00 |
| array/random/rand!/Int64 | 551083 ns | 551666 ns | 1.00 |
| array/random/rand!/Float32 | 552208.5 ns | 549417 ns | 1.01 |
| array/random/rand/Int64 | 942833.5 ns | 938188 ns | 1.00 |
| array/random/rand/Float32 | 858354 ns | 869917 ns | 0.99 |
| array/accumulate/Int64/1d | 1389916.5 ns | 1410625 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1928916 ns | 1979291 ns | 0.97 |
| array/accumulate/Int64/dims=2 | 2308792 ns | 2362542 ns | 0.98 |
| array/accumulate/Int64/dims=1L | 12566812.5 ns | 12465354 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 9999062.5 ns | 10148791 ns | 0.99 |
| array/accumulate/Float32/1d | 1185270.5 ns | 1224854 ns | 0.97 |
| array/accumulate/Float32/dims=1 | 1690354.5 ns | 1703208 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 2100500 ns | 2167708 ns | 0.97 |
| array/accumulate/Float32/dims=1L | 10556291.5 ns | 10650916.5 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 7568979.5 ns | 7679250 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 1271959 ns | 1263833.5 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 1162583 ns | 1165042 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 1302646 ns | 1280125 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1L | 2168667 ns | 20815958.5 ns | 0.10 |
| array/reductions/reduce/Int64/dims=2L | 3589625 ns | 3593875 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 806083 ns | 798792 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 833792 ns | 852417 ns | 0.98 |
| array/reductions/reduce/Float32/dims=2 | 747125 ns | 759000 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1L | 1864958.5 ns | 7077896 ns | 0.26 |
| array/reductions/reduce/Float32/dims=2L | 1891104.5 ns | 1897354.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 1285667 ns | 1295374.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1171709 ns | 1157625 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2 | 1291125 ns | 1294375 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 2211625 ns | 20902334 ns | 0.11 |
| array/reductions/mapreduce/Int64/dims=2L | 3597875 ns | 3606709 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 790917 ns | 818041.5 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 834333.5 ns | 852604.5 ns | 0.98 |
| array/reductions/mapreduce/Float32/dims=2 | 747625 ns | 749937.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 1823917 ns | 7111292 ns | 0.26 |
| array/reductions/mapreduce/Float32/dims=2L | 1913167 ns | 1908333.5 ns | 1.00 |
| array/private/copyto!/gpu_to_gpu | 582833 ns | 590083.5 ns | 0.99 |
| array/private/copyto!/cpu_to_gpu | 728708 ns | 711041 ns | 1.02 |
| array/private/copyto!/gpu_to_cpu | 750708 ns | 770062.5 ns | 0.97 |
| array/private/iteration/findall/int | 1626750 ns | 1686125 ns | 0.96 |
| array/private/iteration/findall/bool | 1558291.5 ns | 1580166 ns | 0.99 |
| array/private/iteration/findfirst/int | 1836041 ns | 1852416 ns | 0.99 |
| array/private/iteration/findfirst/bool | 1727083 ns | 1757666.5 ns | 0.98 |
| array/private/iteration/scalar | 3240167 ns | 3010667 ns | 1.08 |
| array/private/iteration/logical | 2693625 ns | 2785479.5 ns | 0.97 |
| array/private/iteration/findmin/1d | 1804708 ns | 1798000 ns | 1.00 |
| array/private/iteration/findmin/2d | 1533750 ns | 1541250 ns | 1.00 |
| array/private/copy | 897209 ns | 845208 ns | 1.06 |
| array/shared/copyto!/gpu_to_gpu | 78917 ns | 82750 ns | 0.95 |
| array/shared/copyto!/cpu_to_gpu | 81000 ns | 83417 ns | 0.97 |
| array/shared/copyto!/gpu_to_cpu | 79000 ns | 79208 ns | 1.00 |
| array/shared/iteration/findall/int | 1631542 ns | 1688833 ns | 0.97 |
| array/shared/iteration/findall/bool | 1577458.5 ns | 1594000 ns | 0.99 |
| array/shared/iteration/findfirst/int | 1486229 ns | 1482792 ns | 1.00 |
| array/shared/iteration/findfirst/bool | 1407083 ns | 1418770.5 ns | 0.99 |
| array/shared/iteration/scalar | 156708 ns | 162375 ns | 0.97 |
| array/shared/iteration/logical | 2492563 ns | 2631645.5 ns | 0.95 |
| array/shared/iteration/findmin/1d | 1443541.5 ns | 1454209 ns | 0.99 |
| array/shared/iteration/findmin/2d | 1545937.5 ns | 1555083 ns | 0.99 |
| array/shared/copy | 205000 ns | 215125 ns | 0.95 |
| array/permutedims/4d | 2599791 ns | 2652167 ns | 0.98 |
| array/permutedims/2d | 1275166 ns | 1302500 ns | 0.98 |
| array/permutedims/3d | 1947125 ns | 1987209 ns | 0.98 |
| metal/synchronization/stream | 14875 ns | 15041 ns | 0.99 |
| metal/synchronization/context | 15500 ns | 15667 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #627      +/-   ##
==========================================
- Coverage   80.92%   80.61%   -0.32%
==========================================
  Files          61       61
  Lines        2705     2713       +8
==========================================
- Hits         2189     2187       -2
- Misses        516      526      +10
```

☔ View full report in Codecov by Sentry.
Force-pushed from e16f284 to 186685d.
maleadt left a comment:
Conceptually LGTM, but there seem to be a bunch of unrelated changes in here?
Force-pushed from 392c2b7 to 5c615d1.
I was debugging the hanging benchmark CI, which I think is caused by the signposts PR making the "construct" benchmark run for many more iterations. That means much more memory usage, which I believe sometimes crosses the line on the 8 GB of RAM of the CI Mac minis. I set `evals=1` on that benchmark (see the sketch below).
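As a rough illustration of what `evals=1` changes (the suite name and array type follow the diff above; everything else here is an illustrative assumption, not project code):

```julia
# Sketch: with evals=1, every sample performs exactly one evaluation, so the
# total number of tiny MtlArray allocations is bounded by the sample count
# (default 10000) instead of samples * evals as picked by BenchmarkTools' tuning.
using Metal, BenchmarkTools

group = BenchmarkGroup()
group["construct"] = @benchmarkable MtlArray{Int,1}(undef, 1) evals=1
run(group["construct"])
```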
It's only an 8-byte array being allocated, so I cannot imagine that causing an OOM?
Maybe the GC isn't being triggered often enough to keep up with all the allocations from the ObjectiveC.jl machinery? Also, maybe a full 16 KiB page is reserved for every buffer allocation? Just running the construct benchmark locally with and without signposts:

```julia
# No signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.650043 seconds (23.76 M allocations: 714.837 MiB, 4.81% gc time, 7.83% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  3.994 μs … 16.820 ms  ┊ GC (min … max): 0.00% … 8.10%
 Time  (median):     5.327 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.034 μs ± 168.149 μs ┊ GC (mean ± σ):  1.94% ± 0.08%

        ▁▁▂▂▃▃▃▄▇▆██▆▇▇▇▆▇█▅▆▅▅▄▃▄▂▂ ▁
  ▁▁▂▂▃▃▄▅▄▅▆▇▇██████████████████████████████▇▇▇▆▆▄▃▃▃▂▂▂▂▂▂▂ ▅
  3.99 μs        Histogram: frequency by time        6.77 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

# With signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
 14.749792 seconds (7.28 M allocations: 256.324 MiB, 1.21% gc time, 2.56% compilation time)
BenchmarkTools.Trial: 7284 samples with 7 evaluations per sample.
 Range (min … max):  4.661 μs … 484.225 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):     15.619 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.726 μs ± 5.705 ms  ┊ GC (mean ± σ):  0.32% ± 0.01%

       ▃▇██▇▄▄▁
  ▃▃▃▃▃▄▄▃▄▆█████████▇▆▅▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▂▂ ▃
  4.66 μs        Histogram: frequency by time        49.1 μs <

 Memory estimate: 1.22 KiB, allocs estimate: 39.
```
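One way to poke at the "GC isn't keeping up" theory (purely an illustrative sketch; the loop count and GC cadence are arbitrary assumptions, not part of this PR):

```julia
# Hypothetical experiment: allocate many tiny MtlArrays and watch the
# process's resident memory with and without the periodic collections below.
# If memory only stays flat when GC.gc is forced, the CI hang is plausibly
# the GC falling behind rather than the allocations themselves.
using Metal

for i in 1:100_000
    MtlArray{Int,1}(undef, 1)            # same allocation the benchmark performs
    i % 10_000 == 0 && GC.gc(false)      # incremental collection every 10k iterations
end
```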
BenchmarkTools.jl should force-run the GC after every sample (that's what `gcsample=true` is for).

That's still only 1 GiB:

```julia
julia> Base.format_bytes(10000*7*16*1024) # samples * evals * 16 KiB
"1.068 GiB"
```
Force-pushed from fd0fbb9 to 8e2ddc3.
I tried that, but it would invalidate the old benchmark data, and while I'm not opposed to resetting the historical data for that test, here is what it looks like with `gcsample=true`:

```julia
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.656013 seconds (21.47 M allocations: 608.718 MiB, 5.34% gc time, 0.34% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.095 μs … 20.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.518 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.593 μs ± 773.620 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▂▃▂▅▄▇▅▇▇▅█▆▇▅▇▆▄▆▄▄▂▂▁
  ▁▁▂▃▃▄▅▆▆█████████████████████████████▇█▆▅▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁ ▅
  4.1 μs        Histogram: frequency by time        7.76 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1) gcsample=true
 10.590237 seconds (2.11 M allocations: 105.136 MiB, 96.13% gc time, 3.44% compilation time)
BenchmarkTools.Trial: 78 samples with 6 evaluations per sample.
 Range (min … max):  10.639 μs … 23.403 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.253 μs ±  2.559 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅   ▅  ▂ ▅    ▂▅  ▂▂      █▂
  ▅▁▁▁█████▁▅▅█▅▅▁▅█▅█▁███▁████████▅██▁▁█▅▁▅▅▁█▁▁█▁▁▁▁▁▁▁▁▁▅▅ ▁
  10.6 μs        Histogram: frequency by time        21.4 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.
```

I would like to investigate more deeply why this is happening when I have the time, but I feel like this is probably good enough for now to get the benchmarks not hanging anymore. It wouldn't surprise me if it ended up being a missing autoreleasepool or something. I've cleaned up the commit history because I don't think we should squash-merge this PR.
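If someone does pick this up later, a minimal sketch of how one might test the missing-autoreleasepool theory (this assumes ObjectiveC.jl's `@autoreleasepool` macro and is illustrative only, not code from this PR):

```julia
# Hypothetical experiment: if each tiny MtlArray constructor leaves
# autoreleased Objective-C objects behind, draining a pool around the
# allocation should noticeably lower resident memory compared to running
# the same loop without the pool.
using Metal, ObjectiveC

for _ in 1:10_000
    ObjectiveC.@autoreleasepool begin
        MtlArray{Int,1}(undef, 1)
    end
end
```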
Oof, that shouldn't happen. In any case, it doesn't seem worth your time to investigate this detail, so let's go ahead with `evals=1`.
Seems to solve some hangs on the CI runners
Force-pushed from 8e2ddc3 to e01d410.
Depends on #626.

This is so that benchmarking against AK implementations doesn't show a big discrepancy on CUDA vs. on Metal.
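For context, a common form of the mapreduce optimization being referenced is serial per-thread accumulation before the parallel tree reduction, so large inputs need far fewer threadgroups. A hedged sketch of that idea in plain Julia (illustrative only, not this PR's actual kernel; `mapreduce_sketch` and its parameters are made up for this example):

```julia
# Each "thread" first accumulates a strided slice of the input serially,
# producing one partial result per thread; a final reduction then combines
# the partials. On a GPU the same idea keeps the number of launched
# threadgroups small even for very large reduction dimensions.
function mapreduce_sketch(f, op, neutral, input; nthreads = 256)
    partials = fill(neutral, nthreads)
    for t in 1:nthreads
        acc = neutral
        i = t
        while i <= length(input)        # grid-stride style loop
            acc = op(acc, f(input[i]))
            i += nthreads
        end
        partials[t] = acc
    end
    return reduce(op, partials; init = neutral)
end

mapreduce_sketch(abs2, +, 0, collect(1:10_000)) == mapreduce(abs2, +, 1:10_000)  # true
```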