Conversation

@christiangnrd
Member

Depends on #626

This is so that benchmarking against AK implementations doesn't show a big discrepancy between CUDA and Metal.

@github-actions
Contributor

github-actions bot commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

diff --git a/perf/array.jl b/perf/array.jl
index 21cd4677..4df2789c 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -68,7 +68,7 @@ gpu_vec_ints = reshape(gpu_mat_ints, length(gpu_mat_ints))
 
 # 'evals=1' added to prevent hang when running benchmarks of CI
 # TODO: Investigate cause and properly fix.
-group["construct"] = @benchmarkable MtlArray{Int,1}(undef, 1) evals=1
+group["construct"] = @benchmarkable MtlArray{Int, 1}(undef, 1) evals = 1
 
 group["broadcast"] = @benchmarkable Metal.@sync $gpu_mat .= 0f0
 
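For context, a minimal sketch (illustrative only, not part of this PR) of how `evals = 1` pins BenchmarkTools' evaluation count and skips the tuning phase:

```julia
using BenchmarkTools

# Setting evals explicitly pins the number of evaluations per sample and
# bypasses the tuning phase that would otherwise pick a (possibly large)
# evaluation count automatically.
b = @benchmarkable fill(0, 1) evals = 1 samples = 100
t = run(b)

t.params.evals  # stays at 1; tuning never overrides an explicit setting
```

With `evals = 1` each sample performs exactly one allocation, so total memory traffic over the trial is bounded by the sample count.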

Contributor

@github-actions github-actions bot left a comment


Metal Benchmarks

Benchmark suite Current: e01d410 Previous: 4b05d30 Ratio
latency/precompile 9944234125 ns 9958861667 ns 1.00
latency/ttfp 4031131250 ns 4033351208 ns 1.00
latency/import 1296414896 ns 1297505000 ns 1.00
integration/metaldevrt 903042 ns 902875 ns 1.00
integration/byval/slices=1 1635041.5 ns 1648541 ns 0.99
integration/byval/slices=3 19608167 ns 22029125 ns 0.89
integration/byval/reference 1620250 ns 1623791 ns 1.00
integration/byval/slices=2 2772625.5 ns 2786541 ns 1.00
kernel/indexing 468500 ns 520645.5 ns 0.90
kernel/indexing_checked 507333 ns 518542 ns 0.98
kernel/launch 8666.5 ns 7917 ns 1.09
array/construct 5833 ns 6347.333333333333 ns 0.92
array/broadcast 536125 ns 543229 ns 0.99
array/random/randn/Float32 903979 ns 950624.5 ns 0.95
array/random/randn!/Float32 598458 ns 600750 ns 1.00
array/random/rand!/Int64 551083 ns 551666 ns 1.00
array/random/rand!/Float32 552208.5 ns 549417 ns 1.01
array/random/rand/Int64 942833.5 ns 938188 ns 1.00
array/random/rand/Float32 858354 ns 869917 ns 0.99
array/accumulate/Int64/1d 1389916.5 ns 1410625 ns 0.99
array/accumulate/Int64/dims=1 1928916 ns 1979291 ns 0.97
array/accumulate/Int64/dims=2 2308792 ns 2362542 ns 0.98
array/accumulate/Int64/dims=1L 12566812.5 ns 12465354 ns 1.01
array/accumulate/Int64/dims=2L 9999062.5 ns 10148791 ns 0.99
array/accumulate/Float32/1d 1185270.5 ns 1224854 ns 0.97
array/accumulate/Float32/dims=1 1690354.5 ns 1703208 ns 0.99
array/accumulate/Float32/dims=2 2100500 ns 2167708 ns 0.97
array/accumulate/Float32/dims=1L 10556291.5 ns 10650916.5 ns 0.99
array/accumulate/Float32/dims=2L 7568979.5 ns 7679250 ns 0.99
array/reductions/reduce/Int64/1d 1271959 ns 1263833.5 ns 1.01
array/reductions/reduce/Int64/dims=1 1162583 ns 1165042 ns 1.00
array/reductions/reduce/Int64/dims=2 1302646 ns 1280125 ns 1.02
array/reductions/reduce/Int64/dims=1L 2168667 ns 20815958.5 ns 0.10
array/reductions/reduce/Int64/dims=2L 3589625 ns 3593875 ns 1.00
array/reductions/reduce/Float32/1d 806083 ns 798792 ns 1.01
array/reductions/reduce/Float32/dims=1 833792 ns 852417 ns 0.98
array/reductions/reduce/Float32/dims=2 747125 ns 759000 ns 0.98
array/reductions/reduce/Float32/dims=1L 1864958.5 ns 7077896 ns 0.26
array/reductions/reduce/Float32/dims=2L 1891104.5 ns 1897354.5 ns 1.00
array/reductions/mapreduce/Int64/1d 1285667 ns 1295374.5 ns 0.99
array/reductions/mapreduce/Int64/dims=1 1171709 ns 1157625 ns 1.01
array/reductions/mapreduce/Int64/dims=2 1291125 ns 1294375 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 2211625 ns 20902334 ns 0.11
array/reductions/mapreduce/Int64/dims=2L 3597875 ns 3606709 ns 1.00
array/reductions/mapreduce/Float32/1d 790917 ns 818041.5 ns 0.97
array/reductions/mapreduce/Float32/dims=1 834333.5 ns 852604.5 ns 0.98
array/reductions/mapreduce/Float32/dims=2 747625 ns 749937.5 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 1823917 ns 7111292 ns 0.26
array/reductions/mapreduce/Float32/dims=2L 1913167 ns 1908333.5 ns 1.00
array/private/copyto!/gpu_to_gpu 582833 ns 590083.5 ns 0.99
array/private/copyto!/cpu_to_gpu 728708 ns 711041 ns 1.02
array/private/copyto!/gpu_to_cpu 750708 ns 770062.5 ns 0.97
array/private/iteration/findall/int 1626750 ns 1686125 ns 0.96
array/private/iteration/findall/bool 1558291.5 ns 1580166 ns 0.99
array/private/iteration/findfirst/int 1836041 ns 1852416 ns 0.99
array/private/iteration/findfirst/bool 1727083 ns 1757666.5 ns 0.98
array/private/iteration/scalar 3240167 ns 3010667 ns 1.08
array/private/iteration/logical 2693625 ns 2785479.5 ns 0.97
array/private/iteration/findmin/1d 1804708 ns 1798000 ns 1.00
array/private/iteration/findmin/2d 1533750 ns 1541250 ns 1.00
array/private/copy 897209 ns 845208 ns 1.06
array/shared/copyto!/gpu_to_gpu 78917 ns 82750 ns 0.95
array/shared/copyto!/cpu_to_gpu 81000 ns 83417 ns 0.97
array/shared/copyto!/gpu_to_cpu 79000 ns 79208 ns 1.00
array/shared/iteration/findall/int 1631542 ns 1688833 ns 0.97
array/shared/iteration/findall/bool 1577458.5 ns 1594000 ns 0.99
array/shared/iteration/findfirst/int 1486229 ns 1482792 ns 1.00
array/shared/iteration/findfirst/bool 1407083 ns 1418770.5 ns 0.99
array/shared/iteration/scalar 156708 ns 162375 ns 0.97
array/shared/iteration/logical 2492563 ns 2631645.5 ns 0.95
array/shared/iteration/findmin/1d 1443541.5 ns 1454209 ns 0.99
array/shared/iteration/findmin/2d 1545937.5 ns 1555083 ns 0.99
array/shared/copy 205000 ns 215125 ns 0.95
array/permutedims/4d 2599791 ns 2652167 ns 0.98
array/permutedims/2d 1275166 ns 1302500 ns 0.98
array/permutedims/3d 1947125 ns 1987209 ns 0.98
metal/synchronization/stream 14875 ns 15041 ns 0.99
metal/synchronization/context 15500 ns 15667 ns 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@codecov

codecov bot commented Jul 20, 2025

Codecov Report

❌ Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.61%. Comparing base (4b05d30) to head (e01d410).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/mapreduce.jl 42.85% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #627      +/-   ##
==========================================
- Coverage   80.92%   80.61%   -0.32%     
==========================================
  Files          61       61              
  Lines        2705     2713       +8     
==========================================
- Hits         2189     2187       -2     
- Misses        516      526      +10     

@christiangnrd force-pushed the mapredopt branch 4 times, most recently from e16f284 to 186685d on July 29, 2025 00:00
Member

@maleadt maleadt left a comment


Conceptually LGTM, but there seem to be a bunch of unrelated changes in here?

@christiangnrd force-pushed the mapredopt branch 2 times, most recently from 392c2b7 to 5c615d1 on July 29, 2025 14:57
@christiangnrd
Member Author

Conceptually LGTM, but there seem to be a bunch of unrelated changes in here?

I was debugging the hanging benchmark CI, which I think is caused by the signposts PR making the "construct" benchmark run for many more iterations. That means much more memory usage, which I believe sometimes crosses the 8 GB RAM limit of the CI Mac minis.

I set evals=1 on it, which should reduce memory usage and also bypass tuning. Is this a good/sufficient solution (assuming it works)?

@maleadt
Member

maleadt commented Jul 29, 2025

It's only an 8-byte array being allocated, so I can't imagine that causing an OOM?

@christiangnrd
Member Author

christiangnrd commented Jul 29, 2025

It's only an 8-byte array being allocated, so I can't imagine that causing an OOM?

Maybe the GC isn't being triggered, paired with all the allocations from the ObjectiveC.jl machinery? Also, maybe a full 16KiB page is reserved for every buffer allocation?

Just running @benchmark shows 3x more memory and allocations without the signposts. The @time number is total memory (right?), not max, but as I said above, maybe the GC isn't keeping up?
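On the "total, not max" point: `@time` (and `@timed`) report cumulative allocation across the whole expression, so many small allocations inflate the figure even when live memory stays tiny. A small sketch of that behavior (illustrative only, no Metal involved):

```julia
# @timed reports cumulative bytes allocated, not a high-water mark:
# 1000 tiny arrays show up as ~1000 allocations even though the live
# set at any moment is small.
acc = Vector{Vector{Int}}()
stats = @timed for _ in 1:1000
    push!(acc, Int[0])  # one small heap allocation per iteration
end
stats.bytes  # grows linearly with the iteration count
```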

# No signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.650043 seconds (23.76 M allocations: 714.837 MiB, 4.81% gc time, 7.83% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  3.994 μs …  16.820 ms  ┊ GC (min … max): 0.00% … 8.10%
 Time  (median):     5.327 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.034 μs ± 168.149 μs  ┊ GC (mean ± σ):  1.94% ± 0.08%

               ▁▁▂▂▃▃▃▄▇▆██▆▇▇▇▆▇█▅▆▅▅▄▃▄▂▂ ▁                  
  ▁▁▂▂▃▃▄▅▄▅▆▇▇██████████████████████████████▇▇▇▆▆▄▃▃▃▂▂▂▂▂▂▂ ▅
  3.99 μs         Histogram: frequency by time        6.77 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

#With signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
 14.749792 seconds (7.28 M allocations: 256.324 MiB, 1.21% gc time, 2.56% compilation time)
BenchmarkTools.Trial: 7284 samples with 7 evaluations per sample.
 Range (min … max):    4.661 μs … 484.225 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):      15.619 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.726 μs ±   5.705 ms  ┊ GC (mean ± σ):  0.32% ± 0.01%

             ▃▇██▇▄▄▁                                            
  ▃▃▃▃▃▄▄▃▄▆█████████▇▆▅▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▂▁▂▂ ▃
  4.66 μs          Histogram: frequency by time         49.1 μs <

 Memory estimate: 1.22 KiB, allocs estimate: 39.

@maleadt
Member

maleadt commented Jul 29, 2025

BenchmarkTools.jl should force-run the GC after every trial. You can try setting gcsample to force it to run after every sample too.
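A sketch of that suggestion (hypothetical values; `gcsample = true` runs the GC before every sample):

```julia
using BenchmarkTools

# gcsample = true makes BenchmarkTools run gc() before each sample,
# bounding allocation buildup between samples at the cost of wall time.
b = @benchmarkable fill(0, 1) evals = 1 samples = 50 gcsample = true
t = run(b)

t.params.gcsample  # true
```

Note that a full collection per sample drastically reduces how many samples fit in the time budget, which is visible in the trial below.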

Also maybe a full 16KiB page is reserved for every buffer allocation?

That's still only 1GiB:

julia> Base.format_bytes(10000*7*16*1024)  # (samples * evals * 16KiB)
"1.068 GiB"

@christiangnrd force-pushed the mapredopt branch 3 times, most recently from fd0fbb9 to 8e2ddc3 on July 29, 2025 20:35
@christiangnrd
Member Author

christiangnrd commented Jul 29, 2025

You can try setting gcsample to force it to run after every sample too.

I tried that, but it would invalidate old benchmark data, and while I'm not opposed to resetting the historical data for that test, evals=1 seems to be sufficient without needing to do so. Adding gcsample=true seems to increase timings; I'm not sure why that would happen, since AFAIK the GC time isn't counted.

julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.656013 seconds (21.47 M allocations: 608.718 MiB, 5.34% gc time, 0.34% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.095 μs …  20.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.593 μs ± 773.620 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▁▂▃▂▅▄▇▅▇▇▅█▆▇▅▇▆▄▆▄▄▂▂▁                           
  ▁▁▂▃▃▄▅▆▆█████████████████████████▇█▆▅▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁ ▅
  4.1 μs          Histogram: frequency by time        7.76 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1) gcsample=true
 10.590237 seconds (2.11 M allocations: 105.136 MiB, 96.13% gc time, 3.44% compilation time)
BenchmarkTools.Trial: 78 samples with 6 evaluations per sample.
 Range (min … max):  10.639 μs … 23.403 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.253 μs ±  2.559 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▅ ▅             ▂ ▅  ▂▅  ▂▂  █▂                         
  ▅▁▁▁█████▁▅▅█▅▅▁▅█▅█▁███▁████████▅██▁▁█▅▁▅▅▁█▁▁█▁▁▁▁▁▁▁▁▁▅▅ ▁
  10.6 μs         Histogram: frequency by time        21.4 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

I would like to investigate more deeply why this is happening when I have the time, but this is probably good enough for now to stop the benchmarks from hanging. It wouldn't surprise me if it ended up being a missing autoreleasepool or something.

I've cleaned up the commit history because I don't think we should squash.

@maleadt
Member

maleadt commented Jul 30, 2025

Adding gcsample=true seems to increase timings.

Oof, that shouldn't happen.

In any case, it doesn't seem worth your time to investigate this detail, so let's go ahead with evals=1. Maybe add a simple note on why, though.

@christiangnrd christiangnrd merged commit 422c43f into main Aug 1, 2025
6 of 7 checks passed
@christiangnrd christiangnrd deleted the mapredopt branch August 1, 2025 01:37
