Port mapreduce optimization from CUDA #627
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

```diff
diff --git a/perf/array.jl b/perf/array.jl
index 21cd4677..4df2789c 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -68,7 +68,7 @@ gpu_vec_ints = reshape(gpu_mat_ints, length(gpu_mat_ints))
 # 'evals=1' added to prevent hang when running benchmarks of CI
 # TODO: Investigate cause and properly fix.
-group["construct"] = @benchmarkable MtlArray{Int,1}(undef, 1) evals=1
+group["construct"] = @benchmarkable MtlArray{Int, 1}(undef, 1) evals = 1
 group["broadcast"] = @benchmarkable Metal.@sync $gpu_mat .= 0f0
```
Metal Benchmarks
| Benchmark suite | Current: e01d410 | Previous: 4b05d30 | Ratio |
|---|---|---|---|
| latency/precompile | 9944234125 ns | 9958861667 ns | 1.00 |
| latency/ttfp | 4031131250 ns | 4033351208 ns | 1.00 |
| latency/import | 1296414896 ns | 1297505000 ns | 1.00 |
| integration/metaldevrt | 903042 ns | 902875 ns | 1.00 |
| integration/byval/slices=1 | 1635041.5 ns | 1648541 ns | 0.99 |
| integration/byval/slices=3 | 19608167 ns | 22029125 ns | 0.89 |
| integration/byval/reference | 1620250 ns | 1623791 ns | 1.00 |
| integration/byval/slices=2 | 2772625.5 ns | 2786541 ns | 1.00 |
| kernel/indexing | 468500 ns | 520645.5 ns | 0.90 |
| kernel/indexing_checked | 507333 ns | 518542 ns | 0.98 |
| kernel/launch | 8666.5 ns | 7917 ns | 1.09 |
| array/construct | 5833 ns | 6347.333333333333 ns | 0.92 |
| array/broadcast | 536125 ns | 543229 ns | 0.99 |
| array/random/randn/Float32 | 903979 ns | 950624.5 ns | 0.95 |
| array/random/randn!/Float32 | 598458 ns | 600750 ns | 1.00 |
| array/random/rand!/Int64 | 551083 ns | 551666 ns | 1.00 |
| array/random/rand!/Float32 | 552208.5 ns | 549417 ns | 1.01 |
| array/random/rand/Int64 | 942833.5 ns | 938188 ns | 1.00 |
| array/random/rand/Float32 | 858354 ns | 869917 ns | 0.99 |
| array/accumulate/Int64/1d | 1389916.5 ns | 1410625 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1928916 ns | 1979291 ns | 0.97 |
| array/accumulate/Int64/dims=2 | 2308792 ns | 2362542 ns | 0.98 |
| array/accumulate/Int64/dims=1L | 12566812.5 ns | 12465354 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 9999062.5 ns | 10148791 ns | 0.99 |
| array/accumulate/Float32/1d | 1185270.5 ns | 1224854 ns | 0.97 |
| array/accumulate/Float32/dims=1 | 1690354.5 ns | 1703208 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 2100500 ns | 2167708 ns | 0.97 |
| array/accumulate/Float32/dims=1L | 10556291.5 ns | 10650916.5 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 7568979.5 ns | 7679250 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 1271959 ns | 1263833.5 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 1162583 ns | 1165042 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 1302646 ns | 1280125 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1L | 2168667 ns | 20815958.5 ns | 0.10 |
| array/reductions/reduce/Int64/dims=2L | 3589625 ns | 3593875 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 806083 ns | 798792 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 833792 ns | 852417 ns | 0.98 |
| array/reductions/reduce/Float32/dims=2 | 747125 ns | 759000 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1L | 1864958.5 ns | 7077896 ns | 0.26 |
| array/reductions/reduce/Float32/dims=2L | 1891104.5 ns | 1897354.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 1285667 ns | 1295374.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1171709 ns | 1157625 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2 | 1291125 ns | 1294375 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 2211625 ns | 20902334 ns | 0.11 |
| array/reductions/mapreduce/Int64/dims=2L | 3597875 ns | 3606709 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 790917 ns | 818041.5 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 834333.5 ns | 852604.5 ns | 0.98 |
| array/reductions/mapreduce/Float32/dims=2 | 747625 ns | 749937.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 1823917 ns | 7111292 ns | 0.26 |
| array/reductions/mapreduce/Float32/dims=2L | 1913167 ns | 1908333.5 ns | 1.00 |
| array/private/copyto!/gpu_to_gpu | 582833 ns | 590083.5 ns | 0.99 |
| array/private/copyto!/cpu_to_gpu | 728708 ns | 711041 ns | 1.02 |
| array/private/copyto!/gpu_to_cpu | 750708 ns | 770062.5 ns | 0.97 |
| array/private/iteration/findall/int | 1626750 ns | 1686125 ns | 0.96 |
| array/private/iteration/findall/bool | 1558291.5 ns | 1580166 ns | 0.99 |
| array/private/iteration/findfirst/int | 1836041 ns | 1852416 ns | 0.99 |
| array/private/iteration/findfirst/bool | 1727083 ns | 1757666.5 ns | 0.98 |
| array/private/iteration/scalar | 3240167 ns | 3010667 ns | 1.08 |
| array/private/iteration/logical | 2693625 ns | 2785479.5 ns | 0.97 |
| array/private/iteration/findmin/1d | 1804708 ns | 1798000 ns | 1.00 |
| array/private/iteration/findmin/2d | 1533750 ns | 1541250 ns | 1.00 |
| array/private/copy | 897209 ns | 845208 ns | 1.06 |
| array/shared/copyto!/gpu_to_gpu | 78917 ns | 82750 ns | 0.95 |
| array/shared/copyto!/cpu_to_gpu | 81000 ns | 83417 ns | 0.97 |
| array/shared/copyto!/gpu_to_cpu | 79000 ns | 79208 ns | 1.00 |
| array/shared/iteration/findall/int | 1631542 ns | 1688833 ns | 0.97 |
| array/shared/iteration/findall/bool | 1577458.5 ns | 1594000 ns | 0.99 |
| array/shared/iteration/findfirst/int | 1486229 ns | 1482792 ns | 1.00 |
| array/shared/iteration/findfirst/bool | 1407083 ns | 1418770.5 ns | 0.99 |
| array/shared/iteration/scalar | 156708 ns | 162375 ns | 0.97 |
| array/shared/iteration/logical | 2492563 ns | 2631645.5 ns | 0.95 |
| array/shared/iteration/findmin/1d | 1443541.5 ns | 1454209 ns | 0.99 |
| array/shared/iteration/findmin/2d | 1545937.5 ns | 1555083 ns | 0.99 |
| array/shared/copy | 205000 ns | 215125 ns | 0.95 |
| array/permutedims/4d | 2599791 ns | 2652167 ns | 0.98 |
| array/permutedims/2d | 1275166 ns | 1302500 ns | 0.98 |
| array/permutedims/3d | 1947125 ns | 1987209 ns | 0.98 |
| metal/synchronization/stream | 14875 ns | 15041 ns | 0.99 |
| metal/synchronization/context | 15500 ns | 15667 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #627      +/-   ##
==========================================
- Coverage   80.92%   80.61%   -0.32%
==========================================
  Files          61       61
  Lines        2705     2713       +8
==========================================
- Hits         2189     2187       -2
- Misses        516      526      +10
```

☔ View full report in Codecov by Sentry.
Force-pushed from e16f284 to 186685d.
maleadt left a comment:
Conceptually LGTM, but there seem to be a bunch of unrelated changes in here?
Force-pushed from 392c2b7 to 5c615d1.
I was debugging the hanging benchmark CI, which I think is caused by the signposts PR making the "construct" benchmark run for many more iterations. That means much more memory usage, which I believe sometimes crosses the line on the 8 GB of RAM of the CI Mac minis. I set `evals=1` on that benchmark (see the sketch below).
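As a rough illustration of what `evals=1` changes (the suite name and array type follow the diff above; everything else here is an illustrative assumption, not project code):

```julia
# Sketch: with evals=1, every sample performs exactly one evaluation, so the
# total number of tiny MtlArray allocations is bounded by the sample count
# (default 10000) instead of samples * evals as picked by BenchmarkTools' tuning.
using Metal, BenchmarkTools

group = BenchmarkGroup()
group["construct"] = @benchmarkable MtlArray{Int,1}(undef, 1) evals=1
run(group["construct"])
```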
It's only an 8-byte array being allocated, so I cannot imagine that causing an OOM?
Maybe the GC isn't being triggered often enough to keep up with all the allocations from the ObjectiveC.jl machinery? Also, maybe a full 16 KiB page is reserved for every buffer allocation? Just running the construct benchmark locally with and without signposts:

```julia
# No signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.650043 seconds (23.76 M allocations: 714.837 MiB, 4.81% gc time, 7.83% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  3.994 μs … 16.820 ms  ┊ GC (min … max): 0.00% … 8.10%
 Time  (median):     5.327 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.034 μs ± 168.149 μs ┊ GC (mean ± σ):  1.94% ± 0.08%

        ▁▁▂▂▃▃▃▄▇▆██▆▇▇▇▆▇█▅▆▅▅▄▃▄▂▂ ▁
  ▁▁▂▂▃▃▄▅▄▅▆▇▇██████████████████████████████▇▇▇▆▆▄▃▃▃▂▂▂▂▂▂▂ ▅
  3.99 μs        Histogram: frequency by time        6.77 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

# With signposts
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
 14.749792 seconds (7.28 M allocations: 256.324 MiB, 1.21% gc time, 2.56% compilation time)
BenchmarkTools.Trial: 7284 samples with 7 evaluations per sample.
 Range (min … max):  4.661 μs … 484.225 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):     15.619 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.726 μs ± 5.705 ms  ┊ GC (mean ± σ):  0.32% ± 0.01%

       ▃▇██▇▄▄▁
  ▃▃▃▃▃▄▄▃▄▆█████████▇▆▅▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▂▂ ▃
  4.66 μs        Histogram: frequency by time        49.1 μs <

 Memory estimate: 1.22 KiB, allocs estimate: 39.
```
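One way to poke at the "GC isn't keeping up" theory (purely an illustrative sketch; the loop count and GC cadence are arbitrary assumptions, not part of this PR):

```julia
# Hypothetical experiment: allocate many tiny MtlArrays and watch the
# process's resident memory with and without the periodic collections below.
# If memory only stays flat when GC.gc is forced, the CI hang is plausibly
# the GC falling behind rather than the allocations themselves.
using Metal

for i in 1:100_000
    MtlArray{Int,1}(undef, 1)            # same allocation the benchmark performs
    i % 10_000 == 0 && GC.gc(false)      # incremental collection every 10k iterations
end
```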
BenchmarkTools.jl should force-run the GC after every sample (that's what `gcsample=true` is for).

That's still only 1 GiB:

```julia
julia> Base.format_bytes(10000*7*16*1024) # samples * evals * 16 KiB
"1.068 GiB"
```
Force-pushed from fd0fbb9 to 8e2ddc3.
I tried that, but it would invalidate the old benchmark data, and while I'm not opposed to resetting the historical data for that test, here is what it looks like with `gcsample=true`:

```julia
julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1)
  4.656013 seconds (21.47 M allocations: 608.718 MiB, 5.34% gc time, 0.34% compilation time)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.095 μs … 20.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.518 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.593 μs ± 773.620 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▂▃▂▅▄▇▅▇▇▅█▆▇▅▇▆▄▆▄▄▂▂▁
  ▁▁▂▃▃▄▅▆▆█████████████████████████████▇█▆▅▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁ ▅
  4.1 μs        Histogram: frequency by time        7.76 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.

julia> using Metal, BenchmarkTools; MtlArray{Int,1}(undef, 1); @time @benchmark MtlArray{Int,1}(undef, 1) gcsample=true
 10.590237 seconds (2.11 M allocations: 105.136 MiB, 96.13% gc time, 3.44% compilation time)
BenchmarkTools.Trial: 78 samples with 6 evaluations per sample.
 Range (min … max):  10.639 μs … 23.403 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.253 μs ±  2.559 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅   ▅  ▂ ▅    ▂▅  ▂▂      █▂
  ▅▁▁▁█████▁▅▅█▅▅▁▅█▅█▁███▁████████▅██▁▁█▅▁▅▅▁█▁▁█▁▁▁▁▁▁▁▁▁▅▅ ▁
  10.6 μs        Histogram: frequency by time        21.4 μs <

 Memory estimate: 1.02 KiB, allocs estimate: 33.
```

I would like to investigate more deeply why this is happening when I have the time, but I feel like this is probably good enough for now to get the benchmarks not hanging anymore. It wouldn't surprise me if it ended up being a missing autoreleasepool or something. I've cleaned up the commit history because I don't think we should squash-merge this PR.
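If someone does pick this up later, a minimal sketch of how one might test the missing-autoreleasepool theory (this assumes ObjectiveC.jl's `@autoreleasepool` macro and is illustrative only, not code from this PR):

```julia
# Hypothetical experiment: if each tiny MtlArray constructor leaves
# autoreleased Objective-C objects behind, draining a pool around the
# allocation should noticeably lower resident memory compared to running
# the same loop without the pool.
using Metal, ObjectiveC

for _ in 1:10_000
    ObjectiveC.@autoreleasepool begin
        MtlArray{Int,1}(undef, 1)
    end
end
```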
Oof, that shouldn't happen. In any case, it doesn't seem worth your time to investigate this detail, so let's go ahead with `evals=1`.
Seems to solve some hangs on the CI runners
Force-pushed from 8e2ddc3 to e01d410.
Depends on #626.

This is so that benchmarking against AK implementations doesn't show a big discrepancy on CUDA vs. on Metal.
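For context, a common form of the mapreduce optimization being referenced is serial per-thread accumulation before the parallel tree reduction, so large inputs need far fewer threadgroups. A hedged sketch of that idea in plain Julia (illustrative only, not this PR's actual kernel; `mapreduce_sketch` and its parameters are made up for this example):

```julia
# Each "thread" first accumulates a strided slice of the input serially,
# producing one partial result per thread; a final reduction then combines
# the partials. On a GPU the same idea keeps the number of launched
# threadgroups small even for very large reduction dimensions.
function mapreduce_sketch(f, op, neutral, input; nthreads = 256)
    partials = fill(neutral, nthreads)
    for t in 1:nthreads
        acc = neutral
        i = t
        while i <= length(input)        # grid-stride style loop
            acc = op(acc, f(input[i]))
            i += nthreads
        end
        partials[t] = acc
    end
    return reduce(op, partials; init = neutral)
end

mapreduce_sketch(abs2, +, 0, collect(1:10_000)) == mapreduce(abs2, +, 1:10_000)  # true
```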