Improve Performance of reduce(hcat, ::ColVecs) and reduce(vcat, ::ColVecs) by specialization #510
base: master
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   94.16%   90.45%   -3.71%
==========================================
  Files          52       52
  Lines        1387     1393       +6
==========================================
- Hits         1306     1260      -46
- Misses         81      133      +52

☔ View full report in Codecov by Sentry.
LGTM. Please bump the patch version, then we can merge when CI passes.

I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?
I think similar definitions should be added in this PR for RowVecs for consistency.

Is it clear whether they're happening in areas where the modifications introduced here cause problems, or do they appear completely spurious?
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
They definitely appear completely spurious. The area seems completely unrelated. But I also don't understand the code in that area, to be honest.
If you have

A = [ a11 ... a1n;
      ...
      am1 ... amn ]

then

RowVecs(A) == [ [a11, ..., a1n], ..., [am1, ..., amn] ]

so

reduce(hcat, RowVecs(A)) == [ a11 ... am1;
                              ...
                              a1n ... amn ]   # == A'

But to produce that you have to transpose, so it is not a no-op. That is why I did not add it (I don't know whether it would actually result in a performance benefit). But if you believe this is worth the additional lines of code, I can add these definitions for RowVecs :)
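Something like the following sketch, just to illustrate what I have in mind (untested - the exact form might differ):

# Sketch only: unlike the ColVecs case this has to materialize a transpose,
# so it is one allocation rather than a no-op.
Base.reduce(::typeof(hcat), x::RowVecs) = permutedims(x.X)       # columns of the result are the rows of X
Base.reduce(::typeof(vcat), x::RowVecs) = vec(permutedims(x.X))  # rows of X concatenated into one vector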
It's defined here: https://github.com/JuliaLang/julia/blob/310f59019856749fb85bc56a1e3c2e0592a134ad/base/abstractarray.jl#L1638-L1701
Many stats packages prefer this format though, I assume at least partly due to consistency with R, Python, and dataframes/datasets in general. Often it also doesn't matter what input format users use - internally you can always use the most performant algorithm.
Just from looking at the failing tests and logs, I would assume that the test failures are real: they compare the allocations in the pullbacks of transforms that IMO are likely to involve these reduce calls. This could be another argument for optimizing the RowVecs case as well.
How do you guys do benchmarks locally? If I want to use the local development version of the package, what is the recommended setup?
I often use temporary environments. Changing Project.toml and/or Manifest.toml for benchmarking is not a good idea IMO.
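For example, something along these lines (just one possible workflow; the path is a placeholder for your local checkout):

pkg> activate --temp
pkg> dev /path/to/KernelFunctions.jl    # the branch you want to benchmark
pkg> add BenchmarkTools

julia> using KernelFunctions, BenchmarkTools

julia> D = rand(10, 1000);

julia> @benchmark reduce(hcat, ColVecs($D))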
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.375 μs … 10.470 ms ┊ GC (min … max): 0.00% … 99.78%
Time (median): 18.125 μs ┊ GC (median): 0.00%
Time (mean ± σ): 29.337 μs ± 269.440 μs ┊ GC (mean ± σ): 33.74% ± 3.89%
▁ ▂▄▅▄▆▆▇█▆▆▄▂▁
▃▆███▇▆▆▇████████████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▂▂ ▄
10.4 μs Histogram: frequency by time 38.9 μs <
Memory estimate: 86.12 KiB, allocs estimate: 198.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), ColVecs}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.208 ns … 23.917 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.375 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.388 ns ± 0.593 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
2.21 ns Histogram: frequency by time 2.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 7.073 μs … 1.815 ms ┊ GC (min … max): 0.00% … 95.50%
Time (median): 11.677 μs ┊ GC (median): 0.00%
Time (mean ± σ): 16.796 μs ± 73.342 μs ┊ GC (mean ± σ): 27.95% ± 6.60%
▁ ▄▄▁▂▅█▅▂
▆██▅▄▄██████████▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
7.07 μs Histogram: frequency by time 31.9 μs <
Memory estimate: 57.86 KiB, allocs estimate: 128.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), RowVecs}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 247 evaluations.
Range (min … max): 302.126 ns … 121.268 μs ┊ GC (min … max): 0.00% … 99.45%
Time (median): 679.486 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.117 μs ± 2.958 μs ┊ GC (mean ± σ): 34.72% ± 14.37%
█
▅█▆▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▁▂▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
302 ns Histogram: frequency by time 16.1 μs <
Memory estimate: 4.81 KiB, allocs estimate: 1.

Interestingly, it does help to implement the RowVecs version, but it is still slower than the ColVecs version, which just does nothing.
Well, that's exactly what I expected, so I'm not surprised. (BTW, I would directly benchmark the functions of interest to avoid measuring the overhead of invoke.)
To measure the performance of the fallback I would need to use invoke, though.
You could just run the same benchmark on the master branch and on the PR.
That's also safer insofar as you don't have to reason about which methods are called - maybe there's already some specialization involved that you're not aware of.
So if I understand correctly, here the number of allocations is counted:

kernelmatrix (binary): Test Failed at /home/runner/work/KernelFunctions.jl/KernelFunctions.jl/test/test_utils.jl:387
  Expression: pb[1] == pb[2]
   Evaluated: 504 == 503

and since this PR changes the allocation count on one side, the comparison is now off by exactly one.
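Roughly the pattern, as I understand it (illustrative only - this is not the actual code from test_utils.jl):

using Zygote

# Two pullbacks that are expected to allocate the same amount; in the real test
# they come from different input representations of the same data.
_, pb_a = Zygote.pullback(x -> sum(abs2, x), rand(10))
_, pb_b = Zygote.pullback(x -> sum(abs2, x), rand(10))
pb_a(1.0); pb_b(1.0)   # warm up so compilation is not counted
@allocated(pb_a(1.0)) == @allocated(pb_b(1.0))   # analogous to pb[1] == pb[2]: one saved allocation flips this to false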
new

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.125 ns … 81.958 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.292 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.343 ns ± 0.997 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ █ ▁ ▁
▂▁▁▁▁▁▂▁▁▁▁▁▃▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▂ ▂
2.12 ns Histogram: frequency by time 2.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
Range (min … max): 125.850 ns … 5.302 μs ┊ GC (min … max): 0.00% … 96.04%
Time (median): 142.212 ns ┊ GC (median): 0.00%
Time (mean ± σ): 201.706 ns ± 243.362 ns ┊ GC (mean ± σ): 19.27% ± 14.85%
█▅▃▁▄▄▃ ▁ ▁
██████████▆▅▅▅▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇████ █
126 ns Histogram: log(frequency) by time 1.41 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1.

vs old

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 258 evaluations.
Range (min … max): 296.349 ns … 5.797 μs ┊ GC (min … max): 0.00% … 92.52%
Time (median): 315.244 ns ┊ GC (median): 0.00%
Time (mean ± σ): 333.317 ns ± 142.009 ns ┊ GC (mean ± σ): 3.90% ± 7.70%
█▇▅▁ ▁
████▇██▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ █
296 ns Histogram: log(frequency) by time 1.5 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1.
julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
Range (min … max): 259.233 ns … 5.354 μs ┊ GC (min … max): 0.00% … 92.93%
Time (median): 276.239 ns ┊ GC (median): 0.00%
Time (mean ± σ): 297.589 ns ± 161.529 ns ┊ GC (mean ± σ): 5.91% ± 9.18%
█▅▁ ▁
███▇▆▅▇█▆▅▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▆▆ █
259 ns Histogram: log(frequency) by time 1.58 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1.
Summary
The idea is that reduce(hcat, ColVecs(X)) should in principle be a no-op: the wrapped matrix X already is the horizontal concatenation of its columns.
Proposed changes
Specialization of reduce on typeof(hcat) and typeof(vcat).

The drawback is more code. But I added test coverage and it was something very easy to do 🤷. Whether you want this or not is up to you.
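For reference, the specializations boil down to something like this (sketch of the idea - the code in the diff may differ in details):

# ColVecs(X) is a lazy vector-of-columns view of X, so gluing the columns back
# together with hcat can simply return the wrapped matrix; vcat is a reshape of it.
Base.reduce(::typeof(hcat), x::ColVecs) = x.X
Base.reduce(::typeof(vcat), x::ColVecs) = vec(x.X)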