Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Performance of reduce(hcat, ::ColVecs) and reduce(vcat, ::ColVecs) by specialization #510

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

FelixBenning
Copy link

Summary

The idea is that

reduce(hcat, ColVecs(X))

should in principle be a no-op.

Proposed changes
Specialization of reduce on typeof(hcat) and typeof(vcat).

The drawback is more code. But I added test coverage and it was something very easy to do 🤷 . Whether you want this or not is up to you.

@codecov
Copy link

codecov bot commented May 24, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -3.71 ⚠️

Comparison is base (ef6d459) 94.16% compared to head (387837f) 90.45%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   94.16%   90.45%   -3.71%     
==========================================
  Files          52       52              
  Lines        1387     1393       +6     
==========================================
- Hits         1306     1260      -46     
- Misses         81      133      +52     
Impacted Files Coverage Δ
src/utils.jl 92.04% <100.00%> (+0.58%) ⬆️

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@willtebbutt willtebbutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please bump the patch version, then we can merge when CI passes

@FelixBenning
Copy link
Author

I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?

Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think similar definitions should be added in this PR for RowVecs for consistency.

src/utils.jl Outdated Show resolved Hide resolved
@willtebbutt
Copy link
Member

I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?

Is it clear whether they're happening in areas where the modifications introduced here cause problems, or do they appear completely spurious?

Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
@FelixBenning
Copy link
Author

they definitely appear completely spurious. The area seems completely unrelated. But I also don't understand the code in that area to be honest

@FelixBenning
Copy link
Author

FelixBenning commented May 24, 2023

I think similar definitions should be added in this PR for RowVecs for consistency.

If you hcat RowVecs you would expect something else

A = [ a11 ... a1n;
      ...
      am1 ... amn]
RowVecs(A) == [ [a11 ... a1n], ... [am1, ... amn] ]

So

reduce(hcat, RowVecs(A)) == [
    a11 ... am1
    ...
    a1n ... 1mn
] # == A'

But if you have to transpose A' you have to move all of the memory around, which is probably already equivalent to the existing reduce(hcat, ::AbstractVector)

That is why I did not add it (because I don't know if that would actually result in a performance benefit. RowVecs definitely go against the grain of julias column major matrices.)

But if you believe this is worth the additional lines of code, I can add these definitions for RowVecs :)

@devmotion
Copy link
Member

But if you have to transpose A' you have to move all of the memory around, which is probably already equivalent to the existing reduce(hcat, ::AbstractVector)

It's defined here: https://github.com/JuliaLang/julia/blob/310f59019856749fb85bc56a1e3c2e0592a134ad/base/abstractarray.jl#L1638-L1701
Benchmarking would be nice but I would suspect that reduce(::typeof(hcat), x::RowVecs) = permutedims(x.X) would be easier for the compiler. Similarly you could implement e.g. reduce(::typeof(vcat), x::RowVecs) = copy(vec(PermutedDimsArray(x.X, (2, 1)))) in a way that does not compute any actual transposes but only uses views.

RowVecs definitely go against the grain of julias column major matrices.

Many stats packages prefer this format though, I assume at least partly due to consistency with R, Python, and dataframes/datasets in general. Often it also doesn't matter what input format users use - internally you can always use the most performant algorithm.

@devmotion
Copy link
Member

they definitely appear completely spurious.

Just from looking at the failing tests and logs, I would assume that the test failures are real: They compare the allocations in the pullbacks of transforms that IMO are likely to involve reduce for arguments of type ColVecs and RowVecs. So optimizing ColVecs but not RowVecs could likely explain the reduced allocation for ColVecs (whereas allocations for RowVecs are not affected).

This could be another argument for optimizing RowVecs as well. But I assume that even optimized RowVecs code might still lead to more allocations than ColVecs.

@FelixBenning
Copy link
Author

How do you guys do benchmarks locally? If I want to use BenchmarkTools I need to add the package (but I don't want to add it to KernelFunctions.jl). If I ] activate test I don't have access to KernelFunctions anymore. ] test works but it seems to do something special... I mean I can add BenchmarkTools I guess and just not commit that change. But that seems contrived

@devmotion
Copy link
Member

I often use temporary environments. Changing Project.toml and/or Manifest.toml for benchmarking is not a good idea IMO.

@FelixBenning
Copy link
Author

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector},  hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  10.375 μs   10.470 ms  ┊ GC (min  max):  0.00%  99.78%
 Time  (median):     18.125 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   29.337 μs ± 269.440 μs  ┊ GC (mean ± σ):  33.74% ±  3.89%

    ▁       ▂▄▅▄▆▆▇█▆▆▄▂▁                                       
  ▃▆███▇▆▆▇████████████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▂▂ ▄
  10.4 μs         Histogram: frequency by time         38.9 μs <

 Memory estimate: 86.12 KiB, allocs estimate: 198.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), ColVecs},  hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  2.208 ns  23.917 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.375 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.388 ns ±  0.593 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   █                          
  ▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  2.21 ns        Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector},  hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min  max):   7.073 μs   1.815 ms  ┊ GC (min  max):  0.00%  95.50%
 Time  (median):     11.677 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.796 μs ± 73.342 μs  ┊ GC (mean ± σ):  27.95% ±  6.60%

   ▁     ▄▄▁▂▅█▅▂                                              
  ▆██▅▄▄██████████▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
  7.07 μs         Histogram: frequency by time        31.9 μs <

 Memory estimate: 57.86 KiB, allocs estimate: 128.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), RowVecs},  hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 247 evaluations.
 Range (min  max):  302.126 ns  121.268 μs  ┊ GC (min  max):  0.00%  99.45%
 Time  (median):     679.486 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.117 μs ±   2.958 μs  ┊ GC (mean ± σ):  34.72% ± 14.37%

   █                                                             
  ▅█▆▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▁▂▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  302 ns           Histogram: frequency by time         16.1 μs <

 Memory estimate: 4.81 KiB, allocs estimate: 1.

Interestingly it does help to implement the RowVersion, but it is still slower than the col version which just does nothing

@devmotion
Copy link
Member

devmotion commented May 24, 2023

Well, that's exactly what I expected, so I'm not surprised. (BTW I would directly benchmark the functions of interest to avoid measuring invoke as well.)

@FelixBenning
Copy link
Author

To measure the performance of the fallback I would need to use invoke no? And to make it fair it should use it everywhere. That was my reasoning at least.

@devmotion
Copy link
Member

You could just run the same benchmark on the master branch and on the PR.

@devmotion
Copy link
Member

That's also safer in so far as you don't have to reason about which methods are called - maybe there's some specialization for AbstractVector{<:Vector} somewhere else that is used on the master branch instead of the method you benchmarked?

@FelixBenning
Copy link
Author

So if I understand correctly here the number of allocations are counted:

      kernelmatrix (binary): Test Failed at /home/runner/work/KernelFunctions.jl/KernelFunctions.jl/test/test_utils.jl:387
  Expression: pb[1] == pb[2]
   Evaluated: 504 == 503

and since ColVecs is more efficient now, it has one allocation less? I am not sure how to fix those tests though to be honest.

@FelixBenning
Copy link
Author

You could just run the same benchmark on the master branch and on the PR.

new

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  2.125 ns  81.958 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.292 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.343 ns ±  0.997 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▃     █      ▁     ▁                     
  ▂▁▁▁▁▁▂▁▁▁▁▁▃▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▂ ▂
  2.12 ns        Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min  max):  125.850 ns    5.302 μs  ┊ GC (min  max):  0.00%  96.04%
 Time  (median):     142.212 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   201.706 ns ± 243.362 ns  ┊ GC (mean ± σ):  19.27% ± 14.85%

  █▅▃▁▄▄▃ ▁                                                     ▁
  ██████████▆▅▅▅▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇████ █
  126 ns        Histogram: log(frequency) by time       1.41 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

vs old

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 258 evaluations.
 Range (min  max):  296.349 ns    5.797 μs  ┊ GC (min  max): 0.00%  92.52%
 Time  (median):     315.244 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   333.317 ns ± 142.009 ns  ┊ GC (mean ± σ):  3.90% ±  7.70%

  █▇▅▁                                                          ▁
  ████▇██▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ █
  296 ns        Histogram: log(frequency) by time        1.5 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
 Range (min  max):  259.233 ns    5.354 μs  ┊ GC (min  max): 0.00%  92.93%
 Time  (median):     276.239 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   297.589 ns ± 161.529 ns  ┊ GC (mean ± σ):  5.91% ±  9.18%

  █▅▁                                                           ▁
  ███▇▆▅▇█▆▅▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▆▆ █
  259 ns        Histogram: log(frequency) by time       1.58 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

src/utils.jl Outdated Show resolved Hide resolved
src/utils.jl Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants