Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit loop in _findall to avoid allocations #2771

Merged
merged 4 commits into from
May 25, 2021

Conversation

pstorozenko
Copy link
Contributor

While testing #2724 and looking at #2769 benchmarks I've seen that sometimes _findall allocates a lot and has some weird overhead in memory. After digging in, changing
I[i:i+64-1] .= i1:(i1+64-1)
into

for j in 0:64-1
    I[i + j] = i1 + j 
end

removed unnecessary allocations, and sped up execution a bit.
I'll post benchmarks tommorow.

@bkamins
Copy link
Member

bkamins commented May 23, 2021

This is strange. @mbauman - broadcasting in these cases should be fast - right? The only potential benefit should be that we avoid extra checking of dimensions of the broadcasted objects, but no allocations should be made I think.

Still we might want to remove it to reduce compilation latency, but I would like to understand the reason of the problem with broadcasting.

@bkamins bkamins added this to the patch milestone May 23, 2021
@pstorozenko
Copy link
Contributor Author

As problem occurred when dealing with trues blocks, here are benchmarks that we see difference in:

BD = OrderedDict(
  "TFT small" => [trues(85); falses(100); trues(85)],
  "FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
  "TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
  "TFT small2" => [trues(64); falses(10); trues(100)],

  "TFT Big" => [trues(8500); falses(100000); trues(65000)],
  "FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],

  "FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
  "RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
  "FTFR Big" => [falses(65000);  trues(65000);  falses(65000); rand([true, false], 20000)],
  "T256 R100" => [trues(256);  rand([true, false], 100)],
);

for (l, B) in BD
  println("Processing $l")
  @show Base.findall(B) == DataFrames._findall(B)
  @btime Base.findall($B)
  @btime DataFrames._findall($B)
end

This PR:

Processing TFT small
Base.findall(B) == DataFrames._findall(B) = true
  260.224 ns (1 allocation: 1.45 KiB)
  128.176 ns (1 allocation: 1.45 KiB)
Processing FTFFT small
Base.findall(B) == DataFrames._findall(B) = true
  129.116 ns (1 allocation: 624 bytes)
  92.531 ns (1 allocation: 624 bytes)
Processing TFTF small
Base.findall(B) == DataFrames._findall(B) = true
  210.670 ns (1 allocation: 1.14 KiB)
  101.724 ns (1 allocation: 1.14 KiB)
Processing TFT small2
Base.findall(B) == DataFrames._findall(B) = true
  259.309 ns (1 allocation: 1.45 KiB)
  158.026 ns (1 allocation: 1.45 KiB)
Processing TFT Big
Base.findall(B) == DataFrames._findall(B) = true
  94.849 μs (2 allocations: 574.33 KiB)
  70.811 μs (2 allocations: 574.33 KiB)
Processing FTFTFTF Big
Base.findall(B) == DataFrames._findall(B) = true
  251.696 μs (2 allocations: 1.49 MiB)
  217.889 μs (2 allocations: 1.49 MiB)
Processing FTFR small
Base.findall(B) == DataFrames._findall(B) = true
  155.836 ns (1 allocation: 1008 bytes)
  101.939 ns (1 allocation: 1008 bytes)
Processing RT Big
Base.findall(B) == DataFrames._findall(B) = true
  203.891 μs (2 allocations: 1.15 MiB)
  172.991 μs (2 allocations: 1.15 MiB)
Processing FTFR Big
Base.findall(B) == DataFrames._findall(B) = true
  100.210 μs (2 allocations: 585.95 KiB)
  75.266 μs (2 allocations: 585.95 KiB)
Processing T256 R100
Base.findall(B) == DataFrames._findall(B) = true
  699.430 ns (1 allocation: 2.62 KiB)
  480.396 ns (1 allocation: 2.62 KiB)

Main:

Processing TFT small
Base.findall(B) == DataFrames._findall(B) = true
  259.325 ns (1 allocation: 1.45 KiB)
  219.265 ns (5 allocations: 1.61 KiB)
Processing FTFFT small
Base.findall(B) == DataFrames._findall(B) = true
  127.097 ns (1 allocation: 624 bytes)
  138.403 ns (3 allocations: 704 bytes)
Processing TFTF small
Base.findall(B) == DataFrames._findall(B) = true
  211.937 ns (1 allocation: 1.14 KiB)
  194.858 ns (5 allocations: 1.30 KiB)
Processing TFT small2
Base.findall(B) == DataFrames._findall(B) = true
  263.512 ns (1 allocation: 1.45 KiB)
  215.591 ns (3 allocations: 1.53 KiB)
Processing TFT Big
Base.findall(B) == DataFrames._findall(B) = true
  94.840 μs (2 allocations: 574.33 KiB)
  84.893 μs (2032 allocations: 653.62 KiB)
Processing FTFTFTF Big
Base.findall(B) == DataFrames._findall(B) = true
  251.978 μs (2 allocations: 1.49 MiB)
  231.112 μs (4062 allocations: 1.64 MiB)
Processing FTFR small
Base.findall(B) == DataFrames._findall(B) = true
  188.253 ns (1 allocation: 1008 bytes)
  158.367 ns (3 allocations: 1.06 KiB)
Processing RT Big
Base.findall(B) == DataFrames._findall(B) = true
  203.867 μs (2 allocations: 1.15 MiB)
  193.127 μs (3126 allocations: 1.27 MiB)
Processing FTFR Big
Base.findall(B) == DataFrames._findall(B) = true
  100.070 μs (2 allocations: 585.83 KiB)
  76.974 μs (4 allocations: 585.91 KiB)
Processing T256 R100
Base.findall(B) == DataFrames._findall(B) = true
  684.395 ns (1 allocation: 2.56 KiB)
  497.782 ns (3 allocations: 2.64 KiB)

@pstorozenko
Copy link
Contributor Author

It is weird for me as I checked at some point number of allocations with I[i:i+64-1] .= i1:(i1+64-1) and it was zero when working on _findall.
My theory is that range i1:(i1+64-1) is collected here and then later assigned but I might be totally wrong.

@bkamins
Copy link
Member

bkamins commented May 23, 2021

I[i:i+64-1] .= i1:(i1+64-1) should be zero allocations and not collect anything. That is why I have pinged @mbauman.

@bkamins bkamins changed the title Unroll loop in _findall to avoid allocations Explicit loop in _findall to avoid allocations May 25, 2021
@bkamins bkamins merged commit 8ce4d2a into JuliaData:main May 25, 2021
@bkamins
Copy link
Member

bkamins commented May 25, 2021

Thank you. I have reviewed it again and all looks good.

@pstorozenko
Copy link
Contributor Author

Thanks, do we know what is the cause of allocations?

@pstorozenko pstorozenko deleted the ps/_findall_alloc branch May 25, 2021 14:06
@bkamins
Copy link
Member

bkamins commented May 25, 2021

No. It is very strange. If I use Revise.jl and modify the body of the function in an unrelated place then in your old code I go from:

DataFrames._findall(B) = true
  219.265 ns (5 allocations: 1.61 KiB)

to 1 allocation (as in your code with broadcasting replaced by a loop). So it seems that for some reason the compiler gets confused and is not generating optimal code.

@vtjnash - would you be interested in reporting the details? (for our use case it is a minor issue so merged the PR, but maybe for some reason you would find it useful to have reproduced)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants