Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize completecases to process only missingable columns #2726

Merged
merged 7 commits into from
May 7, 2021

Conversation

pstorozenko
Copy link
Contributor

Optimization to two completecases methods by only processing missingable columns.
Discussed in #2724

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

Looks good. The comments are minor.
Can you please also post some benchmarks and double check that we properly cover all the possible cases in tests.

@pstorozenko
Copy link
Contributor Author

pstorozenko commented Apr 17, 2021

While benchmarking, I've found a much bigger problem.

res .&= .!ismissing.(df[!, i])

I suspect that this line is not aware of df[!, i] type and .!ismissing.(df[!, i]) is slow.


Few benchamarks:

df = DataFrame(
    a = rand(1000000), b = rand(Int, 1000000),
    c = allowmissing(rand(1000000)), d = repeat([1:9; missing], 100000)
)

main branch timings:

ulia> @btime completecases(df);
  5.570 ms (32 allocations: 139.50 KiB)
julia> @btime completecases(df, :a);
  147.885 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :b);
  147.793 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :c);
  158.253 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :d);
  158.959 μs (9 allocations: 126.52 KiB)

Timings for

res = trues(size(df, 1))
for i in 1:size(df, 2)
    v = df[!, i]
    if Missing <: eltype(v)
      res .&= .!ismissing.(v)
    end
end

and

function completecases(df::AbstractDataFrame, col::ColumnIndex)
    v = df[!, col]
    if Missing <: eltype(v)
        return .!ismissing.(v)
    else
        return trues(size(df, 1))
    end
end
julia> @btime completecases(df);
  3.148 ms (18 allocations: 130.88 KiB)
julia> @btime completecases(df, :a);
  12.735 μs (4 allocations: 122.25 KiB)
julia> @btime completecases(df, :b);
  12.269 μs (4 allocations: 122.25 KiB)
julia> @btime completecases(df, :c);
  158.393 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :d);
  159.116 μs (9 allocations: 126.52 KiB)

Timing for

for i in 1:size(df, 2)
    v = df[!, i]
    if Missing <: eltype(v)
      res .&= aux(v)
    end
end

and

aux(v) = .!ismissing.(v)
julia> @btime completecases(df);
  378.161 μs (16 allocations: 375.22 KiB)

To sum up, optimization with processing only missingable columns brought expected gain in performance.
However, more important is that .!ismissing.(v) is not aware of v type and is slow.
By making a simple barrier we gain a lot in time but some memory is allocated.
Maybe my suspicions are wrong, but there is a problem nevertheless.
Do you have any ideas on how to remove these allocations?

@bkamins
Copy link
Member

bkamins commented Apr 18, 2021

Good catch. However, it is not related to function barrier. The core of the problem is the following I think:

julia> function f(x)
       r = trues(length(x))
       r .&= .!ismissing.(x)
       return r
       end
f (generic function with 1 method)

julia> function g(x)
       r = trues(length(x))
       y = .!ismissing.(x)
       r .&= y
       return r
       end
g (generic function with 1 method)

julia> x = rand([1, missing], 10^6);

julia> @benchmark f($x)
BenchmarkTools.Trial:
  memory estimate:  126.42 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.214 ms (0.00% GC)
  median time:      1.247 ms (0.00% GC)
  mean time:        1.339 ms (0.19% GC)
  maximum time:     5.126 ms (75.65% GC)
  --------------
  samples:          3722
  evals/sample:     1

julia> @benchmark g($x)
BenchmarkTools.Trial:
  memory estimate:  248.66 KiB
  allocs estimate:  7
  --------------
  minimum time:     136.100 μs (0.00% GC)
  median time:      146.400 μs (0.00% GC)
  mean time:        166.911 μs (3.43% GC)
  maximum time:     8.322 ms (97.25% GC)
  --------------
  samples:          10000
  evals/sample:     1

The most likely reason is that we use here BitVector which is not handled optimally in such loops. We could switch to Vecror{Bool} but I think it will be slower, and it is better to use your approach using operation splitting.

However, then it should be better to allocate an aux vector once and and update it in-place (so we allocate both res and aux only once).

@mbauman - the problem with broadcast fusion performance that we have here probably cannot be helped and we have to resolve it manually - right?

`df[!, i]` extracted to variable
 `res .= .!ismissing.(v)` splited for performance into two lines
@pstorozenko
Copy link
Contributor Author

As you suggested I've added and preallocated aux vector.
Same benchmarks as yesterday.
The first is for res .&= .!ismissing.(v), the second for aux .= .!ismissing.(v); res .&= aux.

julia> @btime completecases(df); 
  3.154 ms (18 allocations: 130.88 KiB)
  
julia> @btime completecases(df);
  375.694 μs (18 allocations: 253.00 KiB)

@bkamins
Copy link
Member

bkamins commented Apr 18, 2021

Great. Looks good. Could you please add tests to make sure we cover all possible scenarios properly? Thank you!

@bkamins bkamins added this to the 1.x milestone Apr 18, 2021
@bkamins
Copy link
Member

bkamins commented May 2, 2021

@pstorozenko - do you plan to work on this? (no rush, but usually stalled PRs get outdated and it is hard to get back to them after some time). Thank you!

@pstorozenko
Copy link
Contributor Author

Yes, but you're right, thanks for pinging. I had to think for a while, what needs testing.
I added tests for cases I could think of.

test/data.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented May 2, 2021

Looks good. I have edited the tests a bit.

@pstorozenko
Copy link
Contributor Author

Sure thing, I tried to mimic the design of the rest of tests, but a little refactor is always good.
Thanks!

@bkamins
Copy link
Member

bkamins commented May 2, 2021

I tried to mimic the design of the rest of tests

I noticed, but these tests were written very long time ago and I thought to clean them up a bit.

test/data.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented May 3, 2021

I pushed one additional fix that cleans inference issues.

Comment on lines +779 to +781
res = BitVector(undef, size(df, 1))
res .= .!ismissing.(v)
return res
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just this?

Suggested change
res = BitVector(undef, size(df, 1))
res .= .!ismissing.(v)
return res
return .!ismissing.(v)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it is not type stable. I have just reversed this. The current design (with res) has no performance penalty, but pasess @inferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's surprising. How about just adding ::BitVector in the function definition? That would make it clearer that the goal is to fix inference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you would ask about it :).

The ::BitVector annotation could potentially allocate once more by converting Vector{Bool} to BitVector while what I do should guarantee only that only one allocation happens because broadcasting assignment does copyto! of Broadcasted into target.

Here is an example:

julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)

julia> f2(x::AbstractVector) = (res = BitVector(undef, length(x)); res .= .!ismissing.(x); return x)
f2 (generic function with 1 method)

julia> using SparseArrays

julia> x = sparse(1:10^7);

julia> using BenchmarkTools

julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)

julia> @btime f1($x);
  568.808 ms (7 allocations: 87.02 MiB)

julia> @btime f2($x);
  525.133 ms (4 allocations: 1.20 MiB)

test/data.jl Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented May 5, 2021

@nalimilan - any more comments on this?

@bkamins
Copy link
Member

bkamins commented May 5, 2021

@pstorozenko - please review the PR (as it was updated by me). If you can think of something to change (e.g. some more tests of corner caes) please do add them. Otherwise let me know we are done and I think we are good to merge this.

@bkamins bkamins merged commit bcaa2e5 into JuliaData:main May 7, 2021
@bkamins
Copy link
Member

bkamins commented May 7, 2021

Thank you!

@pstorozenko pstorozenko deleted the ps/ccases_onlymissing branch May 25, 2021 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants