-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize completecases
to process only missingable columns
#2726
Conversation
Looks good. The comments are minor. |
While benchmarking, I've found a much bigger problem.
I suspect that this line is not aware of df[!, i] type and .!ismissing.(df[!, i]) is slow.
Few benchamarks: df = DataFrame(
a = rand(1000000), b = rand(Int, 1000000),
c = allowmissing(rand(1000000)), d = repeat([1:9; missing], 100000)
) main branch timings:
Timings for res = trues(size(df, 1))
for i in 1:size(df, 2)
v = df[!, i]
if Missing <: eltype(v)
res .&= .!ismissing.(v)
end
end and function completecases(df::AbstractDataFrame, col::ColumnIndex)
v = df[!, col]
if Missing <: eltype(v)
return .!ismissing.(v)
else
return trues(size(df, 1))
end
end
Timing for for i in 1:size(df, 2)
v = df[!, i]
if Missing <: eltype(v)
res .&= aux(v)
end
end and aux(v) = .!ismissing.(v)
To sum up, optimization with processing only missingable columns brought expected gain in performance. |
Good catch. However, it is not related to function barrier. The core of the problem is the following I think:
The most likely reason is that we use here However, then it should be better to allocate an @mbauman - the problem with broadcast fusion performance that we have here probably cannot be helped and we have to resolve it manually - right? |
`df[!, i]` extracted to variable `res .= .!ismissing.(v)` splited for performance into two lines
As you suggested I've added and preallocated
|
Great. Looks good. Could you please add tests to make sure we cover all possible scenarios properly? Thank you! |
@pstorozenko - do you plan to work on this? (no rush, but usually stalled PRs get outdated and it is hard to get back to them after some time). Thank you! |
Yes, but you're right, thanks for pinging. I had to think for a while, what needs testing. |
Looks good. I have edited the tests a bit. |
Sure thing, I tried to mimic the design of the rest of tests, but a little refactor is always good. |
I noticed, but these tests were written very long time ago and I thought to clean them up a bit. |
I pushed one additional fix that cleans inference issues. |
res = BitVector(undef, size(df, 1)) | ||
res .= .!ismissing.(v) | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just this?
res = BitVector(undef, size(df, 1)) | |
res .= .!ismissing.(v) | |
return res | |
return .!ismissing.(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it is not type stable. I have just reversed this. The current design (with res
) has no performance penalty, but pasess @inferred
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's surprising. How about just adding ::BitVector
in the function definition? That would make it clearer that the goal is to fix inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you would ask about it :).
The ::BitVector
annotation could potentially allocate once more by converting Vector{Bool}
to BitVector
while what I do should guarantee only that only one allocation happens because broadcasting assignment does copyto!
of Broadcasted
into target.
Here is an example:
julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)
julia> f2(x::AbstractVector) = (res = BitVector(undef, length(x)); res .= .!ismissing.(x); return x)
f2 (generic function with 1 method)
julia> using SparseArrays
julia> x = sparse(1:10^7);
julia> using BenchmarkTools
julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)
julia> @btime f1($x);
568.808 ms (7 allocations: 87.02 MiB)
julia> @btime f2($x);
525.133 ms (4 allocations: 1.20 MiB)
@nalimilan - any more comments on this? |
@pstorozenko - please review the PR (as it was updated by me). If you can think of something to change (e.g. some more tests of corner caes) please do add them. Otherwise let me know we are done and I think we are good to merge this. |
Thank you! |
Optimization to two
completecases
methods by only processing missingable columns.Discussed in #2724