-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matchmissing == :notequal #2724
Conversation
So things to do to improve the performance are the following (this is a general comment without running of the code):
(these changes should probably go in separate PRs to make sure we can pin-point the performance changes introduced by them) Regarding memory profiling - I will run my test and report here later. |
Regarding the performance of view vs copy the offending line is:
When doing Therefore a tentative conclusion is:
Also maybe we should just always use a In summary:
|
Just to be clear on my position - I have "Killed" my Julia session several times when benchmarking with "copy" option, while "view" option worked each time. |
@pstorozenko - please let me know on which things listed here you would be willing to work, so that I can plan better the development. Thank you! Also in general after 1.0 release of DataFrames.jl there are four streams of potential things that maybe you would be interested in to contribute after we are done with this PR/series of PRs:
|
I'd propose the following:
I'll do PRs for those two.
Do you have any suggestions on naming a custom
In this PR I'll do some quick benchmarks on number of columns and then write solution with mixed view / copy approach for
I think, we can benchmark it later, but I'll not do it now.
Thanks a lot! We will see later. |
is OK, but do not do this this way, as it will add compilation latency without any benefit. better just iterate columns of
maybe just
OK, just please keep in mind that I prefer a solution that is not memory hungry in general. Also note when benchmarking that your original benchmark was kind of "worst case" as the join produced many more rows than the input tables because of the way how you generated data. This is not a typical scenario (normally one expects no dupilcates or at most few duplicates).
Yes - this is something that is probably not going to give much benefit. The point is to avoid materializing Thank you! |
With #2726 optimization to |
The corner case is when there are no Also as a general comment - all my proposals are speculative, i.e. I have not implemented and benchmarked them. We will make these changes if we can see the benefit. |
Given #2727 is merged could you please review what should be done in this PR? Thank you! |
Yes, I'll look on it today. |
I've tested versions in a more realistic scenario as you suggested. Version with view is faster here and allocates less memory, so let's leave this version. posts = GZip.open("match/Posts.csv.gz") do f
CSV.read(f, DataFrame)
end
votes = GZip.open("match/Votes.csv.gz") do f
CSV.read(f, DataFrame)
end
function testp(posts, votes, p)
Nv = nrow(votes)
v2 = copy(votes)
allowmissing!(v2, :PostId)
iv = sample(1:Nv, trunc(Int, Nv * p), replace=false)
v2[iv, :PostId] .= missing
Np = nrow(posts)
p2 = copy(posts)
allowmissing!(p2, :Id)
ip = sample(1:Np, trunc(Int, Np * p), replace=false)
p2[ip, :Id] .= missing
@btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_view);
@btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_copy);
end
for p in [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.75]
@show p
testp(posts, votes, p);
end
|
Thank you for checking this. Given your past PRs you probably know what needs to be done (please let me know if you are willing to do this):
Thank you! |
Sure thing! |
Well - I used to tell people to rebase, squash to one commit and force push the changes (this is what I normally do if I push a rewrite; merging is preferable if there are significant comments to the implementation pending in the old code, but in this case we do not have such, as we have discussed the design already, so I need to anyway just review the final implementation). However, most contributors had problems with following this workflow so I started suggesting merging, which is usually simpler to handle properly in git 😄. In short - if you rebased this is preferable. |
Not documented, available for testing purpose
We stay with view Some tests added NEWS updated
3b85d02
to
432560f
Compare
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it it is not a draft any more :).
@nalimilan - the PR is ready for you to have a look at. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a few minor points.
@nalimilan - thank you for the review. You always have a keen eye for the details. |
also some minor docs fixes
Remove types where not needed Remove spaces around = Align lines better
copycols=true->false Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Is there a way of telling |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
No. If you have unreachable code (I have not looked at your commits yet) and want to make sure it is not reached throw an error there. See e.g. DataFrames.jl/src/join/core.jl Line 562 in 67ccc07
|
The PR looks good to me (if @nalimilan is OK with the code formatting, especially in the test section) |
Thank you for reviews! |
Thank you! |
Not documented, available for testing purpose.
As discussed in #2650 I create PR with my changes.