Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matchmissing == :notequal #2724

Merged
merged 12 commits into from
Jun 3, 2021
Merged

Conversation

pstorozenko
Copy link
Contributor

Not documented, available for testing purpose.

As discussed in #2650 I create PR with my changes.

src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

So things to do to improve the performance are the following (this is a general comment without running of the code):

  1. in completecases(df::AbstractDataFrame, col::Colon=:) process only columns that potentially allow missing values
  2. in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing
  3. in SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols) first check if rows creates one continuous block of rows; if yes then instead of findall use a range to avoid allocations (actually probably a custom findall should be written to do it in one shot)

(these changes should probably go in separate PRs to make sure we can pin-point the performance changes introduced by them)

Regarding memory profiling - I will run my test and report here later.

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

Regarding the performance of view vs copy the offending line is:

parent(sdf)[rows(sdf)[rowinds], parentcols(index(sdf), colinds)]

When doing getindex in a SubDataFrame we allocate one temporary vector (both for left and right) that has the size comparable to a data frame. So we do "as if" we had 2 extra columns.

Therefore a tentative conclusion is:

  1. in the dropmissing even if we use view we should check if actually any rows were dropped and if not pass through the original DataFrame (if it were a DataFrame already) - this will save some time.
  2. we probably should branch conditional on the number of columns in a data frame (the best cut-off should be identified by benchmarking)
    • if number of columns is low use a copy
    • if number of columns is high use a view (as otherwise it is extremely easy to "Kill" the process due to out-of-memory)
  3. an even more advanced optimization would be to split line
    dfl = joiner.dfl[left_ixs, :]
    into two parts: left_noon and left_on (as left_on is always materialized) - but I am not sure if it is worth the effort

Also maybe we should just always use a view if benchmarking shows that the speed differences are not very big, as joins are very memory hungry, so in this case it is better not to do too much temporary allocations.

In summary:

  1. in the first round I would go for the view option - even if we know it is slower - it is safer (or a mix of copy and view with low threshold for column count when we switch to view)
  2. then in consecutive PRs work on improving the performance (and my experience is that these things are usually a deep hole where one optimization leads to further requests for changes, but this is how it works :))

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

Just to be clear on my position - I have "Killed" my Julia session several times when benchmarking with "copy" option, while "view" option worked each time.

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

@pstorozenko - please let me know on which things listed here you would be willing to work, so that I can plan better the development. Thank you!

Also in general after 1.0 release of DataFrames.jl there are four streams of potential things that maybe you would be interested in to contribute after we are done with this PR/series of PRs:

  • performance (things like discussed in this PR + improving multi threading support); here - as commented - PRs should be small and focus on a single thing that is improved;
  • adding new functionality (here the key discussions are usually API design considerations);
  • documentation improvements;
  • test coverage improvement.

@pstorozenko
Copy link
Contributor Author

I'd propose the following:

  1. in completecases(df::AbstractDataFrame, col::Colon=:) process only columns that potentially allow missing values
  2. in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing

I'll do PRs for those two.
Is there a nicer way of getting a vector of missingable columns than Missing .<: eltype.(eachcol(df))?

  1. in SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols) first check if rows creates one continuous block of rows; if yes then instead of findall use a range to avoid allocations (actually probably a custom findall should be written to do it in one shot)

Do you have any suggestions on naming a custom findall function and it's full signature?
I'll to PR then.

Therefore a tentative conclusion is: ...

In this PR I'll do some quick benchmarks on number of columns and then write solution with mixed view / copy approach for :notequal as well as write documentation and tests.

an even more advanced optimization would be to split line

I think, we can benchmark it later, but I'll not do it now.


Also in general after 1.0 release of DataFrames.jl there are four streams of potential things that maybe you would be interested in to contribute after we are done with this PR/series of PRs:

Thanks a lot! We will see later.

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

Missing .<: eltype.(eachcol(df))

is OK, but do not do this this way, as it will add compilation latency without any benefit. better just iterate columns of df and compare their eltype to Missing.

Do you have any suggestions on naming a custom findall function and it's full signature?

maybe just _findall(v::AbstractVector{Bool}))

In this PR I'll do some quick benchmarks on number of columns and then write solution with mixed view / copy approach

OK, just please keep in mind that I prefer a solution that is not memory hungry in general. Also note when benchmarking that your original benchmark was kind of "worst case" as the join produced many more rows than the input tables because of the way how you generated data. This is not a typical scenario (normally one expects no dupilcates or at most few duplicates).

I think, we can benchmark it later, but I'll not do it now.

Yes - this is something that is probably not going to give much benefit. The point is to avoid materializing left_on twice (as we materialize it anyway earlier)

Thank you!

@pstorozenko
Copy link
Contributor Author

  1. in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing

With #2726 optimization to completecases I don't see a point in writing additional ifs in dropmissing as it will only obscure code and not result in any benefits, don't you think?

@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

it will only obscure code and not result in any benefits, don't you think?

The corner case is when there are no missings. In this case you probably want to avoid allocations and just make a view with : as row selector.


Also as a general comment - all my proposals are speculative, i.e. I have not implemented and benchmarked them. We will make these changes if we can see the benefit.

@bkamins
Copy link
Member

bkamins commented May 16, 2021

Given #2727 is merged could you please review what should be done in this PR? Thank you!

@pstorozenko
Copy link
Contributor Author

Yes, I'll look on it today.

@pstorozenko
Copy link
Contributor Author

I've tested versions in a more realistic scenario as you suggested.
I took Posts and Votes from stackoverflow dataset, missified p percent of values in merge column and innerjoined frames.

Version with view is faster here and allocates less memory, so let's leave this version.

posts = GZip.open("match/Posts.csv.gz") do f
    CSV.read(f, DataFrame)
end

votes = GZip.open("match/Votes.csv.gz") do f
    CSV.read(f, DataFrame)
end


function testp(posts, votes, p)
    Nv = nrow(votes)
    v2 = copy(votes)
    allowmissing!(v2, :PostId)
    iv = sample(1:Nv, trunc(Int, Nv * p), replace=false)
    v2[iv, :PostId] .= missing

    Np = nrow(posts)
    p2 = copy(posts)
    allowmissing!(p2, :Id)
    ip = sample(1:Np, trunc(Int, Np * p), replace=false)
    p2[ip, :Id] .= missing

    @btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_view);
    @btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_copy);
end

for p in [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.75]
    @show p
    testp(posts, votes, p);
end
p = 0.0
  86.736 ms (623 allocations: 116.23 MiB)
  96.405 ms (720 allocations: 145.00 MiB)
p = 0.001
  88.957 ms (625 allocations: 120.90 MiB)
  103.818 ms (701 allocations: 149.64 MiB)
p = 0.002
  93.594 ms (625 allocations: 120.68 MiB)
  119.779 ms (701 allocations: 149.40 MiB)
p = 0.005
  105.603 ms (625 allocations: 120.00 MiB)
  116.058 ms (701 allocations: 148.66 MiB)
p = 0.01
  103.518 ms (625 allocations: 118.92 MiB)
  119.578 ms (701 allocations: 147.48 MiB)
p = 0.02
  103.239 ms (625 allocations: 116.80 MiB)
  115.631 ms (701 allocations: 145.15 MiB)
p = 0.05
  95.703 ms (625 allocations: 110.15 MiB)
  111.279 ms (701 allocations: 137.88 MiB)
p = 0.1
  87.453 ms (625 allocations: 102.36 MiB)
  102.669 ms (701 allocations: 128.97 MiB)
p = 0.2
  71.221 ms (625 allocations: 82.82 MiB)
  83.093 ms (701 allocations: 107.16 MiB)
p = 0.5
  39.548 ms (628 allocations: 36.54 MiB)
  48.013 ms (704 allocations: 52.86 MiB)
p = 0.75
  12.758 ms (626 allocations: 11.03 MiB)
  17.022 ms (702 allocations: 19.71 MiB)

@bkamins
Copy link
Member

bkamins commented May 30, 2021

Thank you for checking this. Given your past PRs you probably know what needs to be done (please let me know if you are willing to do this):

  1. merge main into the PR (to make sure we are on the latest version)
  2. finalize the code
  3. write comprehensive tests with full coverage of corner cases
  4. add an appropriate NEWS.md entry
  5. add appropriate updates to docstrings and to the manual
  6. then go through the reviews 😄

Thank you!

@pstorozenko
Copy link
Contributor Author

Sure thing!
I rebased my code onto main, should I rather merge?

@bkamins
Copy link
Member

bkamins commented May 30, 2021

Well - I used to tell people to rebase, squash to one commit and force push the changes (this is what I normally do if I push a rewrite; merging is preferable if there are significant comments to the implementation pending in the old code, but in this case we do not have such, as we have discussed the design already, so I need to anyway just review the final implementation).

However, most contributors had problems with following this workflow so I started suggesting merging, which is usually simpler to handle properly in git 😄.

In short - if you rebased this is preferable.

Not documented, available for testing purpose
We stay with view
Some tests added
NEWS updated
src/join/composer.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
pstorozenko and others added 3 commits May 31, 2021 19:27
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it it is not a draft any more :).

@nalimilan - the PR is ready for you to have a look at. Thank you!

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few minor points.

src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
@nalimilan nalimilan marked this pull request as ready for review June 1, 2021 15:14
@bkamins
Copy link
Member

bkamins commented Jun 1, 2021

@nalimilan - thank you for the review. You always have a keen eye for the details.

also some minor docs fixes
Remove types where not needed
Remove spaces around =
Align lines better
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
pstorozenko and others added 3 commits June 2, 2021 09:26
copycols=true->false

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@pstorozenko
Copy link
Contributor Author

Is there a way of telling codecov that some lines should not be reached?

src/join/composer.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member

bkamins commented Jun 2, 2021

Is there a way of telling codecov that some lines should not be reached?

No. If you have unreachable code (I have not looked at your commits yet) and want to make sure it is not reached throw an error there. See e.g.

error("unreachable reached")

@bkamins
Copy link
Member

bkamins commented Jun 2, 2021

The PR looks good to me (if @nalimilan is OK with the code formatting, especially in the test section)

@pstorozenko
Copy link
Contributor Author

Thank you for reviews!

@bkamins bkamins merged commit 5d8e52b into JuliaData:main Jun 3, 2021
@bkamins
Copy link
Member

bkamins commented Jun 3, 2021

Thank you!

@pstorozenko pstorozenko deleted the ps/match_notequal branch June 3, 2021 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants