Matchmissing == :notequal #2724

pstorozenko · 2021-04-17T14:12:31Z

Not documented; available for testing purposes.

As discussed in #2650, I am opening a PR with my changes.

src/join/composer.jl

bkamins · 2021-04-17T14:50:29Z

So the things to do to improve performance are the following (this is a general comment, made without running the code):

- in completecases(df::AbstractDataFrame, col::Colon=:) process only columns that potentially allow missing values
- in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing
- in SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols) first check if rows creates one continuous block of rows; if yes then instead of findall use a range to avoid allocations (actually probably a custom findall should be written to do it in one shot)

(these changes should probably go in separate PRs so that we can pinpoint the performance change each of them introduces)

Regarding memory profiling - I will run my test and report here later.

bkamins · 2021-04-17T15:44:59Z

Regarding the performance of view vs copy the offending line is:

DataFrames.jl/src/subdataframe/subdataframe.jl

Line 171 in a671a3b

parent(sdf)[rows(sdf)[rowinds], parentcols(index(sdf), colinds)]

When calling getindex on a SubDataFrame we allocate one temporary index vector (for both the left and the right table) whose size is comparable to that of a data frame column. So the cost is "as if" we had 2 extra columns.
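The cost described here can be illustrated without DataFrames.jl. The following is a minimal sketch on plain vectors; `parent_col`, `view_rows`, and `getindex_through_view` are hypothetical stand-ins for a parent column, `rows(sdf)`, and the quoted indexing line. It is not the actual implementation.

```julia
# Hypothetical stand-in for indexing into a view of a table, using Base only.
# `view_rows` plays the role of rows(sdf): the parent rows that the view exposes.
function getindex_through_view(parent_col::AbstractVector,
                               view_rows::AbstractVector{Int},
                               rowinds::AbstractVector{Int})
    # Translating view-relative indices into parent indices allocates a
    # temporary Vector{Int} with one entry per requested row.
    parent_rows = view_rows[rowinds]
    # Only now is the requested data copied out of the parent column.
    return parent_col[parent_rows]
end

parent_col = collect(10:10:100)   # 10 parent rows
view_rows = [2, 4, 6, 8, 10]      # rows kept by the view
rowinds = [1, 3, 5]               # rows requested from the view
result = getindex_through_view(parent_col, view_rows, rowinds)
```

In a join, `rowinds` has about as many entries as the result has rows, which is why the temporary behaves like one extra column per side.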

Therefore a tentative conclusion is:

- in dropmissing, even if we use a view, we should check whether any rows were actually dropped and, if not, pass through the original DataFrame (if it was a DataFrame already); this will save some time.
- we should probably branch on the number of columns in the data frame (the best cut-off should be identified by benchmarking):
  - if the number of columns is low, use a copy
  - if the number of columns is high, use a view (as otherwise it is extremely easy to "kill" the process by running out of memory)
- an even more advanced optimization would be to split the line

DataFrames.jl/src/join/composer.jl

Line 111 in a671a3b

dfl = joiner.dfl[left_ixs, :]

into two parts: left_noon and left_on (as left_on is always materialized) - but I am not sure if it is worth the effort
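The copy-or-view branching proposed in this list could look roughly like the sketch below. It assumes a `Matrix` as a stand-in for a data frame, and both `VIEW_COLUMN_THRESHOLD` and `select_rows` are hypothetical names. The value 10 is a placeholder only, since the thread says the real cut-off would have to come from benchmarking.

```julia
# Placeholder cut-off; the discussion leaves the real value to benchmarking.
const VIEW_COLUMN_THRESHOLD = 10

# Hypothetical helper: a Matrix stands in for a data frame here.
function select_rows(table::AbstractMatrix, rows::AbstractVector{Int})
    if size(table, 2) <= VIEW_COLUMN_THRESHOLD
        # Few columns: materializing the selected rows is cheap.
        return table[rows, :]
    else
        # Many columns: avoid copying every column and return a view instead.
        return view(table, rows, :)
    end
end

narrow = select_rows(ones(5, 2), [1, 3])    # copy
wide = select_rows(ones(5, 20), [1, 3])     # view
```

The two branches return different types, so a caller of such a helper would have to accept both a materialized table and a view.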

Also, maybe we should just always use a view if benchmarking shows that the speed differences are not very big: joins are very memory hungry, so it is better not to make too many temporary allocations.

In summary:

- in the first round I would go for the view option: even if we know it is slower, it is safer (or a mix of copy and view, with a low threshold on the column count at which we switch to view)
- then, in subsequent PRs, work on improving the performance (my experience is that these things are usually a deep hole where one optimization leads to further requests for changes, but this is how it works :))

bkamins · 2021-04-17T15:45:50Z

Just to be clear on my position: I have "killed" my Julia session several times when benchmarking with the "copy" option, while the "view" option worked each time.

bkamins · 2021-04-17T15:52:55Z

@pstorozenko - please let me know which of the items listed here you would be willing to work on, so that I can plan the development better. Thank you!

Also, in general, after the 1.0 release of DataFrames.jl there are four streams of work that you might be interested in contributing to once we are done with this PR/series of PRs:

- performance (things like those discussed in this PR, plus improving multi-threading support); here, as commented, PRs should be small and focus on a single thing that is improved;
- adding new functionality (here the key discussions are usually API design considerations);
- documentation improvements;
- test coverage improvement.

pstorozenko · 2021-04-17T17:50:36Z

I'd propose the following:

> in completecases(df::AbstractDataFrame, col::Colon=:) process only columns that potentially allow missing values

> in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing

I'll do PRs for those two.
Is there a nicer way of getting a vector of missingable columns than Missing .<: eltype.(eachcol(df))?

> in SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols) first check if rows creates one continuous block of rows; if yes then instead of findall use a range to avoid allocations (actually probably a custom findall should be written to do it in one shot)

Do you have any suggestions on naming a custom findall function and its full signature?
I'll do a PR then.

> Therefore a tentative conclusion is: ...

In this PR I'll do some quick benchmarks on the number of columns and then write a solution with a mixed view / copy approach for :notequal, as well as write documentation and tests.

> an even more advanced optimization would be to split the line

I think we can benchmark it later, but I won't do it now.

> Also, in general, after the 1.0 release of DataFrames.jl there are four streams of work that you might be interested in contributing to once we are done with this PR/series of PRs:

Thanks a lot! We will see later.

bkamins · 2021-04-17T17:57:48Z

> Missing .<: eltype.(eachcol(df))

is OK, but do not do it this way, as it will add compilation latency without any benefit. Better to just iterate over the columns of df and compare their eltype to Missing.
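A minimal sketch of the loop-based check suggested here, written against a plain tuple of vectors so that it runs with Base only. `missingable_columns` and `complete_rows` are hypothetical helpers, not the DataFrames.jl implementation of completecases.

```julia
# Hypothetical helper: indices of the columns whose eltype admits `missing`.
function missingable_columns(cols)
    idx = Int[]
    for (i, col) in enumerate(cols)
        # A column can hold `missing` only if Missing is a subtype of its eltype.
        Missing <: eltype(col) && push!(idx, i)
    end
    return idx
end

# Complete-case mask that scans only the columns that may contain `missing`.
function complete_rows(cols)
    nrows = isempty(cols) ? 0 : length(first(cols))
    mask = trues(nrows)
    for i in missingable_columns(cols)
        col = cols[i]
        for r in 1:nrows
            mask[r] &= !ismissing(col[r])
        end
    end
    return mask
end

# A tuple keeps each column's own eltype intact.
cols = ([1, 2, 3], [1, missing, 3], ["a", "b", "c"])
candidates = missingable_columns(cols)
mask = complete_rows(cols)
```

Only the second column is scanned here, because the other two have element types that cannot hold `missing`.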

> Do you have any suggestions on naming a custom findall function and its full signature?

maybe just _findall(v::AbstractVector{Bool})
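One way such a function could work is sketched below, assuming 1-based indexing. `contiguous_findall` is a hypothetical name chosen to avoid suggesting that this is the `_findall` that was eventually merged. It returns a range when the `true` entries form a single block and falls back to `findall` otherwise.

```julia
# Hypothetical sketch: single pass over `v`, assuming 1-based indexing.
function contiguous_findall(v::AbstractVector{Bool})
    first_true = 0   # index of the first `true` (0 while none has been seen)
    last_true = 0    # index of the most recent `true`
    for i in 1:length(v)
        if v[i]
            if first_true == 0
                first_true = i
            elseif last_true != i - 1
                # A gap between `true` entries: no single range can describe them.
                return findall(v)
            end
            last_true = i
        end
    end
    # Either no `true` at all (empty range) or exactly one contiguous block.
    return first_true == 0 ? (1:0) : (first_true:last_true)
end

block = contiguous_findall([false, true, true, false])
scattered = contiguous_findall([true, false, true])
```

The return type is a union of a range and a vector, so callers would need to accept either form.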

> In this PR I'll do some quick benchmarks on the number of columns and then write a solution with a mixed view / copy approach

OK, just please keep in mind that I prefer a solution that is not memory hungry in general. Also note when benchmarking that your original benchmark was something of a worst case, as the join produced many more rows than the input tables because of the way you generated the data. This is not a typical scenario (normally one expects no duplicates or at most a few duplicates).

> I think we can benchmark it later, but I won't do it now.

Yes - this is something that is probably not going to give much benefit. The point is to avoid materializing left_on twice (as we materialize it anyway earlier).

Thank you!

pstorozenko · 2021-04-17T18:23:14Z

> in dropmissing and dropmissing! do not invoke completecases if no considered columns potentially allow for missing

With the #2726 optimization to completecases, I don't see a point in writing additional ifs in dropmissing, as it will only obscure code and not result in any benefits, don't you think?

bkamins · 2021-04-17T18:29:35Z

> it will only obscure code and not result in any benefits, don't you think?

The corner case is when there are no missings. In this case you probably want to avoid allocations and just make a view with : as row selector.
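The corner case can be sketched on a single vector, assuming a plain column stands in for a data frame. `nonmissing_view` is a hypothetical helper: when nothing is missing it returns a view with `:` as the row selector and so skips the index vector that `findall` would allocate.

```julia
# Hypothetical helper illustrating the "no missings" shortcut on one column.
function nonmissing_view(col::AbstractVector)
    # Columns whose eltype cannot hold `missing` need no scan at all.
    Missing <: eltype(col) || return view(col, :)
    mask = .!ismissing.(col)
    # Nothing to drop: select all rows with `:` instead of building indices.
    all(mask) && return view(col, :)
    # Otherwise materialize the indices of the rows to keep.
    return view(col, findall(mask))
end

with_gap = nonmissing_view([1, missing, 3])
no_gap = nonmissing_view(Union{Missing, Int}[1, 2, 3])
```

The sketch still allocates a Boolean mask in the scanning branch; only the integer index vector is avoided in the corner case.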

Also as a general comment - all my proposals are speculative, i.e. I have not implemented and benchmarked them. We will make these changes if we can see the benefit.

bkamins · 2021-05-16T17:12:56Z

Given that #2727 is merged, could you please review what should be done in this PR? Thank you!

pstorozenko · 2021-05-16T17:14:22Z

Yes, I'll look at it today.

pstorozenko · 2021-05-30T19:13:02Z

I've tested both versions in a more realistic scenario, as you suggested.
I took Posts and Votes from the Stack Overflow dataset, set a fraction p of the values in the join column to missing, and inner-joined the data frames.

The version with view is faster here and allocates less memory, so let's keep this version.

using CSV, DataFrames, GZip, StatsBase, BenchmarkTools

posts = GZip.open("match/Posts.csv.gz") do f
    CSV.read(f, DataFrame)
end

votes = GZip.open("match/Votes.csv.gz") do f
    CSV.read(f, DataFrame)
end


function testp(posts, votes, p)
    Nv = nrow(votes)
    v2 = copy(votes)
    allowmissing!(v2, :PostId)
    iv = sample(1:Nv, trunc(Int, Nv * p), replace=false)
    v2[iv, :PostId] .= missing

    Np = nrow(posts)
    p2 = copy(posts)
    allowmissing!(p2, :Id)
    ip = sample(1:Np, trunc(Int, Np * p), replace=false)
    p2[ip, :Id] .= missing

    @btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_view);
    @btime innerjoin($p2, $v2, on = :Id => :PostId, makeunique = true, matchmissing = :notequal_copy);
end

for p in [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.75]
    @show p
    testp(posts, votes, p);
end

p = 0.0
  86.736 ms (623 allocations: 116.23 MiB)
  96.405 ms (720 allocations: 145.00 MiB)
p = 0.001
  88.957 ms (625 allocations: 120.90 MiB)
  103.818 ms (701 allocations: 149.64 MiB)
p = 0.002
  93.594 ms (625 allocations: 120.68 MiB)
  119.779 ms (701 allocations: 149.40 MiB)
p = 0.005
  105.603 ms (625 allocations: 120.00 MiB)
  116.058 ms (701 allocations: 148.66 MiB)
p = 0.01
  103.518 ms (625 allocations: 118.92 MiB)
  119.578 ms (701 allocations: 147.48 MiB)
p = 0.02
  103.239 ms (625 allocations: 116.80 MiB)
  115.631 ms (701 allocations: 145.15 MiB)
p = 0.05
  95.703 ms (625 allocations: 110.15 MiB)
  111.279 ms (701 allocations: 137.88 MiB)
p = 0.1
  87.453 ms (625 allocations: 102.36 MiB)
  102.669 ms (701 allocations: 128.97 MiB)
p = 0.2
  71.221 ms (625 allocations: 82.82 MiB)
  83.093 ms (701 allocations: 107.16 MiB)
p = 0.5
  39.548 ms (628 allocations: 36.54 MiB)
  48.013 ms (704 allocations: 52.86 MiB)
p = 0.75
  12.758 ms (626 allocations: 11.03 MiB)
  17.022 ms (702 allocations: 19.71 MiB)
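To make the output above easier to compare, the following sketch computes, for each p, the view variant's time and memory as a percentage of the copy variant's. It assumes that the first line of each pair is the view result and the second the copy result, matching the order of the @btime calls in testp. The numbers are copied from the output, and the variable names are illustrative.

```julia
# Rows: (p, view time in ms, copy time in ms, view MiB, copy MiB),
# copied from the benchmark output above.
results = [
    (0.0,    86.736,  96.405, 116.23, 145.00),
    (0.001,  88.957, 103.818, 120.90, 149.64),
    (0.002,  93.594, 119.779, 120.68, 149.40),
    (0.005, 105.603, 116.058, 120.00, 148.66),
    (0.01,  103.518, 119.578, 118.92, 147.48),
    (0.02,  103.239, 115.631, 116.80, 145.15),
    (0.05,   95.703, 111.279, 110.15, 137.88),
    (0.1,    87.453, 102.669, 102.36, 128.97),
    (0.2,    71.221,  83.093,  82.82, 107.16),
    (0.5,    39.548,  48.013,  36.54,  52.86),
    (0.75,   12.758,  17.022,  11.03,  19.71),
]

for (p, t_view, t_copy, m_view, m_copy) in results
    time_pct = round(100 * t_view / t_copy; digits=1)
    mem_pct = round(100 * m_view / m_copy; digits=1)
    println("p = ", p, ": view uses ", time_pct, "% of the copy time and ",
            mem_pct, "% of the copy memory")
end

# Does the view variant win on both measures for every tested p?
view_always_faster = all(r -> r[2] < r[3], results)
view_always_leaner = all(r -> r[4] < r[5], results)
```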

bkamins · 2021-05-30T20:21:34Z

Thank you for checking this. Given your past PRs you probably know what needs to be done (please let me know if you are willing to do this):

- merge main into the PR (to make sure we are on the latest version)
- finalize the code
- write comprehensive tests with full coverage of corner cases
- add an appropriate NEWS.md entry
- add appropriate updates to docstrings and to the manual
- then go through the reviews 😄

Thank you!

pstorozenko · 2021-05-30T20:25:20Z

Sure thing!
I rebased my code onto main; should I merge instead?

bkamins · 2021-05-30T20:29:21Z

Well - I used to tell people to rebase, squash to one commit and force push the changes (this is what I normally do if I push a rewrite; merging is preferable if there are significant comments to the implementation pending in the old code, but in this case we do not have such, as we have discussed the design already, so I need to anyway just review the final implementation).

However, most contributors had problems with following this workflow so I started suggesting merging, which is usually simpler to handle properly in git 😄.

In short - since you have already rebased, that is preferable.

bkamins

I guess it is not a draft any more :).

@nalimilan - the PR is ready for you to have a look at. Thank you!

nalimilan

Thanks! Just a few minor points.

bkamins · 2021-06-01T17:10:06Z

@nalimilan - thank you for the review. You always have a keen eye for the details.

pstorozenko · 2021-06-02T17:15:53Z

Is there a way of telling codecov that some lines should not be reached?

bkamins · 2021-06-02T22:19:20Z

> Is there a way of telling codecov that some lines should not be reached?

No. If you have unreachable code (I have not looked at your commits yet) and want to make sure it is not reached, throw an error there. See e.g.

DataFrames.jl/src/join/core.jl

Line 562 in 67ccc07

error("unreachable reached")
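As a small illustration of this idiom, here is a sketch with a hypothetical function. Only the `error("unreachable reached")` call is taken from the thread; `sign_label` is invented for the example.

```julia
# Hypothetical function: the three conditions cover every Int, so the final
# branch should never run.
function sign_label(x::Int)
    if x > 0
        return "positive"
    elseif x < 0
        return "negative"
    elseif x == 0
        return "zero"
    else
        # Fail loudly if the assumptions above are ever broken by a later change.
        error("unreachable reached")
    end
end

labels = [sign_label(x) for x in (-2, 0, 5)]
```

The guard line will still be reported as uncovered, in line with the answer above; it documents the assumption and turns a silent logic error into an immediate failure.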

bkamins · 2021-06-02T22:26:56Z

The PR looks good to me (if @nalimilan is OK with the code formatting, especially in the test section)

pstorozenko · 2021-06-03T18:12:19Z

Thank you for the reviews!

bkamins · 2021-06-03T18:53:52Z

Thank you!

pstorozenko mentioned this pull request Apr 17, 2021

Add matchmissing = :notequal option #2650

Closed

bkamins reviewed Apr 17, 2021: six comments on src/join/composer.jl, all now outdated and resolved

pstorozenko mentioned this pull request Apr 17, 2021

Optimize completecases to process only missingable columns #2726

Merged

pstorozenko mentioned this pull request Apr 18, 2021

Run findall(rows) only if rows are not all true #2727

Merged

bkamins added the feature label Apr 18, 2021

bkamins added this to the 1.x milestone Apr 18, 2021

bkamins added the joins label May 16, 2021

pstorozenko mentioned this pull request May 22, 2021

Explicit loop in _findall to avoid allocations #2771

Merged

pstorozenko added 2 commits May 30, 2021 23:30

Matchmissing == :notequal

03721c2

Not documented, available for testing purpose

matchmissing==:notequals

432560f

We stay with view Some tests added NEWS updated

pstorozenko force-pushed the ps/match_notequal branch from 3b85d02 to 432560f Compare May 30, 2021 21:54

bkamins reviewed May 31, 2021: one comment on src/join/composer.jl and two on test/join.jl, all now outdated and resolved

pstorozenko and others added 3 commits May 31, 2021 19:27

Apply suggestions from code review

9afafae

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Update NEWS

8d82161

More tests

94da602

bkamins approved these changes Jun 1, 2021

nalimilan reviewed Jun 1, 2021

nalimilan marked this pull request as ready for review June 1, 2021 15:14

pstorozenko added 2 commits June 1, 2021 23:47

Add copycols=true to selects

7a86c06

also some minor docs fixes

Changes in tests

9ba9d9b

Remove types where not needed Remove spaces around = Align lines better

bkamins reviewed Jun 2, 2021: five comments on src/join/composer.jl, all now outdated and resolved

pstorozenko and others added 3 commits June 2, 2021 09:26

Apply suggestions from code review

1edef1b

copycols=true->false Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Create dfl_on only once

aec82da

bug fix

e24c7f2

nalimilan reviewed Jun 2, 2021

Apply suggestions from code review

c534633

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Jun 3, 2021

bkamins merged commit 5d8e52b into JuliaData:main Jun 3, 2021

pstorozenko deleted the ps/match_notequal branch June 3, 2021 18:54
