Matchmissing == :notequal #2724

Merged
merged 12 commits into from
Jun 3, 2021
9 changes: 9 additions & 0 deletions NEWS.md
@@ -1,5 +1,14 @@
# DataFrames.jl changes on main since last release notes

## New functionalities

* add option `matchmissing=:notequal` in joins;
in `leftjoin`, `semijoin` and `antijoin` rows with `missing` in `on` columns
are dropped from the right data frame, but preserved in the left one;
in `rightjoin` such rows are dropped from the left data frame, but preserved
in the right one; in `innerjoin` they are dropped from both data frames;
`outerjoin` throws an error
([#2724](https://github.com/JuliaData/DataFrames.jl/pull/2724))
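
As a rough illustration of the NEWS entry above, here is a minimal sketch of the new option from the caller's side. The two tables are hypothetical and only loosely modelled on the ones in `test/join.jl`; the behaviour noted in the comments is what this PR describes, not an independent guarantee.

```julia
using DataFrames

# Hypothetical data: both tables have a missing value in the `on` column.
name = DataFrame(ID = [1, 2, missing], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [missing, 2, 2, 4], Job = ["Lawyer", "Doctor", "Florist", "Farmer"])

# leftjoin: rows of `job` with a missing ID are dropped, every row of `name` is kept,
# so the row of `name` with a missing ID survives with a missing `Job`.
left = leftjoin(name, job, on = :ID, matchmissing = :notequal)

# innerjoin: rows with a missing ID are dropped from both tables.
inner = innerjoin(name, job, on = :ID, matchmissing = :notequal)

# outerjoin: per this PR the option is rejected with an ArgumentError.
outer_err = try
    outerjoin(name, job, on = :ID, matchmissing = :notequal)
    nothing
catch err
    err
end
```

With `matchmissing = :equal` the two missing IDs would instead be matched to each other, so "Joe Blogs" would be paired with "Lawyer".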

## Bug fixes

* fix bug in how `issorted` handles custom orderings and improve performance
39 changes: 31 additions & 8 deletions src/join/composer.jl
@@ -23,7 +23,8 @@ struct DataFrameJoiner

function DataFrameJoiner(dfl::AbstractDataFrame, dfr::AbstractDataFrame,
on::Union{<:OnType, AbstractVector},
- matchmissing::Symbol)
+ matchmissing::Symbol,
+ kind::Symbol)
on_cols = isa(on, AbstractVector) ? on : [on]
left_on = Symbol[]
right_on = Symbol[]
@@ -55,8 +56,25 @@ struct DataFrameJoiner
"when matchmissing == :error"))
end
end
elseif matchmissing === :notequal
if kind in (:left, :semi, :anti)
dfr = dropmissing(dfr, right_on, view = true)
dfr_on = select(dfr, right_on)
elseif kind === :right
dfl = dropmissing(dfl, left_on, view = true)
dfl_on = select(dfl, left_on)
elseif kind === :inner
dfl = dropmissing(dfl, left_on, view = true)
dfl_on = select(dfl, left_on)
dfr = dropmissing(dfr, right_on, view = true)
dfr_on = select(dfr, right_on)
elseif kind === :outer
throw(ArgumentError("matchmissing == :notequal for `outerjoin` is not allowed"))
else
throw(ArgumentError("matchmissing == :notequal not implemented for kind == $kind"))
end
elseif matchmissing !== :equal
- throw(ArgumentError("matchmissing allows only :error or :equal"))
+ throw(ArgumentError("matchmissing allows only :error, :notequal and :equal"))
end
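
The `kind` dispatch in the branch above can be sketched in isolation. `drop_for_notequal` is a hypothetical helper written for this note, not a function in DataFrames.jl; it only mirrors which side gets filtered and shows that `dropmissing(...; view = true)` returns a `SubDataFrame` over the original table rather than a copy.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): returns the pair of tables
# the join would effectively see under matchmissing = :notequal.
function drop_for_notequal(dfl, dfr, left_on, right_on, kind::Symbol)
    if kind in (:left, :semi, :anti)
        # Only the right table is filtered; the left one is passed through untouched.
        return dfl, dropmissing(dfr, right_on, view = true)
    elseif kind === :right
        return dropmissing(dfl, left_on, view = true), dfr
    elseif kind === :inner
        return dropmissing(dfl, left_on, view = true),
               dropmissing(dfr, right_on, view = true)
    else
        throw(ArgumentError("matchmissing == :notequal is not supported for kind == $kind"))
    end
end

dfl = DataFrame(ID = [1, missing], A = ["a", "b"])
dfr = DataFrame(ID = [missing, 1, 3], B = ["x", "y", "z"])

l, r = drop_for_notequal(dfl, dfr, [:ID], [:ID], :left)
li, ri = drop_for_notequal(dfl, dfr, [:ID], [:ID], :inner)
```

Because the filtered side is a view, `parent(r) === dfr` holds and no column data is copied before the join runs.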

for df in (dfl_on, dfr_on), col in eachcol(df)
@@ -311,7 +329,7 @@ function _join(df1::AbstractDataFrame, df2::AbstractDataFrame;
throw(ArgumentError("Missing join argument 'on'."))
end

- joiner = DataFrameJoiner(df1, df2, on, matchmissing)
+ joiner = DataFrameJoiner(df1, df2, on, matchmissing, kind)

# Check merge key validity
left_invalid = validate[1] ? any(nonunique(joiner.dfl, joiner.left_on)) : false
@@ -485,7 +503,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
- matched (`isequal` is used for comparisons of rows for equality)
+ matched; if equal to `:notequal` then missings are dropped in `df1` and `df2`
+ `on` columns (`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
@@ -626,7 +645,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
- matched (`isequal` is used for comparisons of rows for equality)
+ matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+ (`isequal` is used for comparisons of rows for equality)

All columns of the returned data table will support missing values.

@@ -772,7 +792,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
- matched (`isequal` is used for comparisons of rows for equality)
+ matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns
+ (`isequal` is used for comparisons of rows for equality)

All columns of the returned data table will support missing values.

@@ -1071,7 +1092,8 @@ The order of rows in the result is undefined and may change in the future releas
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
- matched (`isequal` is used for comparisons of rows for equality)
+ matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+ (`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
@@ -1176,7 +1198,8 @@ The order of rows in the result is undefined and may change in the future releas
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
- matched (`isequal` is used for comparisons of rows for equality)
+ matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+ (`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
25 changes: 25 additions & 0 deletions test/join.jl
@@ -32,6 +32,8 @@ anti = left[Bool[ismissing(x) for x in left.Job], [:ID, :Name]]

@test_throws ArgumentError innerjoin(name, job)
@test_throws ArgumentError innerjoin(name, job, on = :ID, matchmissing=:errors)
@test_throws ArgumentError innerjoin(name, job, on = :ID, matchmissing=:weirdmatch)
@test_throws ArgumentError outerjoin(name, job, on = :ID, matchmissing=:notequal)

@test innerjoin(name, job, on = :ID) == inner
@test outerjoin(name, job, on = :ID) ≅ outer
@@ -1557,4 +1559,27 @@ end
c="c", d="d")
end

@testset "matchmissing :notequal correctness" begin
name = DataFrame(ID = Union{Int, Missing}[1, 2, missing],
Name = Union{String, Missing}["John Doe", "Jane Doe", "Joe Blogs"])
noid = DataFrame(ID = Union{Int, Missing}[], Name = String[])
missid = DataFrame(ID = Union{Int, Missing}[missing, missing, missing],
Name = String["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = Union{Int, Missing}[missing, 2, 2, 4],
Job = Union{String, Missing}["Lawyer", "Doctor", "Florist", "Farmer"]);

for df in [name, noid, missid]
@test leftjoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
leftjoin(df, job, on=:ID, matchmissing=:notequal)
@test semijoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
semijoin(df, job, on=:ID, matchmissing=:notequal)
@test antijoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
antijoin(df, job, on=:ID, matchmissing=:notequal)
@test rightjoin(dropmissing(df), job, on=:ID, matchmissing=:equal) ≅
rightjoin(df, job, on=:ID, matchmissing=:notequal)
@test innerjoin(dropmissing(df), dropmissing(job), on=:ID, matchmissing=:equal) ≅
innerjoin(df, job, on=:ID, matchmissing=:notequal)
end
end

end # module
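
The testset above checks `:notequal` against `:equal` applied to pre-filtered tables. A complementary sketch, using hypothetical data shaped like the tables in that testset, contrasts the two options directly for `semijoin` and `antijoin`, where the difference shows up as a change in which left rows are kept:

```julia
using DataFrames

# Hypothetical data shaped like `name` and `job` in the testset.
name = DataFrame(ID = [1, 2, missing], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [missing, 2, 2, 4], Job = ["Lawyer", "Doctor", "Florist", "Farmer"])

# :equal treats the missing IDs as a match, so "Joe Blogs" counts as having a job.
semi_eq = semijoin(name, job, on = :ID, matchmissing = :equal)
anti_eq = antijoin(name, job, on = :ID, matchmissing = :equal)

# :notequal drops the missing-ID row of `job` first, so "Joe Blogs" has no match
# and moves from the semijoin result to the antijoin result.
semi_ne = semijoin(name, job, on = :ID, matchmissing = :notequal)
anti_ne = antijoin(name, job, on = :ID, matchmissing = :notequal)
```

In both cases the semijoin and antijoin results together still cover all three rows of `name`; only the side on which the missing-ID row lands changes.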