Matchmissing == :notequal #2724

Merged 12 commits on Jun 3, 2021
9 changes: 9 additions & 0 deletions NEWS.md
@@ -1,5 +1,14 @@
# DataFrames.jl changes on main since last release notes

+## New functionalities
+
+* add option `matchmissing=:notequal` in joins;
+in `leftjoin`, `semijoin` and `antijoin` `missing`s are dropped in right df,
+but preserved in left; in `rightjoin` `missing`s are dropped in left df,
+but preserved in right df; in `innerjoin` `missing`s are dropped in both dfs;
+in `outerjoin` method errors
+([#2724](https://github.com/JuliaData/DataFrames.jl/pull/2724))
+
## Bug fixes

* fix bug in how `issorted` handles custom orderings and improve performance
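
To make the NEWS entry above concrete, here is a minimal sketch of how the new option behaves on two toy tables. The data and variable names (`people`, `jobs`, `eq`, `ne`, `lj`) are made up for illustration and are not part of the PR; the sketch assumes a DataFrames.jl release that includes this change.

```julia
using DataFrames

# Hypothetical toy data: each table has one row whose key is missing.
people = DataFrame(ID = [1, 2, missing], Name = ["John", "Jane", "Joe"])
jobs = DataFrame(ID = [missing, 2, 2, 4],
                 Job = ["Lawyer", "Doctor", "Florist", "Farmer"])

# :equal treats missing as an ordinary key, so Joe is paired with Lawyer.
eq = innerjoin(people, jobs, on = :ID, matchmissing = :equal)     # 3 rows

# :notequal drops rows with a missing key from both tables before matching.
ne = innerjoin(people, jobs, on = :ID, matchmissing = :notequal)  # 2 rows

# leftjoin keeps the left row with the missing key, but it matches nothing.
lj = leftjoin(people, jobs, on = :ID, matchmissing = :notequal)   # 4 rows

# Per this PR, outerjoin rejects the option with an ArgumentError.
outer_rejected = try
    outerjoin(people, jobs, on = :ID, matchmissing = :notequal)
    false
catch err
    err isa ArgumentError
end
```

The row counts follow from the keys alone: `ID == 2` matches two jobs, and the `missing` keys only pair up under `:equal`.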
39 changes: 31 additions & 8 deletions src/join/composer.jl
@@ -23,7 +23,8 @@ struct DataFrameJoiner

function DataFrameJoiner(dfl::AbstractDataFrame, dfr::AbstractDataFrame,
on::Union{<:OnType, AbstractVector},
-matchmissing::Symbol)
+matchmissing::Symbol,
+kind::Symbol)
on_cols = isa(on, AbstractVector) ? on : [on]
left_on = Symbol[]
right_on = Symbol[]
@@ -55,8 +56,25 @@ struct DataFrameJoiner
"when matchmissing == :error"))
end
end
+elseif matchmissing === :notequal
+if kind in (:left, :semi, :anti)
+dfr = dropmissing(dfr, right_on, view = true)
+dfr_on = select(dfr, right_on)
+elseif kind === :right
+dfl = dropmissing(dfl, left_on, view = true)
+dfl_on = select(dfl, left_on)
+elseif kind === :inner
+dfl = dropmissing(dfl, left_on, view = true)
+dfl_on = select(dfl, left_on)
+dfr = dropmissing(dfr, right_on, view = true)
+dfr_on = select(dfr, right_on)
+elseif kind === :outer
+throw(ArgumentError("matchmissing == :notequal for `outerjoin` is not allowed"))
+else
+throw(ArgumentError("matchmissing == :notequal not implemented for kind == $kind"))
+end
elseif matchmissing !== :equal
-throw(ArgumentError("matchmissing allows only :error or :equal"))
+throw(ArgumentError("matchmissing allows only :error, :notequal and :equal"))
end

for df in (dfl_on, dfr_on), col in eachcol(df)
@@ -311,7 +329,7 @@ function _join(df1::AbstractDataFrame, df2::AbstractDataFrame;
throw(ArgumentError("Missing join argument 'on'."))
end

-joiner = DataFrameJoiner(df1, df2, on, matchmissing)
+joiner = DataFrameJoiner(df1, df2, on, matchmissing, kind)

# Check merge key validity
left_invalid = validate[1] ? any(nonunique(joiner.dfl, joiner.left_on)) : false
@@ -485,7 +503,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-matched (`isequal` is used for comparisons of rows for equality)
+matched; if equal to `:notequal` then missings are dropped in `df1` and `df2`
+`on` columns (`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
@@ -626,7 +645,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-matched (`isequal` is used for comparisons of rows for equality)
+matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+(`isequal` is used for comparisons of rows for equality)

All columns of the returned data table will support missing values.

@@ -772,7 +792,8 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-matched (`isequal` is used for comparisons of rows for equality)
+matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns
+(`isequal` is used for comparisons of rows for equality)

All columns of the returned data table will support missing values.

@@ -1071,7 +1092,8 @@ The order of rows in the result is undefined and may change in the future releas
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-matched (`isequal` is used for comparisons of rows for equality)
+matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+(`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
@@ -1176,7 +1198,8 @@ The order of rows in the result is undefined and may change in the future releas
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-matched (`isequal` is used for comparisons of rows for equality)
+matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns
+(`isequal` is used for comparisons of rows for equality)

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
imaginary part of the number. If you need to perform a join on such values use
25 changes: 25 additions & 0 deletions test/join.jl
@@ -32,6 +32,8 @@ anti = left[Bool[ismissing(x) for x in left.Job], [:ID, :Name]]

@test_throws ArgumentError innerjoin(name, job)
@test_throws ArgumentError innerjoin(name, job, on = :ID, matchmissing=:errors)
+@test_throws ArgumentError innerjoin(name, job, on = :ID, matchmissing=:weirdmatch)
+@test_throws ArgumentError outerjoin(name, job, on = :ID, matchmissing=:notequal)

@test innerjoin(name, job, on = :ID) == inner
@test outerjoin(name, job, on = :ID) ≅ outer
@@ -1557,4 +1559,27 @@ end
c="c", d="d")
end

+@testset "matchmissing :notequal correctness" begin
+name = DataFrame(ID = Union{Int, Missing}[1, 2, missing],
+Name = Union{String, Missing}["John Doe", "Jane Doe", "Joe Blogs"])
+noid = DataFrame(ID = Union{Int, Missing}[], Name = String[])
+missid = DataFrame(ID = Union{Int, Missing}[missing, missing, missing],
+Name = String["John Doe", "Jane Doe", "Joe Blogs"])
+job = DataFrame(ID = Union{Int, Missing}[missing, 2, 2, 4],
+Job = Union{String, Missing}["Lawyer", "Doctor", "Florist", "Farmer"]);
+
+for df in [name, noid, missid]
+@test leftjoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
+leftjoin(df, job, on=:ID, matchmissing=:notequal)
+@test semijoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
+semijoin(df, job, on=:ID, matchmissing=:notequal)
+@test antijoin(df, dropmissing(job), on=:ID, matchmissing=:equal) ≅
+antijoin(df, job, on=:ID, matchmissing=:notequal)
+@test rightjoin(dropmissing(df), job, on=:ID, matchmissing=:equal) ≅
+rightjoin(df, job, on=:ID, matchmissing=:notequal)
+@test innerjoin(dropmissing(df), dropmissing(job), on=:ID, matchmissing=:equal) ≅
+innerjoin(df, job, on=:ID, matchmissing=:notequal)
+end
+end
+
end # module
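
The property this testset relies on can also be checked outside the test suite: a join with `matchmissing=:notequal` should give the same rows as first calling `dropmissing` on the side(s) whose `missing` keys are dropped and then joining with `matchmissing=:equal`. The sketch below is not taken from the PR. The helper `same` and the tables `people` and `jobs` are hypothetical, and the sort is added only because the docstrings describe result row order as undefined.

```julia
using DataFrames

# Hypothetical toy data with a missing key on each side.
people = DataFrame(ID = [1, 2, missing], Name = ["John", "Jane", "Joe"])
jobs = DataFrame(ID = [missing, 2, 2, 4],
                 Job = ["Lawyer", "Doctor", "Florist", "Farmer"])

# Hypothetical helper: compare two results as sets of rows, ignoring row order.
same(a, b) = isequal(sort(a), sort(b))

# leftjoin: only the right table loses its missing-key rows.
left_ok = same(
    leftjoin(people, dropmissing(jobs, :ID), on = :ID, matchmissing = :equal),
    leftjoin(people, jobs, on = :ID, matchmissing = :notequal))

# rightjoin: only the left table loses its missing-key rows.
right_ok = same(
    rightjoin(dropmissing(people, :ID), jobs, on = :ID, matchmissing = :equal),
    rightjoin(people, jobs, on = :ID, matchmissing = :notequal))

# innerjoin: both tables lose their missing-key rows.
inner_ok = same(
    innerjoin(dropmissing(people, :ID), dropmissing(jobs, :ID),
              on = :ID, matchmissing = :equal),
    innerjoin(people, jobs, on = :ID, matchmissing = :notequal))
```

Passing `:ID` to `dropmissing` restricts the filter to the key column, which mirrors the `dropmissing(dfr, right_on, view = true)` calls in `DataFrameJoiner`.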