Add matchmissing = :notequal option #2650

Closed
@nilshg

Consider two DataFrames:

julia> df = DataFrame(a = [1, 2, missing, missing])
4×1 DataFrame
 Row │ a       
     │ Int64?  
─────┼─────────
   1 │       1
   2 │       2
   3 │ missing 
   4 │ missing 

julia> df2 = DataFrame(a = [1, 3, missing, missing], b = rand(4))
4×2 DataFrame
 Row │ a        b        
     │ Int64?   Float64  
─────┼───────────────────
   1 │       1  0.459054
   2 │       3  0.649346
   3 │ missing  0.875563
   4 │ missing  0.709856

Currently, to join them I need to set matchmissing = :equal, which produces duplicate rows because every missing key on the left matches every missing key on the right:

julia> leftjoin(df, df2, on = :a, matchmissing = :equal)
6×2 DataFrame
 Row │ a        b
     │ Int64?   Float64?       
─────┼─────────────────────────
   1 │       1        0.459054
   2 │ missing        0.875563
   3 │ missing        0.875563
   4 │ missing        0.709856
   5 │ missing        0.709856
   6 │       2  missing     

I would like an option matchmissing = :ignore (or whatever other name) that preserves the left table exactly and only adds information from the right table where non-missing values match. Currently I think this can be achieved via

julia> leftjoin(df, dropmissing(df2), on = :a, matchmissing = :equal)
4×2 DataFrame
 Row │ a        b
     │ Int64?   Float64?       
─────┼─────────────────────────
   1 │       1        0.459054
   2 │       2  missing        
   3 │ missing  missing        
   4 │ missing  missing       

which is a bit counterintuitive from an API perspective (I need to set matchmissing to :equal even though I don't want to match missings!), and might also be suboptimal from an efficiency perspective, since dropmissing copies the right table before the join.
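For illustration, the workaround above could be wrapped in a small helper. This is only a sketch: `leftjoin_ignoremissing` is a hypothetical name, not part of DataFrames.jl, and it assumes a single key column passed as a `Symbol`. Unlike the call above, it drops only rows whose key is missing (`dropmissing(right, on)`), so right-hand rows with missing values in other columns are kept. The `b` values are fixed here instead of `rand(4)` so the result is reproducible.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): a left join in which
# `missing` keys never match anything.
function leftjoin_ignoremissing(left::AbstractDataFrame,
                                right::AbstractDataFrame; on::Symbol)
    # Remove right-hand rows whose key is missing so they cannot match.
    right_nomissing = dropmissing(right, on)
    # matchmissing = :equal is still needed because `left` may hold missing keys;
    # they simply find no partner in `right_nomissing`.
    return leftjoin(left, right_nomissing; on = on, matchmissing = :equal)
end

df = DataFrame(a = [1, 2, missing, missing])
df2 = DataFrame(a = [1, 3, missing, missing], b = [0.1, 0.2, 0.3, 0.4])

res = leftjoin_ignoremissing(df, df2; on = :a)
```

The result has as many rows as `df`, and only the row with `a == 1` receives a non-missing `b`. The intermediate copy made by `dropmissing` is still there, which is the efficiency concern a built-in option would avoid.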
