[BREAKING] fix isagg to correctly use a fast path #2357

Merged · 8 commits · Aug 13, 2020
Changes from 1 commit
78 changes: 64 additions & 14 deletions src/groupeddataframe/splitapplycombine.jl
@@ -1,3 +1,7 @@
# this constant defines which types of values returned by an aggregation function
# in combine are considered to produce multiple columns in the resulting data frame
const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix}

"""
groupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false)

@@ -452,7 +456,7 @@ function combine(p::Pair, gd::GroupedDataFrame;
# verify if it is not better to use a fast path, which we achieve
# by moving to combine(::GroupedDataFrame, ::AbstractVector) method
# note that even if length(gd) == 0 we can do this step
if isagg(p_from => (p_to isa Pair ? first(p_to) : p_to)) || p_from === nrow
if isagg(p_from => (p_to isa Pair ? first(p_to) : p_to), gd) || p_from === nrow
return combine(gd, [p], keepkeys=keepkeys, ungroup=ungroup)
end

@@ -761,16 +765,37 @@ end
Reduce(f, condf=nothing, adjust=nothing) = Reduce(f, condf, adjust, false)

check_aggregate(f::Any) = f
Member:
Could you combine validate_aggregate with check_aggregate? Basically, we have a fallback which returns f, and for some combinations of function and type we return an optimized object.
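
As a rough illustration of the unification suggested here - a sketch only, not necessarily the shape this PR ends up with - the two functions could be merged so that check_aggregate takes the column as a second argument, falls back to returning the function itself, and returns an optimized object only for supported (function, eltype) combinations:

# hypothetical sketch of merging validate_aggregate into check_aggregate:
# the fallback returns the function unchanged, while supported combinations
# of function and element type return an optimized AbstractAggregate
check_aggregate(f::Any, ::AbstractVector) = f
check_aggregate(::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) =
    Reduce(Base.add_sum)
check_aggregate(::typeof(maximum), ::AbstractVector{<:Union{Missing, Real}}) =
    Reduce(max)

# isagg then only needs to inspect what was returned, e.g.
# isagg((col, fun)::Pair, gdf::GroupedDataFrame) =
#     col isa ColumnIndex && check_aggregate(fun, parent(gdf)[!, col]) isa AbstractAggregate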

Member Author:
I agree. And I did it now, but it was much easier to handle them separately when developing changes :)

validate_aggregate(f::Any, col::AbstractVector) = false

check_aggregate(::typeof(sum)) = Reduce(Base.add_sum)
validate_aggregate(::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(sum∘skipmissing)) = Reduce(Base.add_sum, !ismissing)
validate_aggregate(::typeof(sum∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(prod)) = Reduce(Base.mul_prod)
validate_aggregate(::typeof(prod), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(prod∘skipmissing)) = Reduce(Base.mul_prod, !ismissing)
validate_aggregate(::typeof(prod∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(maximum)) = Reduce(max)
validate_aggregate(::typeof(maximum), ::AbstractVector{<:Union{Missing, Real}}) = true
Member:
I think these methods work for any type. We just have a faster path when isconcretetype(S) && hasmethod($initf, Tuple{S}), but we don't require typemin and typemax in general. In particular, we have code for CategoricalArray.

Member Author:
I think these methods work for any type.

I think we need this restriction:

julia> df = DataFrame(g=1, x=[[1,2,3], [1,2,3]])
2×2 DataFrame
│ Row │ g     │ x         │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │
│ 2   │ 1     │ [1, 2, 3] │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (2 rows): g = 1
│ Row │ g     │ x         │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │
│ 2   │ 1     │ [1, 2, 3] │

julia> combine(gdf, :x => maximum)
1×2 DataFrame
│ Row │ g     │ x_maximum │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │

julia> combine(gdf, :x => x -> maximum(x))
3×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 1     │ 2          │
│ 3   │ 1     │ 3          │

but I will make the signature looser.

Member:
Damn, again that multi-column issue... So yeah, the fast path needs to be disabled when the column can contain MULTI_COLS_TYPE or AbstractVector.


check_aggregate(::typeof(maximum∘skipmissing)) = Reduce(max, !ismissing, nothing, true)
validate_aggregate(::typeof(maximum∘skipmissing), ::AbstractVector{<:Union{Missing, Real}}) = true

check_aggregate(::typeof(minimum)) = Reduce(min)
validate_aggregate(::typeof(minimum), ::AbstractVector{<:Union{Missing, Real}}) = true

check_aggregate(::typeof(minimum∘skipmissing)) = Reduce(min, !ismissing, nothing, true)
validate_aggregate(::typeof(minimum∘skipmissing), ::AbstractVector{<:Union{Missing, Real}}) = true

check_aggregate(::typeof(mean)) = Reduce(Base.add_sum, nothing, /)
check_aggregate(::typeof(sum∘skipmissing)) = Reduce(Base.add_sum, !ismissing)
check_aggregate(::typeof(prod∘skipmissing)) = Reduce(Base.mul_prod, !ismissing)
validate_aggregate(::typeof(mean), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(mean∘skipmissing)) = Reduce(Base.add_sum, !ismissing, /)
check_aggregate(::typeof(maximum∘skipmissing)) = Reduce(max, !ismissing, nothing, true)
check_aggregate(::typeof(minimum∘skipmissing)) = Reduce(min, !ismissing, nothing, true)
validate_aggregate(::typeof(mean∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true

# Other aggregate functions which are not strictly reductions
struct Aggregate{F, C} <: AbstractAggregate
@@ -780,14 +805,36 @@ end
Aggregate(f) = Aggregate(f, nothing)

check_aggregate(::typeof(var)) = Aggregate(var)
validate_aggregate(::typeof(var), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(var∘skipmissing)) = Aggregate(var, !ismissing)
validate_aggregate(::typeof(var∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(std)) = Aggregate(std)
validate_aggregate(::typeof(std), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(std∘skipmissing)) = Aggregate(std, !ismissing)
validate_aggregate(::typeof(std∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true

check_aggregate(::typeof(first)) = Aggregate(first)
validate_aggregate(::typeof(first), v::AbstractVector) = eltype(v) === Any ? false : true
validate_aggregate(::typeof(first), ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = false
Member:
Indeed this case is a bit ugly, as code that generates a single column when a data frame stores only scalars will suddenly create multiple columns when it stores e.g. named tuples. Of course this is a more general problem than first and optimized reductions. I wonder whether we should require explicitly mentioning that you want the result to be destructured into multiple columns, e.g. via :x => MultiCol(x -> ...). Without this, we basically consider that only scalars should be stored in data frames, or many things may break.
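
To make the proposal concrete, here is a rough sketch of such an opt-in wrapper (MultiCol is hypothetical and does not exist in DataFrames.jl; the name and semantics are purely illustrative):

# hypothetical opt-in wrapper: only values explicitly wrapped in MultiCol
# would be destructured into multiple columns; bare NamedTuples etc. would
# keep being rejected
struct MultiCol{F}
    f::F
end
(w::MultiCol)(args...) = w.f(args...)

# hypothetical usage: explicitly request two output columns per group
# combine(gdf, :x => MultiCol(x -> (lo=minimum(x), hi=maximum(x))))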

Member Author:
We can indeed discuss this, and it relates to what @pdeffebach wants: allowing destructuring into multiple columns.

Just to be clear the MULTI_COLS_TYPE part is not really problematic in my opinion - it just makes sure we throw an error (as we disallow it now), so in the future we can process it correctly. This is what we try to do consistently everywhere, so that post 1.0 changes here will be non-breaking, as they will currently throw an error (this is your pet trick to handle SemVer AFAICT 😄).

A more problematic case is AbstractVector, as it was simply inconsistent between the slow and fast paths (the fast path was not expanding it, while the slow path does pseudo-broadcasting).

So there are actually two questions:

  • if we want some MultiCol wrapper in the future or just allow multiple cols to be returned and silently accept it (I thought @pdeffebach wanted to silently accept this)
  • do we require opting in for pseudo-broadcasting? We never required it in the past, it was the default, and I think we should keep it.

Although it is a legacy from the very distant past: if I were designing it now, I would not expand anything, in rows or in columns, but just store one result per group in combine and tell users to use flatten to expand it. But this is a completely different design in comparison to what we have now, so this is just a side comment.
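
For illustration, the "store one result per group, then flatten" workflow mentioned above is already expressible with the current API - a sketch assuming DataFrames 0.21-style combine, Ref protection, and flatten:

using DataFrames

df = DataFrame(g=[1, 1, 2, 2], x=[1, 2, 3, 4])
gdf = groupby(df, :g)

# Ref protects the vector-valued result, so :y holds one vector per group
res = combine(gdf, :x => (x -> Ref(x .+ 1)) => :y)

# rows are then expanded explicitly instead of via pseudo-broadcasting
flat = flatten(res, :y)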

Member:
Actually I had forgotten that we don't allow returning MULTI_COLS_TYPE currently. So at least that's safe.

AbstractVector remains problematic though. I wonder whether there are strong use cases for pseudo-broadcasting. It would be safer to make this opt-in for 1.0, otherwise we don't allow working with data frames whose cells contain vectors.

(BTW, instead for MultiCol, maybe we could reuse AsTable, but to wrap the returned value this time.)

Member Author:
I have thought about it when cleaning up the rules of pseudo-broadcasting recently. Here is what we have on 0.21:

julia> df = DataFrame(g=[1,2,3], x=[1:3, 4:6, 7:9])
3×2 DataFrame
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
│ 2   │ 2     │ 4:6      │
│ 3   │ 3     │ 7:9      │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 3 groups based on key: g
First Group (1 row): g = 1
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
⋮
Last Group (1 row): g = 3
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 3     │ 7:9      │

julia> combine(gdf, :x => first) # this is a wrong result and will be fixed by this PR to match what we have below
3×2 DataFrame
│ Row │ g     │ x_first  │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
│ 2   │ 2     │ 4:6      │
│ 3   │ 3     │ 7:9      │

julia> combine(gdf, :x => x -> first(x)) # this is the intended output
9×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 1     │ 2          │
│ 3   │ 1     │ 3          │
│ 4   │ 2     │ 4          │
│ 5   │ 2     │ 5          │
│ 6   │ 2     │ 6          │
│ 7   │ 3     │ 7          │
│ 8   │ 3     │ 8          │
│ 9   │ 3     │ 9          │

julia> combine(gdf, :x => Ref∘first) # this is a currently intended method to protect from unwrapping - just like in standard broadcasting
3×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ UnitRange… │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1:3        │
│ 2   │ 2     │ 4:6        │
│ 3   │ 3     │ 7:9        │

So in short - we allow working with data frames that contain vectors as follows:

  1. normally vectors of vectors will be returned and it is not a problem
  2. if user unwraps the vector (like above - with first) then Ref can be used as in Base to protect the result

Member:
Let's continue this discussion in a separate issue?

Member Author:
OK - can you please open the issue explaining what you would want to change? (given the three key considerations: 1) currently we unwrap vectors, 2) Ref protects, 3) in the future we want to add support for multiple columns passing)


check_aggregate(::typeof(first∘skipmissing)) = Aggregate(first, !ismissing)
validate_aggregate(::typeof(first∘skipmissing), v::AbstractVector) = eltype(v) === Any ? false : true
validate_aggregate(::typeof(first∘skipmissing), ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = false

check_aggregate(::typeof(last)) = Aggregate(last)
validate_aggregate(::typeof(last), v::AbstractVector) = eltype(v) === Any ? false : true
validate_aggregate(::typeof(last), ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = false

check_aggregate(::typeof(last∘skipmissing)) = Aggregate(last, !ismissing)
validate_aggregate(::typeof(last∘skipmissing), v::AbstractVector) = eltype(v) === Any ? false : true
validate_aggregate(::typeof(last∘skipmissing), ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = false

check_aggregate(::typeof(length)) = Aggregate(length)
validate_aggregate(::typeof(length), ::AbstractVector) = true

# SkipMissing does not support length

# Find first value matching condition for each group
@@ -864,7 +911,11 @@ function groupreduce_init(op, condf, adjust,
if isconcretetype(Tnm) && applicable(initf, Tnm)
tmpv = initf(Tnm)
initv = op(tmpv, tmpv)
x = adjust isa Nothing ? initv : adjust(initv, 1)
if adjust isa Nothing
x = Tnm <: AbstractIrrational ? float(initv) : initv
Member:
It's too bad that irrationals don't define zero and one so that zero(x) + zero(x) has the same type as x + x... I don't remember offhand why they return Bool but that may have to do with the fact that irrationals promote to whatever type they are combined with.
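
For concreteness, the inconsistency referred to here (behavior as of Julia 1.x at the time of this discussion; worth re-checking on newer versions):

julia> zero(pi), one(pi)             # both Bool
(false, true)

julia> typeof(zero(pi) + zero(pi))   # Bool + Bool promotes to Int
Int64

julia> typeof(pi + pi)               # while pi itself promotes to Float64
Float64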

Member Author:
I didn't design it 😄, but actually it is inconsistent:

julia> zero(pi) + pi
3.141592653589793

and you get Float64. I have commented in JuliaLang/julia#36978 to keep track of it.

else
x = adjust(initv, 1)
end
if condf === !ismissing
V = typeof(x)
else
@@ -900,7 +951,8 @@ for (op, initf) in ((:max, :typemin), (:min, :typemax))
# It is safe to use a non-missing init value
# since missing will poison the result if present
# we assume here that groups are non-empty (current design assures this)
if isconcretetype(S) && hasmethod($initf, Tuple{S})
# + workaround for https://github.com/JuliaLang/julia/issues/36978
if isconcretetype(S) && hasmethod($initf, Tuple{S}) && !(S <: Irrational)
fill!(outcol, $initf(S))
else
fillfirst!(condf, outcol, incol, gd)
@@ -1038,10 +1090,8 @@ function (agg::Aggregate{typeof(length)})(incol::AbstractVector, gd::GroupedData
end
end

isagg(p::Pair) =
check_aggregate(last(p)) isa AbstractAggregate && first(p) isa ColumnIndex

const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix}
isagg((col, fun)::Pair, gdf::GroupedDataFrame) =
col isa ColumnIndex && validate_aggregate(fun, parent(gdf)[!, col])

function _agg2idx_map_helper(idx, idx_agg)
agg2idx_map = fill(-1, length(idx))
@@ -1101,11 +1151,11 @@ function _combine(f::AbstractVector{<:Pair},
end

idx_agg = nothing
if length(gd) > 0 && any(isagg, f)
if length(gd) > 0 && any(x -> isagg(x, gd), f)
# Compute indices of representative rows only once for all AbstractAggregates
idx_agg = Vector{Int}(undef, length(gd))
fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd)
elseif length(gd) == 0 || !all(isagg, f)
elseif length(gd) == 0 || !all(x -> isagg(x, gd), f)
# Trigger computation of indices
# This can speed up some aggregates that would not trigger this on their own
@assert gd.idx !== nothing
@@ -1114,7 +1164,7 @@
parentdf = parent(gd)
for (i, p) in enumerate(f)
source_cols, fun = p
if length(gd) > 0 && isagg(p)
if length(gd) > 0 && isagg(p, gd)
incol = parentdf[!, source_cols]
agg = check_aggregate(last(p))
outcol = agg(incol, gd)
118 changes: 118 additions & 0 deletions test/grouping.jl
@@ -2691,4 +2691,122 @@ end
@test map(nrow, gdf) == [1, 1, 1]
end

@testset "isagg fix" begin
for fun in (sum, prod, mean, var, std, sum∘skipmissing, prod∘skipmissing,
mean∘skipmissing, var∘skipmissing, std∘skipmissing),
col in ([1, 2, 3], [big(1.5), big(2.5), big(3.5)], [1 + 0.5im, 2 + 0.5im, 3 + 0.5im],
[true, false, true], [pi, pi, pi], [1//2, 1//3, 1//4],
Real[1, 1.5, 1//2], Number[1, 1.5, 1//2], Any[1, 1.5, 1//2],
[1, 2, missing], [big(1.5), big(2.5), missing], [1 + 0.5im, 2 + 0.5im, missing],
[true, false, missing], [pi, pi, missing], [1//2, 1//3, missing],
Union{Missing,Real}[1, 1.5, missing],
Union{Missing,Number}[1, 1.5, missing], Any[1, 1.5, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end

for fun in (maximum, minimum, maximum∘skipmissing, minimum∘skipmissing),
col in ([1, 2, 3], [big(1.5), big(2.5), big(3.5)],
[true, false, true], [pi, pi, pi], [1//2, 1//3, 1//4],
Real[1, 1.5, 1//2], Number[1, 1.5, 1//2], Any[1, 1.5, 1//2],
[1, 2, missing], [big(1.5), big(2.5), missing],
[true, false, missing], [pi, pi, missing], [1//2, 1//3, missing],
Union{Missing,Real}[1, 1.5, missing],
Union{Missing,Number}[1, 1.5, missing], Any[1, 1.5, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end

for fun in (first, last, length, first∘skipmissing, last∘skipmissing),
col in ([1, 2, 3], [big(1.5), big(2.5), big(3.5)], [1 + 0.5im, 2 + 0.5im, 3 + 0.5im],
[true, false, true], [pi, pi, pi], [1//2, 1//3, 1//4],
Real[1, 1.5, 1//2], Number[1, 1.5, 1//2], Any[1, 1.5, 1//2],
[1, 2, missing], [big(1.5), big(2.5), missing], [1 + 0.5im, 2 + 0.5im, missing],
[true, false, missing], [pi, pi, missing], [1//2, 1//3, missing],
Union{Missing,Real}[1, 1.5, missing],
Union{Missing,Number}[1, 1.5, missing], Any[1, 1.5, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
if fun isa typeof(last∘skipmissing)
# this is another hard corner case
Member:
Why this special case? last(skipmissing(x)) doesn't work outside DataFrames (maybe it should though), and it doesn't seem to work with any type in DataFrames either.

Member Author:
This is a special case because it works in the fast path but fails in the slow path. I have changed the comment.

Member:
Hmm, it works only with GroupedDataFrame input, not with a DataFrame input. I don't get why.

Also, I thought the goal of this PR was to make the slow and fast path behave the same, so shouldn't this always throw an error whatever the eltype?

Member Author:
I did not see the benefit of last∘skipmissing throwing an error in cases where we can efficiently produce a correct result. The fact that last∘skipmissing errors in Base is only due to the O(1) restriction in the last docstring; e.g. first∘skipmissing works in Base because its docstring does not require O(1). I think we should change Base so that last∘skipmissing works.
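
For reference, the Base asymmetry described above (output sketched for Julia of that era; the exact type printing may differ):

julia> first(skipmissing([missing, 1, 2]))   # iteration-based, works
1

julia> last(skipmissing([1, 2, missing]))    # needs lastindex, which SkipMissing lacks
ERROR: MethodError: no method matching lastindex(::Base.SkipMissing{Array{Union{Missing, Int64},1}})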

it works only with GroupedDataFrame input, not with a DataFrame input. I don't get why.

It should work in combine both for GroupedDataFrame and DataFrame and with select on GroupedDataFrame (and pseudo-broadcasting is applied in this case). It is expected to fail for select on DataFrame. Do you observe a different behaviour?

Member:
Yeah, last(skipmissing(x)) and lastindex(skipmissing(x)) should probably be defined when x is an AbstractArray, since they don't require going over the whole collection.
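
A minimal sketch of what such a definition could look like (not actual Base or DataFrames.jl code; it assumes the SkipMissing wrapper stores the parent collection in its x field, as in current Base):

# hypothetical: scan backwards and return the last non-missing element,
# touching only the trailing run of missings rather than the whole array
function Base.last(s::Base.SkipMissing{<:AbstractVector})
    for i in reverse(eachindex(s.x))
        v = s.x[i]
        v === missing || return v
    end
    throw(ArgumentError("argument contains only missing values"))
end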

It should work in combine both for GroupedDataFrame and DataFrame and with select on GroupedDataFrame (and pseudo-broadcasting is applied in this case). It is expected to fail for select on DataFrame. Do you observe a different behaviour?

I get this

julia> df = DataFrame(x=[1])
1×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> combine(df, :x => last ∘ skipmissing)
ERROR: MethodError: no method matching lastindex(::Base.SkipMissing{Array{Int64,1}})

Member Author:
Yeah - I had forgotten about this path. So you have:

julia> df = DataFrame(x=[1])
1×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> combine(df, :x => last ∘ skipmissing)
ERROR: MethodError: no method matching lastindex(::Base.SkipMissing{Array{Int64,1}})

julia> combine(:x => last ∘ skipmissing, df)
1×1 DataFrame
│ Row │ x_last_skipmissing │
│     │ Int64              │
├─────┼────────────────────┤
│ 1   │ 1                  │

which is unfortunate.

I would not touch it now though, as it is not easy to fix.

In the long term we should rewrite the whole engine for select/transform/combine to be unified. Unfortunately, it was designed in stages, and when initially implementing select and transform we did not envision the full ecosystem we have now, so we essentially have two processing engines - one for combine on GroupedDataFrame and the other for select on DataFrame.

This should be implemented so that select/transform/combine on a DataFrame are always essentially calls on a GroupedDataFrame with a group formed by no columns (a single group). Additional things to remember when implementing it:

  • add better support for 0 groups
  • add possibility for a group to have zero rows
  • add requested extensions to the syntax of source_columns => function => target_col_name

Member:
OK. No hurry, maybe just copy your comment to an issue to keep your analysis available.

Member Author:
All this is tracked in separate issues.

if eltype(col) === Any
@test_throws MethodError combine(gdf, :x => fun => :y)
else
@test combine(gdf, :x => fun => :y) ==
combine(groupby_checked(dropmissing(parent(gdf)), :g), :x => fun => :y)
end
@test_throws MethodError combine(gdf, :x => (x -> fun(x)) => :y)
else
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end
end

for fun in (sum, mean, var, std),
col in ([1:3, 4:6, 7:9], [1:3, 4:6, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
if eltype(col) >: Missing
@test_throws MethodError combine(gdf, :x => fun => :y)
@test_throws MethodError combine(gdf, :x => (x -> fun(x)) => :y)
else
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end
end

for fun in (sum∘skipmissing, mean∘skipmissing),
col in ([1:3, 4:6, 7:9], [1:3, 4:6, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end

# workaround https://github.com/JuliaLang/julia/issues/36979
for fun in (var∘skipmissing, std∘skipmissing),
col in ([1:3, 4:6, 7:9], [1:3, 4:6, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
@test_throws MethodError combine(gdf, :x => fun => :y)
@test_throws MethodError combine(gdf, :x => (x -> fun(x)) => :y)
end

for fun in (maximum, minimum, maximum∘skipmissing, minimum∘skipmissing,
first, last, length, first∘skipmissing, last∘skipmissing),
col in ([1:3, 4:6, 7:9], [1:3, 4:6, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
if fun isa typeof(last∘skipmissing)
@test_throws MethodError combine(gdf, :x => fun => :y)
@test_throws MethodError combine(gdf, :x => (x -> fun(x)) => :y)
else
@test combine(gdf, :x => fun => :y) ≅ combine(gdf, :x => (x -> fun(x)) => :y)
end
end

for fun in (prod, prod∘skipmissing),
col in ([1:3, 4:6, 7:9], [1:3, 4:6, missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
@test_throws MethodError combine(gdf, :x => fun => :y)
@test_throws MethodError combine(gdf, :x => (x -> fun(x)) => :y)
end

for fun in (sum, prod, mean, var, std, sum∘skipmissing, prod∘skipmissing,
mean∘skipmissing, var∘skipmissing, std∘skipmissing,
maximum, minimum, maximum∘skipmissing, minimum∘skipmissing,
first, last, length, first∘skipmissing, last∘skipmissing),
col in ([ones(2,2), zeros(2,2), ones(2,2)], [ones(2,2), zeros(2,2), missing],
[DataFrame(ones(2,2)), DataFrame(zeros(2,2)), DataFrame(ones(2,2))],
[DataFrame(ones(2,2)), DataFrame(zeros(2,2)), ones(2,2)],
[DataFrame(ones(2,2)), DataFrame(zeros(2,2)), missing],
[(a=1, b=2), (a=3, b=4), (a=5, b=6)], [(a=1, b=2), (a=3, b=4), missing])
gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
if fun isa typeof(length)
@test combine(gdf, :x => fun => :y) ≅ DataFrame(g=1, y=3)
@test combine(gdf, :x => (x -> fun(x)) => :y) ≅ DataFrame(g=1, y=3)
elseif (fun isa typeof(last) && ismissing(last(col))) ||
(fun isa Union{typeof(maximum), typeof(minimum)} && col ≅ [(a=1, b=2), (a=3, b=4), missing])
# deciding what to do in this case is a hard problem and probably we have to leave it as is
@test combine(gdf, :x => fun => :y) ≅ DataFrame(g=1, y=missing)
@test combine(gdf, :x => (x -> fun(x)) => :y) ≅ DataFrame(g=1, y=missing)
else
@test_throws Union{ArgumentError, MethodError} combine(gdf, :x => fun => :y)
@test_throws Union{ArgumentError, MethodError} combine(gdf, :x => (x -> fun(x)) => :y)
end
end
end

end # module