Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR: AssertionError: length(res) > 0 #3217

Closed
ericphanson opened this issue Nov 7, 2022 · 13 comments · Fixed by #3222
Closed

ERROR: AssertionError: length(res) > 0 #3217

ericphanson opened this issue Nov 7, 2022 · 13 comments · Fixed by #3222
Labels
Milestone

Comments

@ericphanson
Copy link
Contributor

ericphanson commented Nov 7, 2022

I see the following on both DataFrames 1.3.6 and 1.4.2:

julia> df = DataFrame(:a => Int[])
0×1 DataFrame

julia> combine(groupby(df, :a), :a => (x -> [(; x)]) => AsTable)
ERROR: AssertionError: length(res) > 0
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:604
 [2] _combine_prepare_norm(gd::GroupedDataFrame{DataFrame}, cs_vec::Vector{Any}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:84
 [3] _combine_prepare(gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:50
 [4] combine(::GroupedDataFrame{DataFrame}, ::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}, ::Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}}; keepkeys::Bool, ungroup::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:710
 [5] combine(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat})
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:710
 [6] top-level scope
   @ REPL[40]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:345 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:600
 [3] _combine_prepare_norm(gd::GroupedDataFrame{DataFrame}, cs_vec::Vector{Any}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:84
 [4] _combine_prepare(gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:50
 [5] combine(::GroupedDataFrame{DataFrame}, ::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}, ::Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}}; keepkeys::Bool, ungroup::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:710
 [6] combine(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat})
   @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:710
 [7] top-level scope
   @ REPL[40]:1

    nested task error: AssertionError: length(res) > 0
    Stacktrace:
     [1] _combine_process_pair_astable(optional_i::Bool, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}}, out_col_name::Type{AsTable}, firstmulticol::Bool, ::Base.RefValue{Any}, wfun::Base.RefValue{Any}, wincols::Base.RefValue{Any})
       @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:403
     [2] _combine_process_pair(::Base.RefValue{Any}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}})
       @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:506
     [3] macro expansion
       @ ~/.julia/packages/DataFrames/JZ7x5/src/groupeddataframe/splitapplycombine.jl:592 [inlined]
     [4] (::DataFrames.var"#659#665"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Vector{Int64}}, Bool, Pair{Int64, Pair{var"#51#52", DataType}}})()
       @ DataFrames ./threadingconstructs.jl:258

If the input is wrong, I don't think this should be an AssertionError. Maybe an ArgumentError, or ErrorException; assertions should be only for logical errors in the program, not input based errors. xref https://docs.julialang.org/en/v1/base/base/#Base.@assert

Although in this case I am not sure if it is an actual logic error being caught or an incorrect use of @assert and the error should be something else.

@bkamins bkamins added the bug label Nov 7, 2022
@bkamins bkamins added this to the patch milestone Nov 7, 2022
@bkamins
Copy link
Member

bkamins commented Nov 7, 2022

The current code is a bug, but let us discuss what we prefer:

  • throwing an ArgumentError then stating something like "cannot determine the columns in the output because no data was produced"
  • trying to guess the output, which seems to be possible in some cases, e.g. I would implement the following detection logic:
        if length(res) > 0
            kp1 = keys(res[1])
        else
            et = eltype(res)
            if isconcretetype(et)
                if et <: Tuple
                    kp1 = length(et.parameters)
                elseif et <: NamedTuple
                    kp1 = first(et.parameters)
                else
                    kp1 = Symbol[]
                end
            else
                kp1 = Symbol[]
            end
        end

Then for Tuple and NamedTuple (if their concrete type can be inferred by the compiler) we get the correct column names, and for other types we would just assume that no columns are produced. The only problem I have with this is that such solution feels a bit hacky. Maybe it is safer to throw an error?

CC @nalimilan

@ericphanson
Copy link
Contributor Author

ericphanson commented Nov 7, 2022

I think this might be a clearer case (better matching my actual use) where there's maybe a way it could reasonable work:

julia> df = DataFrame(:a => Int[]);

julia> f(x) = [(; x=sum(x))]
f (generic function with 1 method)

julia> f(Int[])
1-element Vector{NamedTuple{(:x,), Tuple{Int64}}}:
 (x = 0,)

julia> combine(groupby(df, :a), :a => f => AsTable)
ERROR: AssertionError: length(res) > 0
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/bza1S/src/groupeddataframe/splitapplycombine.jl:739

Based on #2297 I thought this would evaluate f(Int[]) to figure out the output columns, but on second thought I think that issue only covers the case

julia> combine(groupby(df, :a), :a => sum)
0×2 DataFrame
 Row │ a      a_sum
     │ Int64  Int64
─────┴──────────────

Anyway, my vote would be for a solution that does not rely on type inference, and throws a clear errors when the output columns cannot be determined. However it would be nice if there was a way to write code like this that can propagate columns in the empty case. Maybe something like calling https://tables.juliadata.org/dev/#Tables.schema to see if the schema can be determined.

@bkamins
Copy link
Member

bkamins commented Nov 7, 2022

I thought this would evaluate f(Int[]) to figure out the output columns

It does. However, normally it is used to derive type of the column, while for AsTable target type is often not enough (it is enough in other cases as we do not have to check how many columns are created). It is enough only in cases where column type encodes element structure (as it needs to be deconstructed). This is the case for Tuple and NamedTuple, but not the case for e.g. Dict, or Vector.

Also the issue is that with AsTable target what we say is that the number and names of columns are derived from the returned elements and they must be the same. So, e.g., it is not sure if this first f(Int[]) call does not return something completely different than would be returned in other cases.

So this means that we have to hardcode cases when this is doable + decide what to do when this is not doable (throw an error or create zero columns - as in my proposed code above). What do you think?

@bkamins
Copy link
Member

bkamins commented Nov 7, 2022

Alternatively we could throw an error when we cannot determine target columns:

        if length(res) > 0
            kp1 = keys(res[1])
        else
            et = eltype(res)
            if isconcretetype(et)
                if et <: Tuple
                    kp1 = length(et.parameters)
                elseif et <: NamedTuple
                    kp1 = first(et.parameters)
                else
                    throw(ArgumentError("operation returned no rows; cannot determine target columns"))
                end
            else
                throw(ArgumentError("operation returned no rows; cannot determine target columns"))
            end
        end

In general problematic cases are code like this:

combine(groupby(df, :a), :a => (x -> [(a=x, b=x), (a=x, c=x)]) => AsTable) # throws error correctly
combine(groupby(df, :a), :a => (x -> [(a=x, b=x), (a=x, b=copy(x))]) => AsTable) # throws error incorrectly, as column names could be determined
combine(groupby(df, :a), :a => (x -> [(a=x, b=x), (a=x, b=x)]) => AsTable) # works correctly
combine(groupby(df, :a), :a => (x -> [(a=x, b=1), (a=x, b=1)]) => AsTable) # works incorrectly since the returned value is invalid

In summary: to get it fully right there are many corner cases to consider. For this reason initially we assumed that something should be returned and made an assertion.

@nalimilan - this lead me to thinking that the decision in #2297 sometimes can be questionable, e.g.:

julia> combine(groupby(df, :a), :a => x -> isempty(x) ? ('a':'b') : [1,2]) |> Tables.columntable
(a = Int64[], a_function = Char[])

@nalimilan
Copy link
Member

Relying on inference for this sounds problematic, as a change in the compiler could make code suddenly error (and we don't even provide a workaround to specify expected output types manually). Yet the last example you give shows that using the type of the result when called on an empty group is also problematic. We could add an argument to use that value instead of throwing an error if the user so wishes -- that's pretty obscure, but the error could mention it. I don't really know what else we can do otherwise.

I wish there would be a way to detect when literal column names are used to build a named tuple, as it's a common case. DataFramesMeta can special case this kind of pattern as it can access the syntax.

combine(groupby(df, :a), :a => (x -> [(a=x, b=1), (a=x, b=1)]) => AsTable) # works incorrectly since the returned value is invalid

This case throws an error. Am I missing something?

@bkamins
Copy link
Member

bkamins commented Nov 8, 2022

This case throws an error. Am I missing something?

Now it errors. It would work if we relied on inference. The reason is that then we would identify that column names are :a and :b and would think that all is OK, but it would not be as the named tuple mixes vectors and scalars. Again - we could be even more careful and check against this case.

What I think we can do currently:

  1. for AsTable output - change @assert to ArgumentError that would give a clear information why the operation fails. We can change it in the future if someone asks for this functionality. I understand that @ericphanson just asks for more clarity now (i.e. @ericphanson - please confirm that you do not require this functionality, just want a better handling of this case - right?)
  2. for other outputs - we need to keep what we implemented in combine/select/transform on empty GroupedDataFrame #2297 as changing this would be breaking. I would just add an information to the documentation how things behave.

So in summary I propose two actions:

  1. change assertion to ArgumentError for table output.
  2. add information to documentation how empty GroupedDataFrame is handled.

The reason for such a proposal is that I expect that aggregation of empty GroupedDataFrame is rare and in practice not really needed (but maybe I am wrong?) and therefore adding a kwarg to handle this case seems an overkill (the mechanics of aggregation of GroupedDataFrame is already quite complicated). Fortunately I am @assert fan which allowed @ericphanson to catch this case 😄.

@ericphanson
Copy link
Contributor Author

Yeah, I do not require this, I was just confused by the error and thought something was wrong. I refactored my code already to manually handle the empty case. Perhaps this can be added as a caveat for AsTable, since IIUC, this kind of issue should go away if I manually specified the output columns.

@bkamins
Copy link
Member

bkamins commented Nov 8, 2022

IIUC, this kind of issue should go away if I manually specified the output columns.

If you mean this:

combine(groupby(df, :a), :a => (x -> [(a=x, b=x)]) => [:p, :q])

then it will still error. The reason is that we need to capture not only column names but also column types.

So to make it clear: empty GroupedDataFrame is now handled for single column output (with the problems I have commented about above - which I will document). For multi column output - it now errors in all cases.

The question is: is it OK that it errors for multi column output always or should we try to handle this case?

@ericphanson
Copy link
Contributor Author

ericphanson commented Nov 8, 2022

Ah, I see, OK. I think it is OK and a clear error is better than an obscure one, but it would be a nice to have some clean way to handle the empty case.

What if we could do

combine(groupby(df, :a), :a => (x -> [(a=x, b=x)]) => AsTable(schema))

where schema is a Tables.jl Schema object? (These could be easier to write w JuliaData/Tables.jl#314, fwiw). I think they aren't super widely used right now but it could be the "right" way to encode this stuff and could be more common w more support.

@bkamins
Copy link
Member

bkamins commented Nov 9, 2022

After thinking about it more the issue can be cleanly resolved I think using the following code:

        if isempty(res)
            res = firstres
        end
        kp1 = isempty(res) ? () : kp1 = keys(res[1])

The logic is:

  • if we have no groups use the "dummy" firstres to perform all the operations of column detection (both type and name);
  • if "dummy" firstres is empty then create no columns.

This means that we do not rely on type inference, but follow the current documentation: create columns that have keys following keys of the returned values (so in particular if we have no groups and also "dummy" firstres has no rows we create no columns).

I will need to think about it a bit more and test.

@bkamins
Copy link
Member

bkamins commented Nov 9, 2022

Fixed in #3222

@nalimilan
Copy link
Member

OK, so the idea is that even when we pass zero rows, we pass columns with the right type, so we expect that either the user-provided function returns values of the same type whether there are zero or more rows, or the user doesn't care that the result may have a different type when the input GroupedDataFrame is empty and when it's not. Sounds reasonable.

@bkamins
Copy link
Member

bkamins commented Nov 12, 2022

Yes - this was the assumption previously (when doing a single column output PR) and we keep it.

In other words: we assume that the function the user provided behaves in the same way if it gets data from a group (passed vectors will have positive length) and when it there are no groups (passed vectors will have zero length).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants