Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

combine/select/transform on empty GroupedDataFrame #2297

Closed
bkamins opened this issue Jun 19, 2020 · 14 comments · Fixed by #2324
Closed

combine/select/transform on empty GroupedDataFrame #2297

bkamins opened this issue Jun 19, 2020 · 14 comments · Fixed by #2324
Labels
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Jun 19, 2020

Currently combine/select on GroupedDataFrame with 0 groups just executes no operation and returns a result with grouping columns only (or empty). This is a bit inconsistent as the output shape of the operation depends on the number of groups.

We have several options here:

  1. leave things as they are
  2. throw an error if length(::GroupedDataFrame) is 0 (this is a safest option and probably is not a problem in practice as probably almost no real use cases combine empty groups)
  3. Throw a dummy input (i.e. with correct shape and element types but with zero rows) to grouping functions and just capture the shape and eltype of the output

Point 3. is related to a more general issue of allowing empty groups (we currently disallow them). They naturally happen if e.g. we would like to do aggregation over all levels of a categorical column, where not all levels are present.

Option 2. (throw an error) is a safe choice for 1.0, as later we would have an option to switch to any behaviour we like.

@bkamins bkamins added breaking The proposed change is breaking. decision grouping labels Jun 19, 2020
@bkamins bkamins added this to the 1.0 milestone Jun 19, 2020
@Nosferican
Copy link
Contributor

Option 1

@Nosferican
Copy link
Contributor

Option 2

@Nosferican
Copy link
Contributor

Option 3

@alecloudenback
Copy link
Contributor

Is this similar to #1837?

@bkamins
Copy link
Member Author

bkamins commented Jun 19, 2020

Is this similar to #1837?

No it is orthogonal. The question is what should happen when you work on a GroupedDataFrame like groupby(DataFrame(a=[], b=[], c=[]), [:a, :b]).

@nalimilan
Copy link
Member

Option 3 is probably the best in the long term, but Option 2 is good in the short term as it allows doing anything after 1.0.

@bkamins
Copy link
Member Author

bkamins commented Jun 23, 2020

Actually we have to go for option 3 to fix a bug in combine called on a DataFrame that has 0-rows.

@bkamins
Copy link
Member Author

bkamins commented Jul 5, 2020

OK, if we go for option 3 (which I agree) then the rules would be:

  1. for combine/select/transform on an empty gdf return a data frame with 0 rows
  2. for select/transform on an empty df return a data frame with 0 rows
  3. for combine on an empty df return a data frame with as many rows as would be implied by the returned values (this case breaks symmetry in the code, so I want to confirm that we agree that this is the desired behaviour before I start implementing it)

@nalimilan
Copy link
Member

2 is actually consistent with 3 AFAICT, it's just that select/transform additionally require the functions to return the same number of values as their inputs. But the functions would still be called on the empty inputs to find out the return type, right?

For 1, it's tricky since functions have to be called on an empty pseudo group to find out return types. Is that what you have in mind?

@bkamins
Copy link
Member Author

bkamins commented Jul 6, 2020

You have summarized it correctly - I just wanted to be explicit.

So now regarding implementation:

  1. requires an implementation of a new code path in _combine
  2. this is how things work now, so we do not have to change anything
  3. this is actually tricky to make things work as in the code path implemented for 1. we need to make another branch signaling that we want to retain the produced rows and I have to think how to do it in the most clean way.

To be clear where the challenge is if we accept these rules:

combine(x -> DataFrame(a=1), DataFrame())

will produce DataFrame(a=1) in result, but:

combine(x -> DataFrame(a=1), groupby(DataFrame(), []))

will produce DataFrame(a=Int[]) in result.

I am spelling this example out to double check that we all agree this is how things should work.

@Nosferican
Copy link
Contributor

combine(::Function, ::GroupedDataFrame) makes sense but didn't know combine(::Function, ::DataFrame) would work.
Is a DataFrame considered a GroupedDataFrame with 0 or 1 group?

@bkamins
Copy link
Member Author

bkamins commented Jul 6, 2020

DataFrame is not considered to be a GroupedDataFrame.

combine means "combine rows". You can combine rows in a GroupedDataFrame (then you combine them per group) or you can combine rows in a DataFrame (then you combine all rows in a data frame).


The key thing in what we discuss is a difference between 0 groups and one or more groups with zero rows per group.
The latter (zero rows per group) is currently disallowed (there were requests for this functionality though to allowing filling-in missing levels in CategoricalVector).

So technically you could say that DataFrame is considered to be a GroupedDataFrame with 1 group and no grouping columns, except that we have a corner case that 0-row data frame has to have 0 groups as we disallow 0 row groups.

@nalimilan
Copy link
Member

So technically you could say that DataFrame is considered to be a GroupedDataFrame with 1 group and no grouping columns, except that we have a corner case that 0-row data frame has to have 0 groups as we disallow 0 row groups.

Maybe we'd better consider a 0-row DataFrame as having 1 empty group. GroupedDataFrame doesn't allow that, but that's not necessarily a problem: that way there's no inconsistency, just something that works for DataFrame but not for GroupedDataFrame. And it could be supported later.

In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows. Not sure it's worth it.

@bkamins
Copy link
Member Author

bkamins commented Jul 7, 2020

Maybe we'd better consider a 0-row DataFrame as having 1 empty group. GroupedDataFrame doesn't allow that, but that's not necessarily a problem

This is the point of this issue to do exactly what you write 😄.

In doubt, we could throw an error when returning a non-zero number of rows from combine when the input has no rows.

It would not reduce the complexity of the code (I anyway have to "hardcode" a case "0-row DataFrame as having 1 empty group" for the purpose of combine) and would be inconsistent I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants