
PERF: groupby with many empty groups memory blowup #30552

Closed
jbrockmendel opened this issue Dec 29, 2019 · 36 comments · Fixed by #51811
Labels
Categorical · Groupby · Performance

Comments

@jbrockmendel
Member

Suppose we have a Categorical with many unused categories:

cat = pd.Categorical(range(24), categories=range(10**5))

df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

gb = df.groupby(["A", "B", "C"])

>>> gb.size()  # memory balloons to 9+ GB before I kill it

There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups.

Without the Categorical, but just a large cross-product that implies many empty groups, this works fine:

df = pd.DataFrame({n: range(12) for n in range(8)})

gb = df.groupby(list(range(7)))
gb.size() # <-- works fine
@jreback
Contributor

jreback commented Dec 29, 2019

this is the point of the observed keyword

@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type Groupby Performance Memory or execution speed performance labels Dec 30, 2019
@WillAyd
Member

WillAyd commented Jan 2, 2020

So yeah, to second Jeff's comment above: is this still a problem with observed=True?

@jbrockmendel
Member Author

Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.
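For reference, a minimal sketch of that fix on the repro from the top of the issue (nothing new here beyond the keyword; with observed=True the group count stays bounded by the number of rows):

```python
import pandas as pd

# The original repro: 24 rows, but 10**5 categories on column "A".
cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

# With observed=True, only the combinations actually present in the data
# become groups, so the result has at most len(df) == 24 rows.
sizes = df.groupby(["A", "B", "C"], observed=True).size()
print(len(sizes))  # 24
```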

@WillAyd
Member

WillAyd commented Jan 2, 2020 via email

@TomAugspurger
Contributor

I don't recall a discussion about that.

I vaguely recall that this will be somewhat solved by having a DictEncodedArray that has a similar data model to Categorical, without the unobserved / fixed categories semantics.

@WillAyd
Member

WillAyd commented Jan 2, 2020

Hmm I think that is orthogonal. IIUC the memory blowup is because by default we are generating Cartesian products. Would rather just deprecate that and switch the default value for observed

@TomAugspurger
Contributor

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

@jbrockmendel
Member Author

jbrockmendel commented Jan 3, 2020 via email

@jreback
Contributor

jreback commented Jan 3, 2020

I suppose we would show a PerformanceWarning if we detect this would happen, which is pretty cheap to do

might be worthwhile
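A back-of-the-envelope version of that check (the helper name and threshold here are hypothetical, not pandas internals): the worst-case group count is just the product of each grouper's number of levels, which can be computed before any groups are materialized.

```python
import math
import warnings

import pandas as pd


def warn_if_cartesian_blowup(frame, keys, threshold=1_000_000):
    """Hypothetical pre-flight check: warn when observed=False could
    allocate far more groups than there are rows."""
    n_levels = []
    for key in keys:
        col = frame[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Categoricals contribute ALL categories, observed or not.
            n_levels.append(len(col.cat.categories))
        else:
            n_levels.append(col.nunique())
    potential_groups = math.prod(n_levels)
    if potential_groups > threshold:
        warnings.warn(
            f"groupby with observed=False may allocate up to "
            f"{potential_groups} groups for {len(frame)} rows; "
            "consider observed=True"
        )
    return potential_groups


cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24)})
warn_if_cartesian_blowup(df, ["A", "B", "C"])  # 10**5 * 24 * 24 combinations
```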

@paulrougieux

paulrougieux commented Jan 14, 2020

I cannot reproduce the issue above with cat = pd.Categorical(range(24), categories=range(10**5)) on pandas 0.25.3.

When upgrading from pandas 0.24.2 to 0.25.3, I hit a memory issue with a groupby().agg() on a data frame with 100,000 rows and 11 columns. I used 8 grouping variables with a mix of categorical and character variables, and the grouping operation was using over 8 GB of memory.

Setting the argument observed=True:

df.groupby(index, observed=True)

fixed the memory issue.

Related Stack Overflow question: Pandas v 0.25 groupby with many columns gives memory error.

Maybe observed=True should be the default? At least beyond a certain ratio of observed to all possible combinations. When the observed combinations of categorical values are far fewer than all possible combinations of those values, it is clear that it doesn't make sense to use observed=False. Is there a discussion of why the default was set to observed=False?

@jangorecki

What are the actions to be taken in this issue? Changing the default, deprecating the argument, documenting, or something else? It is not clear from the current discussion.
Could anyone provide a use case for observed=False?

@TomAugspurger
Contributor

We're not changing the behavior. The only proposal that's on the table is detecting and warning when we have / are about to allocate too much memory.

Could anyone provide a use case for observed=False?

When you have a fixed set of categories that should persist across operations (e.g. survey results).

@jangorecki

jangorecki commented Mar 23, 2020

Thanks. It is a pretty significant regression; not a very new one at this point, but still.

And how is it useful for grouping? Are empty groups being returned?
Usually this is addressed by a right outer join to such a category dictionary.

@TomAugspurger
Contributor

What's the regression? All categories are present in the output of a groupby by design, including unobserved categories.

@jangorecki

jangorecki commented Mar 23, 2020

The regression is that code that worked fine on 0.24.2 now hits a MemoryError.

@TomAugspurger
Contributor

Did it have the correct output in 0.24.2? In pandas 0.24.2 I have

In [8]: cat = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [9]: df = pd.DataFrame({"A": cat, "B": [1, 2], "C": [3, 4]})

In [10]: df.groupby(["A", "B"]).size()
Out[10]:
A  B
a  1    1
b  2    1
dtype: int64

which is incorrect. So it sounds like you were relying on buggy behavior.

@paulrougieux

paulrougieux commented Mar 25, 2020

@jangorecki
Could anyone provide a use case for observed=False?

@TomAugspurger
When you have a fixed set of categories that should persist across operations (e.g. survey results).

In the context of multiple categorical variables used as the groupby index, persisting a fixed set of categories (inside each categorical variable) is a different issue. Maybe the observed argument should be disentangled into two arguments: (1) one argument to simply keep unobserved categories; for example, observed=False would tell pandas to keep unobserved categories inside the categorical variable (i.e. not change the categories). And (2) another argument that would tell pandas to preallocate memory for all possible combinations of the categorical variables; for example, preallocate=True would preallocate memory and preallocate=False would not.
But I might have misunderstood what @TomAugspurger meant.

@TomAugspurger
Contributor

Sorry @paulrougieux, I don't follow your proposal. Could you add an example what the two keywords would do?

observed=True/False doesn't affect the dtype of the index. That will always be the same dtype as the grouper.

In [7]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).sum().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

In [8]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).sum().index.dtype
Out[8]: CategoricalDtype(categories=['a', 'b'], ordered=False)

It only controls whether the aggregation function is applied to the unobserved category's groups as well.
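Concretely, the difference shows up in the rows of the result rather than in its index dtype:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
grouper = pd.Categorical(["a", "a", "a"], categories=["a", "b"])

# observed=False applies the aggregation to the unobserved group too:
# "b" appears with sum 0 (the empty-group result for sum).
print(s.groupby(grouper, observed=False).sum().to_dict())  # {'a': 6, 'b': 0}

# observed=True drops the empty group entirely.
print(s.groupby(grouper, observed=True).sum().to_dict())   # {'a': 6}
```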

@paulrougieux

@TomAugspurger I misunderstood your previous comment about the "fixed set of categories that should persist across operations ". Thanks for clarifying.

@arw2019
Member

arw2019 commented May 14, 2020

I'm interested in working on this, and the related issue (#34162)!

I'll dig into it and post my progress on the thread.

@jangorecki

@arw2019 the recently discussed #32918 looks to be related as well

@jseabold
Contributor

jseabold commented Sep 1, 2020

Trying to think through the examples here and whether my usual use cases more often do or don't want SQL GROUP BY semantics by default. E.g., for the case of grouping by cities and states, they're both factors and they're not independent. I guess you could say that they should be a single city-state factor?

When you have a fixed set of categories that should persist across operations (e.g. survey results).

FWIW, I think this is what I use unstack for but am not sure. Could you post an example workflow to compare?

Edit: Well, I just ran into a case where the factors are independent, and I was relying on the current default behavior. You want the Cartesian product in (almost?) all of these cases. E.g., something like counts of species at field sites. Just because you don't observe a deer doesn't mean that you couldn't have observed a deer. But, yeah, I think typically I use an unstack and/or an outer join to solve this. I'll have to pay attention to this keyword in almost every case now that I know about it.
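A sketch of that unstack/outer-join style workflow, using hypothetical field-survey data (the column names and level lists are made up for illustration): group with observed=True so memory stays bounded by the data, and reindex against the full Cartesian product only when the dense view is actually wanted.

```python
import pandas as pd

# Hypothetical field-survey data: not every species is seen at every site.
sites = ["A", "B", "C"]      # all field sites
species = ["deer", "fox"]    # all species of interest
df = pd.DataFrame({
    "site": ["A", "A", "B"],
    "species": ["deer", "fox", "fox"],
})

# Memory is bounded by the observed rows...
counts = df.groupby(["site", "species"]).size()

# ...and the dense Cartesian product is recovered on demand,
# filling never-observed combinations with 0.
full = counts.reindex(
    pd.MultiIndex.from_product([sites, species], names=["site", "species"]),
    fill_value=0,
)
print(int(full.loc[("C", "deer")]))  # 0 — site C never recorded a deer
```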

@jreback jreback added this to the 1.2 milestone Dec 3, 2020
@jorisvandenbossche jorisvandenbossche modified the milestones: 1.2, 1.3 Dec 8, 2020
@cottrell
Contributor

Did we ever discuss changing the default observed keyword? @TomAugspurger

On Jan 2, 2020, jbrockmendel wrote: passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError

I believe I warned about this last year here: #17605 (comment)

@simonjayhawkins
Member

removing milestone

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@psarka

psarka commented Oct 14, 2021

Another datapoint on how one can get stumped by this issue:

I never use categoricals, and never expected to, but it so happens that they are used automatically when writing to parquet with partitions:

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 2, 3, 4]})
df.to_parquet('test.parquet', partition_cols='a')

# pd.read_parquet('test.parquet') will return a DataFrame with a categorical column a.

The rest of the story is the same, do a group by, everything explodes.
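If the categorical dtype is unwanted after such a round-trip, one defensive option is to opt out per call or to strip the dtype once, up front. The snippet below simulates the column dtype a partitioned parquet round-trip produces (the parquet I/O itself, which needs pyarrow, is omitted):

```python
import pandas as pd

# Simulate the dtype of column "a" after a partitioned parquet round-trip:
# the partition column comes back as categorical.
df = pd.DataFrame({
    "a": pd.Categorical([1, 0, 1, 0]),
    "b": [1, 2, 3, 4],
})

# Either opt out per groupby call...
per_call = df.groupby("a", observed=True)["b"].sum()

# ...or cast the partition column back to its underlying dtype once.
df["a"] = df["a"].astype(df["a"].cat.categories.dtype)
up_front = df.groupby("a")["b"].sum()

print(per_call.to_dict(), up_front.to_dict())  # both {0: 6, 1: 4}
```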

@paulrougieux

paulrougieux commented Oct 14, 2021

Could anyone provide a use case for observed=False?

When you have a fixed set of categories that should persist across operations (e.g. survey results).

@TomAugspurger Can you please clarify your answer by distinguishing the case where there is only one grouping variable from the case where there are many? I understand the need to persist unobserved categories of a single vector across operations (similar to factor variables in R), but with many grouping variables, why would you want to keep every possible combination of categories when those particular combinations are unobserved in the data?

@joeyearsley

Ran into this issue recently as we switched from using CSVs and strings to categoricals and Feather files.

It took me quite a while to work out that the dtype in the groupby was the issue. ➕ on changing the default to observed=True

@corriebar
Contributor

In the last months, we had multiple issues due to this. We changed one column to categorical somewhere in our pipeline and got non-obvious errors in very different parts of it. From my side, also a strong ➕ for changing the default to observed=True.

As mentioned before and elsewhere in related issues, the behaviour imho is not very intuitive: SQL does not behave like this, nor does the R tidyverse with factors.
I was especially confused that having a single categorical grouper in a list of multiple groupers turns all groupers (behaviour-wise) categorical, i.e. it outputs the Cartesian product.
I think the only example I've seen mentioned so far where this might be useful would be survey data with e.g. Likert scales. I work with survey data, but we have different answer options for different questions: some are Likert scale, some are yes/no, etc. When using groupby to get counts per question per answer option, the Cartesian product is not really what you want.
It is certainly not clear how to reasonably show empty groups when using multiple groupers and only one is categorical but to me using observed=True as the default seems like the better option than making non-intuitive assumptions.

I also noticed that when grouping by a categorical index column, observed=True does not drop empty groups.

@rhshadrach
Member

rhshadrach commented Oct 4, 2022

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of observed=False. However another reason to use categorical generally is to save memory with e.g. string columns. I would hazard a guess that the latter is much more prevalent than the former, though I have nothing to back this up with.

I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since observed=False can result in memory issues, I think there is a good case to make observed=True the default. I'm +1 here.

@cottrell
Contributor

cottrell commented Oct 5, 2022

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of observed=False. However another reason to use categorical generally is to save memory with e.g. string columns. I would hazard a guess that the latter is much more prevalent than the former, though I have nothing to back this up with.

I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since observed=False can result in memory issues, I think there is a good case to make observed=True the default. I'm +1 here.

Replace "save memory" with "bound memory by the data size instead of by the product of the arities of the categorical dimensions". I don't think people are understanding the curse-of-dimensionality issue here. This is like taking a sparse tensor of N dimensions, going .todense() by default, and calling the previous step merely "memory saving".

@paulrougieux

paulrougieux commented Oct 5, 2022

  • Many users in this thread are strong proponents of having observed=True as the default. Alternatively, they could also avoid using categorical variables at all when groupby() operations are needed.
  • Sample code and data from actual users of observed=False could help us understand how their work would be harmed by switching to a default of observed=True.
    • Maybe the authors of Quantipy or PandaSurvey can provide sample code and data? (These are returned for a Google search of "pandas survey data package", but have not been in active development for 4 and 8 years respectively, according to the latest commits.)

Other points to clarify different use cases of categorical variables:

  1. A difference has to be made between
    • the case where there are only one or two grouping variables and no risk of combinatorial explosion,
    • and the case with many grouping variables, where there is a risk of combinatorial explosion.
  2. As already mentioned by @corriebar categorical variables do not behave like this in R.
  3. The different uses for categorical variables should also be clarified: to keep survey options even if there was no reply, to keep and reorder named variables in a graph (such as ordering country names by population size instead of alphabetical order for example), as the result of using a partition column in a parquet file df.to_parquet(partition_cols="a"), many other uses come here ....
  4. The observed argument was switched from True to False in pandas version 0.25. Maybe the earlier behaviour was buggy, as @TomAugspurger wrote at the beginning of this issue. Instead of specifying it at each function call, switching observed to True or False could be dealt with globally; I don't know if that makes sense, using for example an environment variable? [UPDATE] Such a switch would probably be a bad idea.
  5. There has been recent activity related to the observed=False argument in the changelog of version 1.5.1 in interaction with a bug when using dropna=False.

[UPDATE]
@jankatins wrote in pull request #35967 that this reminds him of the stringsAsFactors argument in R. A detailed story, "stringsAsFactors: An unauthorized biography":

"The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. [...] By default, ‘stringsAsFactors’ is set to TRUE."
"Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. [...]"
"In the old days, when R was primarily being used by statisticians and statistical types, this setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted factor. [...]
Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix." [...]
"Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions."

@rhshadrach
Member

@jbrockmendel

Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.

If observed=True was the default, would you still find it beneficial to have a warning here?

@jbrockmendel
Member Author

I think changing the default would close this.

@jseabold
Contributor

jseabold commented Mar 7, 2023

I think changing the default would close this.

Indeed. I changed and deprecated the default in #35967 but you also have to merge the changes :)

@Alexia-I

It seems this performance improvement could be backported to previous versions, given its severity? Also, the code change seems not too complex to backport.

@rhshadrach
Member

The improvement here was changing the default value to observed from False to True. We should not be backporting API changes.
