PERF: groupby with many empty groups memory blowup #30552
Comments
This is the point of the observed keyword. |
So yeah, to second Jeff's comment above: is this a problem with observed=True? |
Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError. |
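For context, a minimal sketch of the difference being discussed (illustrative only, not the original reproducer; the column names are made up and pandas 0.25 or later is assumed):

```python
import pandas as pd

# One categorical key with many unused categories, plus an ordinary key.
cat = pd.Categorical(["a", "b"], categories=list("abcdefghij"))
df = pd.DataFrame({"A": cat, "B": [1, 2], "C": [3, 4]})

# Default (observed=False): the result index is the cross product of all
# 10 categories of A with the observed values of B -> 20 groups, mostly empty.
print(len(df.groupby(["A", "B"], observed=False).size()))  # 20

# observed=True: only the 2 (A, B) combinations that actually occur.
print(len(df.groupby(["A", "B"], observed=True).size()))   # 2
```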
Did we ever discuss changing the default observed keyword? @TomAugspurger
|
I don't recall a discussion about that. I vaguely recall that this will be somewhat solved by having a DictEncodedArray that has a similar data model to Categorical, without the unobserved / fixed categories semantics. |
Hmm I think that is orthogonal. IIUC the memory blowup is because by default we are generating Cartesian products. Would rather just deprecate that and switch the default value for observed |
I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical. |
Sounds like we can close this as no-action?
|
I suppose we would show a PerformanceWarning if we detect this is about to happen; that check is pretty cheap to do and might be worthwhile. |
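No such check exists in pandas; the following is only a rough sketch of what such a pre-groupby heuristic could look like. The function name and threshold are made up, and the estimate mirrors the cross-product logic described above:

```python
import warnings
import pandas as pd

def warn_if_group_blowup(df, keys, threshold=1_000_000):
    """Hypothetical pre-check: estimate how many groups observed=False would
    allocate and warn when that dwarfs the number of rows."""
    n_possible = 1
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            n_possible *= len(col.cat.categories)  # every category counts
        else:
            n_possible *= col.nunique()            # only observed values count
    if n_possible > max(threshold, 10 * len(df)):
        warnings.warn(
            f"groupby with observed=False may allocate ~{n_possible:,} groups "
            f"for only {len(df):,} rows; consider passing observed=True",
            pd.errors.PerformanceWarning,
        )
    return n_possible
```

Calling something like `warn_if_group_blowup(df, ["A", "B"])` before `df.groupby(["A", "B"])` would flag the blowup cheaply, since it only looks at category counts and nunique, not at the groups themselves.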
I cannot reproduce the issue above. When upgrading from pandas 0.24.2 to 0.25.3, I had a memory issue with a groupby().agg() on a data frame with 100,000 rows and 11 columns. I used 8 grouping variables with a mix of categorical and character variables, and the grouping operation was using over 8 GB of memory. Setting the argument observed=True fixed the memory issue. Related Stack Overflow question: Pandas v 0.25 groupby with many columns gives memory error. |
What actions should be taken in this issue? Changing the default, deprecating the argument, documenting, or something else? It is not clear from the current discussion. |
We're not changing the behavior. The only proposal that's on the table is detecting and warning when we have / are about to allocate too much memory.
When you have a fixed set of categories that should persist across operations (e.g. survey results). |
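A small sketch of that use case (made-up survey data; keeping the zero counts for never-chosen answers is exactly the point of observed=False here):

```python
import pandas as pd

# The full, fixed answer scale lives in the dtype, not just in the data.
answers = pd.Categorical(
    ["agree", "agree", "neutral"],
    categories=["disagree", "neutral", "agree"],
)
responses = pd.Series([1, 1, 1])

# observed=False keeps every point on the scale, even with zero responses.
print(responses.groupby(answers, observed=False).count())
# disagree    0
# neutral     1
# agree       2
```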
Thanks. It is a pretty significant regression; not a very new one anymore, but still. And how is it useful for grouping? Empty groups are being returned? |
What's the regression? All categories are present in the output of a groupby by design, including unobserved categories. |
The regression is that code that worked fine on 0.24.2 is now hitting a MemoryError.
Did it have the correct output in 0.24.2? In pandas 0.24.2 I have:

In [8]: cat = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [9]: df = pd.DataFrame({"A": cat, "B": [1, 2], "C": [3, 4]})

In [10]: df.groupby(["A", "B"]).size()
Out[10]:
A  B
a  1    1
b  2    1
dtype: int64

which is incorrect: the unobserved category combinations are silently dropped. So it sounds like you were relying on buggy behavior. |
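For contrast, the designed behavior on 0.25+ with the observed=False default keeps all category combinations; roughly:

```python
import pandas as pd

cat = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": cat, "B": [1, 2], "C": [3, 4]})

# Every category of A is crossed with every observed value of B;
# empty groups show up with size 0 instead of being dropped.
print(df.groupby(["A", "B"], observed=False).size())
# A  B
# a  1    1
#    2    0
# b  1    0
#    2    1
# c  1    0
#    2    0
# dtype: int64
```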
In the context of multiple categorical variables used as the groupby index, persisting a fixed set of categories (inside each categorical variable) is a different issue, i.e. maybe the … |
Sorry @paulrougieux, I don't follow your proposal. Could you add an example of what the two keywords would do?

In [7]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).sum().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

In [8]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).sum().index.dtype
Out[8]: CategoricalDtype(categories=['a', 'b'], ordered=False)

Note that the index dtype keeps the full set of categories either way; observed only controls whether the aggregation function is applied to the unobserved categories' groups as well. |
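Spelled out with the aggregated values rather than the index dtype (a small sketch of the same point; the sum of an empty group shows up as 0):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
grouper = pd.Categorical(["a", "a", "a"], categories=["a", "b"])

# observed=True: the never-seen category "b" gets no row at all.
print(s.groupby(grouper, observed=True).sum())
# a    6

# observed=False: "b" gets a row, with the aggregation of an empty group.
print(s.groupby(grouper, observed=False).sum())
# a    6
# b    0
```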
@TomAugspurger I misunderstood your previous comment about the "fixed set of categories that should persist across operations". Thanks for clarifying. |
I'm interested in working on this, and the related issue (#34162)! I'll dig into it and post my progress on the thread. |
Trying to think through the examples here and whether the columns I usually group by are ones where I would or wouldn't want SQL GROUP BY semantics by default. E.g., for the case of grouping by cities and states, they're both factors and they're not independent. I guess you could say that they should be a single city-state factor?
FWIW, I think this is what I use unstack for but am not sure. Could you post an example workflow to compare? Edit: Well, I just ran into a case where the factors are independent, and I was relying on the current default behavior. You want the Cartesian product in (almost?) all of these cases. E.g., something like counts of species at field sites. Just because you don't observe a deer doesn't mean that you couldn't have observed a deer. But, yeah, I think typically I use an unstack and/or an outer join to solve this. I'll have to pay attention to this keyword in almost every case now that I know about it. |
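One way to reconcile the two needs, sketched with made-up site/species data: aggregate with observed=True so memory stays bounded by the data, then build the full grid explicitly afterwards (the unstack/reindex step stands in for the unstack or outer join mentioned above):

```python
import pandas as pd

obs = pd.DataFrame({
    "site": pd.Categorical(["north", "north", "south"]),
    "species": pd.Categorical(["deer", "fox", "fox"],
                              categories=["deer", "fox", "owl"]),
})

# Only the observed (site, species) pairs become groups...
counts = obs.groupby(["site", "species"], observed=True).size()

# ...and the full site x species grid is materialized explicitly when needed,
# with zeros filled in for combinations that were never observed.
grid = (
    counts.unstack("species", fill_value=0)
          .reindex(columns=obs["species"].cat.categories, fill_value=0)
)
print(grid)
```

The point of the design is that the expensive cross product is opt-in and local to the reporting step, rather than built into every intermediate groupby.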
I believe I warned about this last year here: #17605 (comment) |
removing milestone |
Another datapoint on how one can get stumped by this issue: I never use categoricals, and never expected to, but it so happens that they are used automatically when writing to parquet with partitions:

import pandas as pd
df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 2, 3, 4]})
df.to_parquet('test.parquet', partition_cols='a')
# pd.read_parquet('test.parquet') will return a DataFrame with a categorical column a.

The rest of the story is the same: do a groupby, everything explodes. |
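Continuing that example, a hedged sketch of the read-back and the workaround (assumes a parquet engine such as pyarrow is installed; the file name is the commenter's):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 2, 3, 4]})
df.to_parquet('test.parquet', partition_cols=['a'])

back = pd.read_parquet('test.parquet')
print(back['a'].dtype)  # category -- the partition column comes back categorical

# Passing observed=True keeps the groupby bounded by the data actually present.
print(back.groupby('a', observed=True)['b'].sum())
```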
@TomAugspurger Can you please clarify your answer by distinguishing the case where there is only one grouping variable from the case where there are many grouping variables? I understand the need to persist unobserved categories (similar to factor variables in R) of a single vector across operations, but in the case of many grouping variables, why would you want to keep the set of all possible combinations of categories across operations, if those particular combinations are unobserved in the data? |
Ran into this issue recently as we switched from using CSVs and strings to categoricals and feather files. It took me quite a while to stumble onto it being the datatype issue in the groupby. ➕ on changing the default to observed=True. |
In the last months, we had multiple issues due to this. We changed one column to categorical somewhere in our pipeline and got non-obvious errors in very different parts of it. From my side, also a strong ➕ to change the default to observed=True. As mentioned before and elsewhere in related issues, the behaviour imho is not very intuitive: SQL does not behave like this, nor does the R tidyverse with factors. I also noticed that when grouping by a categorical index column, … |
I definitely think this has some truth to it, but only partially. One reason to use categorical within groupby is to take advantage of …. I do find it surprising that changing from e.g. int to categorical alters the default groupby result, and since … |
Replace "save memory" with "bound memory by data size and instead of merely by product of arities of dimensions of categoricals". I don't think people are understanding the cursive of dimensionality issue here. This is like taking a sparse tensor of N dimensions and going .todense() by default and calling the previous step merely "memory saving". |
Other points to clarify different use cases of categorical variables: …
[UPDATE] … |
If observed=True were the default, would you still find it beneficial to have a warning here? |
I think changing the default would close this. |
Indeed. I changed and deprecated the default in #35967 but you also have to merge the changes :) |
Seems that this performance improvement could be backported to previous versions due to its severity? Also, the code change seems not too complex to backport. |
The improvement here was changing the default value to observed=True. |
Suppose we have a Categorical with many unused categories:
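The code block from the original issue body did not survive extraction; the following is a representative sketch of the setup described, with illustrative names and sizes:

```python
import numpy as np
import pandas as pd

# A Categorical with 100,000 possible categories but only 24 rows of data.
cat = pd.Categorical(np.arange(24), categories=np.arange(100_000))
df = pd.DataFrame({"A": cat, "B": np.arange(24), "C": np.arange(24)})

# With observed=True the result has just the 24 observed (A, B) pairs.
print(len(df.groupby(["A", "B"], observed=True).sum()))  # 24

# With the observed=False default, the result index is instead the cross
# product: 100,000 categories of A x 24 values of B = 2,400,000 rows,
# almost all of them empty.
```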
There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups.
Without the Categorical, but just a large cross-product that implies many empty groups, this works fine:
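Likewise, a representative (not verbatim) sketch of the non-categorical counterpart:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100_000, size=24),  # same large domain, but plain ints
    "B": np.arange(24),
    "C": np.arange(24),
})

# Without a Categorical dtype, only the observed (A, B) pairs become groups,
# so the result has 24 rows no matter how large the potential cross product is.
print(len(df.groupby(["A", "B"]).sum()))  # 24
```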