-
Notifications
You must be signed in to change notification settings - Fork 670
FIX-#3252: Fix GroupBy.__getitem__ #3298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3298 +/- ##
===========================================
- Coverage 85.32% 61.50% -23.83%
===========================================
Files 150 145 -5
Lines 15826 15681 -145
===========================================
- Hits 13504 9645 -3859
- Misses 2322 6036 +3714
Continue to review full report at Codecov.
|
…upBy API Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
4d0c6f7 to
a30a933
Compare
|
@dchigarev , fix pydocstyle job, please. |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
| """ | ||
| index = [] if index is None else index | ||
| columns = [] if columns is None else columns | ||
| index = slice(None) if index is None else index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug in a BaseQueryCompiler.view regarding to the incorrect default values (implicitly causes some problems in a Groupby.getitem for a base backend). Ideally, this should be a separate PR fixing this, but maybe we can keep it here since the fix is just about fixing the typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to move this to a new PR to track the change at a finer granularity.
modin/pandas/groupby.py
Outdated
| f, dict | ||
| ), "'{0}' object is not callable and not a dict".format(type(f)) | ||
|
|
||
| internal_by = self._get_internal_by() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a full code duplication from _wrap_aggregation because these functions are actually equal. In #3197 _apply_agg_function will be removed as well as this duplication
|
|
||
| @_inherit_docstrings(pandas.core.groupby.SeriesGroupBy) | ||
| class SeriesGroupBy(DataFrameGroupBy): | ||
| dtype = DataFrameGroupBy.dtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this attribute was missing
| if isinstance(agg_func, dict) | ||
| else agg_func(grp, **agg_args) | ||
| ) | ||
| if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this condition has the same meaning as here, the difference is that the base backend already handles all of the cases naturally except this one with the selection.
Ideally, we want to move all of the exceptions checks to the front-end, this would help us to get rid of the code duplication for different backend in terms of raising proper errors and would help to get errors as soon as possible. But this is a task for a different PR
| to_drop = df.columns.intersection(by_part) | ||
| if selection is not None: | ||
| to_drop = to_drop.difference(selection) | ||
| if drop and len(to_drop) > 0: | ||
| df.drop(columns=to_drop, errors="ignore", inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want to drop conflicting columns if they're presented in the selection
| """ | ||
| index = [] if index is None else index | ||
| columns = [] if columns is None else columns | ||
| index = slice(None) if index is None else index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this here.
modin/data_management/functions/default_methods/groupby_default.py
Outdated
Show resolved
Hide resolved
…ndex' param Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
…d dtypes Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
|
@dchigarev , what is the issue with partitions communication? |
…re accurately Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
| if isinstance(agg_func, dict): | ||
| apply_indices = tuple(agg_func.keys()) | ||
| if selection is not None and apply_indices is None: | ||
| apply_indices = selection if is_list_like(selection) else (selection,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apply_indices is a parameter of broadcast_apply_full_axis which indicates the indices to apply a function on, so the function could be applied only to the required partitions
modin/data_management/functions/default_methods/groupby_default.py
Outdated
Show resolved
Hide resolved
@gshimansky it's directly caused by recently reported #3376: aggregated columns has an intersection with the "by", because of that conflict, "a" columns is not presented in the result. The reasons for that are described in the actual issue. btw, on the current master the reproducer you provided also does not include "a" to the result |
Ok I see. I'll add a comment with this code into that bug too, to be included in tests. |
Because base query compiler (as any other non-pandas QCs) can't handle I'll create it as soon as we merge this PR Content to copy into a future issue
|
| return GroupByDefault.register(pandas.core.groupby.DataFrameGroupBy.sum)( |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
|
@YarShev @gshimansky @modin-project/modin-core all comments were addressed, PR is ready for review now |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
YarShev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev , LGTM, thanks! Left one comment.
|
@modin-project/modin-core , @modin-project/modin-omnisci reminder to review. |
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really big diff, and I'm working my way through it. I left a couple of questions for my understanding.
| drop : bool, default: False | ||
| If `by` is a QueryCompiler indicates whether or not by-data came | ||
| from the `self`. | ||
| selection : label or list of labels, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like doing a mask before the groupby, right? Trying to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly, it's more like doing a mask on the data, that's required to be aggregated, so it's like mask of the raw frame's data after grouping:
Code
>>> df
a b c
0 1 5 3
1 1 4 4
2 2 3 5
>>> df[["a", "b"]].groupby("a").sum()
b
a
1 9
2 3
>>> df.groupby("a")[["a", "b"]].sum()
a b
a
1 2 9
2 2 3We can't achieve an effect of aggregating 'by' columns only doing a mask before groupby, unless we'll detach 'by' data from the frame:
Code
>>> df[["a", "b"]].groupby(df["a"]).sum()
a b
a
1 2 9
2 2 3For now, there's no convention on what backend's groupby obtains as 'by', in some cases, it's detached columns (as query compiler of the 'by' columns), in other cases, it's a list of string column names.
It's not obvious which way of passing 'by' was added by mistake and so which one should be avoided, most of our implementation implies that 'by' is a query compiler, supposing that 'by' as query compiler is the true way, however #2744 notation (as I thought this PR should show an idealistic API that we want to follow) it forces us to always use columns names as 'by's.
To be able to work properly with both of 'by' types, we must pass extra information about actual columns that have to be aggregated (selection parameter).
BTW, our current groupby implementation for pandas backend always translates "detached" 'by' into the "attached" ones ([1] [2]) so we will always need to generate some kind of meta-info about actual selection (if we remove selection backend parameter, we still need this info, so it would be computed anyway inside backend's groupby).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this seems like it would be a great developer blog post for how to support this because of how inconsistent and weird it is. @aregm @Garra1980 agree?
It is tempting to add the new API and pass the state through to the backend (and eventually the compute kernels), but I would like to avoid this if possible. We can get the desired behavior by adding a .copy() to the grouping column if the __getitem__ comes after the groupby:
In [1]: import modin.pandas as pd
In [2]: df = pd.DataFrame([[1,5,3], [1,4,4], [2,3,5]], columns=list("abc"))
In [3]: df
Out[3]:
a b c
0 1 5 3
1 1 4 4
2 2 3 5
In [4]: df.groupby("a")[["a", "b"]].sum() # incorrect behavior
Out[4]:
b
a
1 9
2 3
In [5]: df.groupby(df["a"])[["a", "b"]].sum() # also incorrect, semantically equivalent to In[4] in pandas and modin
Out[5]:
b
a
1 9
2 3
In [6]: df.groupby(df["a"].copy())[["a", "b"]].sum() # adding copy at the API layer fixes it
Out[6]:
a b
a
1 2 9
2 2 3
In [7]: df[["a", "b"]].groupby("a").sum() # correct behavior for __getitem__ before groupby
Out[7]:
b
a
1 9
2 3Does this make sense? We can avoid adding additional internal APIs this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of doing a copy, we lose information about where this column came from (whether it originally was from the 'self' frame or it was an external Series). In some cases, GroupBy results depend on the origin of the 'by' data. Anyway, I can try this and see how it goes.
As for now, I decided to detach and move a small piece of this PR that's fixing the actual #3252 issue into a separate PR (#3463), so we can merge the fix until the 0.11 release.
As for this PR (#3298), it's more about supporting groupby.__getitem__ in a proper way, it's much more general than the bug fix, so I convert it to draft for now and will try to get rid of adding 'selection' as a new backend's API.
| """ | ||
| index = [] if index is None else index | ||
| columns = [] if columns is None else columns | ||
| index = slice(None) if index is None else index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to move this to a new PR to track the change at a finer granularity.
| """Compute groupby aggregation for a single partition.""" | ||
| grouped_df = df.groupby(by=by, axis=axis, **groupby_kwargs) | ||
| if partition_selection is not None: | ||
| grouped_df = grouped_df[partition_selection] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why move the mask logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not masking, it's actual selection. grouped_df here is a GroupBy object
|
@dchigarev , what is the status here? |
|
@YarShev Regarding this PR, we probably need to close this or convert it to a draft until the new implementation will come along. |
|
@dchigarev , I am okay with the proposal. We should try to put the fix in 0.12 release. With respect to this PR, I suggest closing it and opening a new one. |
|
#3490 is planned for 0.13 |
Ok, don't mind. |
|
@dchigarev please convert this to draft if it's not ready to be reviewed. |
I'm not author of this PR, @YarShev can you do it? |
done |
What do these changes do?
@dchigarev writes:
Note:
GroupBy.__getitem__allows to select specific columns of the source frame to apply aggregation functions on.Usage examples
This PR changes the way of handling columns selection for aggregation in GroupBy. Previously the selection was done by masking the source frame with the
keycolumns:modin/modin/pandas/groupby.py
Lines 341 to 348 in 253ecd9
this approach does not grab 'by' columns to the resulted source frame which leads to
KeyErrors ifself._byis presented by a list of column names, it also does not allow to select 'by' columns to be aggregated, because by default pandas don't aggregate 'by' columns and masking can't make it do so.It was decided to get rid of masking the source frame at
__getitem__and extend GroupBy backend's API by addingselectionparameter that caries information about columns to aggregate.This parameter could help not only for the
__getitem__selection, but to offload backends of mimicking pandas behavior in terms of resolving naming conflicts whenas_index=False. Ideally, the following code should be located at the front-end,selectionparameter allows to move this to the front and pass the proper columns to aggregate as aselectionto the backend:modin/modin/experimental/engines/omnisci_on_ray/frame/data.py
Lines 444 to 456 in 253ecd9
ASV results
Launch command:
Benchmarks were run at this commit which adds groupby benchmarks for the 'selection' parameter
There are some cases for which performance has decreased (ASV marked them by pluses), they're related to the selection for wide frames, reasons for that are described here.
Performance results
TODO list
flake8 modinblack --check modingit commit -sGroupBy.__getitem__does not include 'by' columns into resulted object which leads toKeyError#3252, GroupBy and Agg Breaking with Multiple Parameters #3262docs/developer/architecture.rstis up-to-date