GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571

andreas-kopecky · 2021-10-19T09:33:40Z

System information

Ubuntu 20.04:
Modin version 0.11.1:
Python version: 3.7
Code we can use to reproduce:

Create some random data with a few entries in it

import pandas as pd
import numpy as np

data = ["a"] * 50000 + ["b"] * 10000 + ["c"] * 1000
np.random.seed(42)
data = np.array(data)
np.random.shuffle(data)
df = pd.DataFrame({"some_column": data})
df.to_csv("some_data.csv")

Read the resulting csv back with modin

from modin import pandas as md
mdf = md.read_csv("some_data.csv", dtype={"some_column": "category"})
mdf["some_column"].value_counts()

This results in an output like:

a    40153
b    18726
c     2121
name: some_column, dtype: int64

but should return

some_column
a              50000
b              10000
c               1000
dtype: int64

The text was updated successfully, but these errors were encountered:

andreas-kopecky · 2021-10-19T09:37:24Z

I should note: The output is correct, if you don't use dtype: {"some_column": "category"} so i would guess the problem lies with multiple processes attaching new values to categories during read or something like this

devin-petersohn · 2021-10-19T13:55:59Z

Hi @andreas-kopecky thanks for the report! I modified your comment to format the code with syntax highlighting, I hope that's okay!

This looks like a bug with value_counts. When I do a groupby.count I get the correct values:

mdf.groupby("some_column").count()

This gives:

             Unnamed: 0
some_column
a                 50000
b                 10000
c                  1000

@dchigarev Will you be able to take a look at the value_counts implementation?

devin-petersohn · 2021-10-19T13:57:48Z

Another note: The semantically equivalent (groupby("some_column")["some_column"].count) is also returning the correct result.

mdf.groupby("some_column")["some_column"].count()

Shows:

some_column
a    50000
b    10000
c     1000
Name: some_column, dtype: int64

dchigarev · 2021-10-19T14:11:09Z

I'll take a look

dchigarev · 2021-10-19T16:52:24Z

This looks like a bug with value_counts. When I do a groupby.count I get the correct values:
>>> mdf.groupby("some_column").count()

However, doing the same, but with sort=False parameter breaks everything:

>> mdf.groupby("some_column", sort=False).count()
             Unnamed: 0
some_column
a                 39924
b                 20076
c                  1000
>>> mdf.groupby("some_column", sort=True).count()
             Unnamed: 0
some_column
a                 50000
b                 10000
c                  1000

value_counts actually passes sort=False to not do unnecessary sorting:

modin/modin/pandas/base.py

Line 2889 in 0f944a1

counted_values = self.groupby(by=subset, sort=False, dropna=dropna).size()

So the essential problem is hidden in our GroupBy implementation, but suddenly, the real source of the problem is pandas.

Let's emulate Modin's groupby(sort=False).count flow with pandas:

>>> part1 = pandas.DataFrame({
...     "by_col": ["a", "a", "b", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
>>> part2 = pandas.DataFrame({
...     "by_col": ["b", "a", "a", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
# Map phase: applying an aggregation function to every partition
>>> grp_res_part1 = part1.groupby("by_col", sort=False).count()
>>> grp_res_part2 = part2.groupby("by_col", sort=False).count()
>>> grp_res_part1
        some_col
by_col
a              2  # 2 elements in 'a' group in the first partition
b              1
c              1
>>> grp_res_part2
        some_col
by_col
b              1
a              2 # 2 elements in 'a' group in the second partition
c              1
# Concatenating partition's results to proceed to the reduce phase:
>>> pandas.concat([grp_res_part1, grp_res_part2])
        some_col
by_col
a              2 # still 2 elements for 'a' in the first partition
b              1
c              1
a              1 # but only 1 element for 'a'?
b              2 # and 2 for 'b'?...
c              1

So the data is spoiled by pandas.concat when building an axis partition for the reduced phase of groupby. The actual location of this breaking concat is PandasFrameAxisPartition.deploy_axis_func. The data is spoiled because of a bug in pandas.concat that occurs when you're trying to concatenate frames with categorical indices that have the same values but different order. I've filled an issue about this in pandas repo (pandas-dev/pandas#44099).

It seems that there are two possible solutions on Modin side:

Wait until pandas fix the bug
Freeze sort=True parameter at the map phase of groupby flow in order to avoid concatenating partitions with different index orders.

@devin-petersohn what do you think we should pick?

devin-petersohn · 2021-10-20T14:48:28Z

I do not want to wait on pandas to fix this, how does the sort=False order compare to pandas order? If we sort in the map phase, it's going to end up being out of order compared to pandas, correct?

dchigarev · 2021-10-21T16:48:00Z

Right, this will break the pandas' unsorted order.

When sort=False pandas arrange group names in the result in the same order as they appear in the source frame:

>>> pandas.DataFrame({"col": [10, 0, 5, 0, 10, 4, 4]}).groupby("col", sort=False)["col"].count()
col
10    2
0     2
5     1
4     2
Name: col, dtype: int64

I originally thought that the order is random in the case of sort=False, so sorted Modin's output would fit for that contract, but now it's clear that it doesn't.

dchigarev · 2021-10-21T16:56:23Z

Using value_counts with categorical data seems like a more common use-case rather than the groupby(sort=False). So, in order to fix the common case at first, we can solve the derivative of the source problem by just replacing sort=False with sort=True at the value_counts, this way we will fix the common case sacrificing performance to the correct behavior (groupby(sort=False) would still returning incorrect results though, so I'm not sure whether we want such fix).

andreas-kopecky · 2021-10-27T07:57:00Z

@dchigarev thanks a ton for explaining this bug and also provide a workaround for now.

devin-petersohn · 2021-11-18T22:12:52Z

@dchigarev Do we want to get the sort=True fix in place for this release?

…n groupby Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2021-11-19T18:41:18Z

Do we want to get the sort=True fix in place for this release?

@devin-petersohn I've created a PR (#3715), let's see how it goes, we may put it into a minor release.

) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2021-11-30T12:55:27Z

Reopening the issue, since #3715 is just hiding the bug instead of fixing it.

The issue is considered to be closed after:

Pandas will have fixed bug on their side: BUG: pd.concat produces frames with inconsistent order when concating the ones with categorical indices pandas-dev/pandas#44099
We will have removed workaround added in FIX-#3571: fallback to 'sort=True' on categorical keys in groupby #3715

) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev self-assigned this Oct 19, 2021

dchigarev changed the title ~~read_csv with ad-hoc typecasting to "category" dtype produces wrong results~~ GroupBy with sort=False parameter and categorical by produces incorrect aggregation results Oct 19, 2021

dchigarev added the bug 🦗 Something isn't working label Oct 19, 2021

dchigarev added a commit to dchigarev/modin that referenced this issue Nov 19, 2021

FIX-modin-project#3571: fallback to 'sort=True' on categorical keys i…

8fce53c

…n groupby Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Nov 19, 2021

FIX-#3571: fallback to 'sort=True' on categorical keys in groupby #3715

Merged

7 tasks

devin-petersohn closed this as completed in #3715 Nov 29, 2021

devin-petersohn pushed a commit that referenced this issue Nov 29, 2021

FIX-#3571: fallback to 'sort=True' on categorical keys in groupby (#3715

c67a7c5

) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev reopened this Nov 30, 2021

devin-petersohn added the Backport 🔙 Issues that need to be backported to previous release(s) label Dec 14, 2021

devin-petersohn pushed a commit that referenced this issue Dec 15, 2021

FIX-#3571: fallback to 'sort=True' on categorical keys in groupby (#3715

7c14cac

) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

anmyachev mentioned this issue Mar 25, 2022

groupby.count() regression #4344

Open

RehanSD added the P2 Minor bugs or low-priority feature requests label Oct 12, 2022

anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571

GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571

andreas-kopecky commented Oct 19, 2021 •

edited by devin-petersohn

Loading

andreas-kopecky commented Oct 19, 2021

devin-petersohn commented Oct 19, 2021

devin-petersohn commented Oct 19, 2021

dchigarev commented Oct 19, 2021

dchigarev commented Oct 19, 2021 •

edited

Loading

devin-petersohn commented Oct 20, 2021

dchigarev commented Oct 21, 2021

dchigarev commented Oct 21, 2021 •

edited

Loading

andreas-kopecky commented Oct 27, 2021

devin-petersohn commented Nov 18, 2021

dchigarev commented Nov 19, 2021

dchigarev commented Nov 30, 2021 •

edited

Loading

GroupBy with sort=False parameter and categorical by produces incorrect aggregation results #3571

GroupBy with sort=False parameter and categorical by produces incorrect aggregation results #3571

Comments

andreas-kopecky commented Oct 19, 2021 • edited by devin-petersohn Loading

System information

andreas-kopecky commented Oct 19, 2021

devin-petersohn commented Oct 19, 2021

devin-petersohn commented Oct 19, 2021

dchigarev commented Oct 19, 2021

dchigarev commented Oct 19, 2021 • edited Loading

devin-petersohn commented Oct 20, 2021

dchigarev commented Oct 21, 2021

dchigarev commented Oct 21, 2021 • edited Loading

andreas-kopecky commented Oct 27, 2021

devin-petersohn commented Nov 18, 2021

dchigarev commented Nov 19, 2021

dchigarev commented Nov 30, 2021 • edited Loading

GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571

GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571

andreas-kopecky commented Oct 19, 2021 •

edited by devin-petersohn

Loading

dchigarev commented Oct 19, 2021 •

edited

Loading

dchigarev commented Oct 21, 2021 •

edited

Loading

dchigarev commented Nov 30, 2021 •

edited

Loading