Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy with sort=False parameter and categorical by produces incorrect aggregation results #3571

Open
andreas-kopecky opened this issue Oct 19, 2021 · 12 comments · Fixed by #3715
Assignees
Labels
Backport 🔙 Issues that need to be backported to previous release(s) bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P2 Minor bugs or low-priority feature requests

Comments

@andreas-kopecky
Copy link

andreas-kopecky commented Oct 19, 2021

System information

  • Ubuntu 20.04:
  • Modin version 0.11.1:
  • Python version: 3.7
  • Code we can use to reproduce:

Create some random data with a few entries in it

import pandas as pd
import numpy as np

data = ["a"] * 50000 + ["b"] * 10000 + ["c"] * 1000
np.random.seed(42)
data = np.array(data)
np.random.shuffle(data)
df = pd.DataFrame({"some_column": data})
df.to_csv("some_data.csv")

Read the resulting csv back with modin

from modin import pandas as md
mdf = md.read_csv("some_data.csv", dtype={"some_column": "category"})
mdf["some_column"].value_counts()

This results in an output like:

a    40153
b    18726
c     2121
name: some_column, dtype: int64

but should return

some_column
a              50000
b              10000
c               1000
dtype: int64
@andreas-kopecky
Copy link
Author

I should note: The output is correct, if you don't use dtype: {"some_column": "category"} so i would guess the problem lies with multiple processes attaching new values to categories during read or something like this

@devin-petersohn
Copy link
Collaborator

Hi @andreas-kopecky thanks for the report! I modified your comment to format the code with syntax highlighting, I hope that's okay!

This looks like a bug with value_counts. When I do a groupby.count I get the correct values:

mdf.groupby("some_column").count()

This gives:

             Unnamed: 0
some_column
a                 50000
b                 10000
c                  1000

@dchigarev Will you be able to take a look at the value_counts implementation?

@devin-petersohn
Copy link
Collaborator

Another note: The semantically equivalent (groupby("some_column")["some_column"].count) is also returning the correct result.

mdf.groupby("some_column")["some_column"].count()

Shows:

some_column
a    50000
b    10000
c     1000
Name: some_column, dtype: int64

@dchigarev
Copy link
Collaborator

I'll take a look

@dchigarev dchigarev self-assigned this Oct 19, 2021
@dchigarev
Copy link
Collaborator

dchigarev commented Oct 19, 2021

This looks like a bug with value_counts. When I do a groupby.count I get the correct values:
>>> mdf.groupby("some_column").count()

However, doing the same, but with sort=False parameter breaks everything:

>> mdf.groupby("some_column", sort=False).count()
             Unnamed: 0
some_column
a                 39924
b                 20076
c                  1000
>>> mdf.groupby("some_column", sort=True).count()
             Unnamed: 0
some_column
a                 50000
b                 10000
c                  1000

value_counts actually passes sort=False to not do unnecessary sorting:

counted_values = self.groupby(by=subset, sort=False, dropna=dropna).size()

So the essential problem is hidden in our GroupBy implementation, but suddenly, the real source of the problem is pandas.

Let's emulate Modin's groupby(sort=False).count flow with pandas:

>>> part1 = pandas.DataFrame({
...     "by_col": ["a", "a", "b", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
>>> part2 = pandas.DataFrame({
...     "by_col": ["b", "a", "a", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
# Map phase: applying an aggregation function to every partition
>>> grp_res_part1 = part1.groupby("by_col", sort=False).count()
>>> grp_res_part2 = part2.groupby("by_col", sort=False).count()
>>> grp_res_part1
        some_col
by_col
a              2  # 2 elements in 'a' group in the first partition
b              1
c              1
>>> grp_res_part2
        some_col
by_col
b              1
a              2 # 2 elements in 'a' group in the second partition
c              1
# Concatenating partition's results to proceed to the reduce phase:
>>> pandas.concat([grp_res_part1, grp_res_part2])
        some_col
by_col
a              2 # still 2 elements for 'a' in the first partition
b              1
c              1
a              1 # but only 1 element for 'a'?
b              2 # and 2 for 'b'?...
c              1

So the data is spoiled by pandas.concat when building an axis partition for the reduced phase of groupby. The actual location of this breaking concat is PandasFrameAxisPartition.deploy_axis_func. The data is spoiled because of a bug in pandas.concat that occurs when you're trying to concatenate frames with categorical indices that have the same values but different order. I've filled an issue about this in pandas repo (pandas-dev/pandas#44099).

It seems that there are two possible solutions on Modin side:

  1. Wait until pandas fix the bug
  2. Freeze sort=True parameter at the map phase of groupby flow in order to avoid concatenating partitions with different index orders.

@devin-petersohn what do you think we should pick?

@dchigarev dchigarev changed the title read_csv with ad-hoc typecasting to "category" dtype produces wrong results GroupBy with sort=False parameter and categorical by produces incorrect aggregation results Oct 19, 2021
@dchigarev dchigarev added the bug 🦗 Something isn't working label Oct 19, 2021
@devin-petersohn
Copy link
Collaborator

I do not want to wait on pandas to fix this, how does the sort=False order compare to pandas order? If we sort in the map phase, it's going to end up being out of order compared to pandas, correct?

@dchigarev
Copy link
Collaborator

Right, this will break the pandas' unsorted order.

When sort=False pandas arrange group names in the result in the same order as they appear in the source frame:

>>> pandas.DataFrame({"col": [10, 0, 5, 0, 10, 4, 4]}).groupby("col", sort=False)["col"].count()
col
10    2
0     2
5     1
4     2
Name: col, dtype: int64

I originally thought that the order is random in the case of sort=False, so sorted Modin's output would fit for that contract, but now it's clear that it doesn't.

@dchigarev
Copy link
Collaborator

dchigarev commented Oct 21, 2021

Using value_counts with categorical data seems like a more common use-case rather than the groupby(sort=False). So, in order to fix the common case at first, we can solve the derivative of the source problem by just replacing sort=False with sort=True at the value_counts, this way we will fix the common case sacrificing performance to the correct behavior (groupby(sort=False) would still returning incorrect results though, so I'm not sure whether we want such fix).

@andreas-kopecky
Copy link
Author

@dchigarev thanks a ton for explaining this bug and also provide a workaround for now.

@devin-petersohn
Copy link
Collaborator

@dchigarev Do we want to get the sort=True fix in place for this release?

dchigarev added a commit to dchigarev/modin that referenced this issue Nov 19, 2021
…n groupby

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev
Copy link
Collaborator

Do we want to get the sort=True fix in place for this release?

@devin-petersohn I've created a PR (#3715), let's see how it goes, we may put it into a minor release.

devin-petersohn pushed a commit that referenced this issue Nov 29, 2021
)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev
Copy link
Collaborator

dchigarev commented Nov 30, 2021

Reopening the issue, since #3715 is just hiding the bug instead of fixing it.

The issue is considered to be closed after:

  1. Pandas will have fixed bug on their side: BUG: pd.concat produces frames with inconsistent order when concating the ones with categorical indices pandas-dev/pandas#44099
  2. We will have removed workaround added in FIX-#3571: fallback to 'sort=True' on categorical keys in groupby #3715

@dchigarev dchigarev reopened this Nov 30, 2021
@devin-petersohn devin-petersohn added the Backport 🔙 Issues that need to be backported to previous release(s) label Dec 14, 2021
devin-petersohn pushed a commit that referenced this issue Dec 15, 2021
)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@RehanSD RehanSD added the P2 Minor bugs or low-priority feature requests label Oct 12, 2022
@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backport 🔙 Issues that need to be backported to previous release(s) bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P2 Minor bugs or low-priority feature requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants