GroupBy with `sort=False` parameter and categorical `by` produces incorrect aggregation results #3571
I should note: the output is correct if you don't use `sort=False`.
Hi @andreas-kopecky, thanks for the report! I modified your comment to format the code with syntax highlighting, I hope that's okay! This looks like a bug with `mdf.groupby("some_column").count()`, which gives:
@dchigarev Will you be able to take a look at this?
Another note: the semantically equivalent `mdf.groupby("some_column")["some_column"].count()` shows:
I'll take a look.
However, doing the same but with the `sort` parameter toggled:

```python
>>> mdf.groupby("some_column", sort=False).count()
             Unnamed: 0
some_column
a                 39924
b                 20076
c                  1000
>>> mdf.groupby("some_column", sort=True).count()
             Unnamed: 0
some_column
a                 50000
b                 10000
c                  1000
```
Line 2889 in 0f944a1
At first glance the essential problem seems hidden in our GroupBy implementation, but the real source of the problem is pandas. Let's emulate Modin's map-reduce flow with plain pandas:

```python
>>> part1 = pandas.DataFrame({
...     "by_col": ["a", "a", "b", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
>>> part2 = pandas.DataFrame({
...     "by_col": ["b", "a", "a", "c"],
...     "some_col": [1, 2, 3, 4],
... }).astype({"by_col": "category"})
# Map phase: applying an aggregation function to every partition
>>> grp_res_part1 = part1.groupby("by_col", sort=False).count()
>>> grp_res_part2 = part2.groupby("by_col", sort=False).count()
>>> grp_res_part1
        some_col
by_col
a              2  # 2 elements in the 'a' group in the first partition
b              1
c              1
>>> grp_res_part2
        some_col
by_col
b              1
a              2  # 2 elements in the 'a' group in the second partition
c              1
# Concatenating the partitions' results to proceed to the reduce phase:
>>> pandas.concat([grp_res_part1, grp_res_part2])
        some_col
by_col
a              2  # still 2 elements for 'a' in the first partition
b              1
c              1
a              1  # but only 1 element for 'a'?
b              2  # and 2 for 'b'?...
c              1
```

So the data is spoiled by `concat`. It seems that there are two possible solutions on the Modin side:
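As an illustration, here is a minimal sketch of one possible Modin-side direction (hypothetical, not necessarily what the eventual fix does): strip the categorical dtype from the by-column during the map phase, so every partition's result carries a plain object index that `concat` cannot scramble, and the reduce phase then sums correct values:

```python
import pandas as pd

part1 = pd.DataFrame({
    "by_col": ["a", "a", "b", "c"],
    "some_col": [1, 2, 3, 4],
}).astype({"by_col": "category"})
part2 = pd.DataFrame({
    "by_col": ["b", "a", "a", "c"],
    "some_col": [1, 2, 3, 4],
}).astype({"by_col": "category"})

def map_count(df):
    # Hypothetical workaround: group on plain object values instead of the
    # categorical, so the per-partition result has an ordinary Index.
    return df.astype({"by_col": object}).groupby("by_col", sort=False).count()

# Reduce phase: sum the per-partition counts for each group label.
reduced = (
    pd.concat([map_count(part1), map_count(part2)])
    .groupby(level=0, sort=False)
    .sum()
)
print(reduced)  # 'a' now correctly has 2 + 2 = 4 elements
```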
@devin-petersohn what do you think we should pick?
I do not want to wait on pandas to fix this; how does the …
Right, this will break pandas' unsorted order. When `sort=False` is specified, pandas keeps the groups in the order of their first occurrence:

```python
>>> pandas.DataFrame({"col": [10, 0, 5, 0, 10, 4, 4]}).groupby("col", sort=False)["col"].count()
col
10    2
0     2
5     1
4     2
Name: col, dtype: int64
```

I originally thought that the order is random in the case of `sort=False`, but it is actually the order of first occurrence.
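Because of this first-occurrence semantics, a workaround that internally forces `sort=True` could still recover the original `sort=False` order afterwards, for example by reindexing with the first occurrence of each key (a sketch of the idea, not necessarily what the actual fix does):

```python
import pandas as pd

df = pd.DataFrame({"col": [10, 0, 5, 0, 10, 4, 4]})

# Safe path: aggregate in sorted order.
sorted_res = df.groupby("col", sort=True)["col"].count()

# Restore pandas' sort=False semantics: order keys by first occurrence.
first_seen = df["col"].drop_duplicates()
restored = sorted_res.reindex(first_seen)
print(restored)  # index order: 10, 0, 5, 4
```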
Using …
@dchigarev thanks a ton for explaining this bug and also providing a workaround for now.
@dchigarev Do we want to get the …
…n groupby Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@devin-petersohn I've created a PR (#3715), let's see how it goes; we may put it into a minor release.
Reopening the issue, since #3715 just hides the bug instead of fixing it. The issue is considered to be closed after:
System information

- Create some random data with a few entries in it
- Read the resulting CSV back with Modin

This results in an output like:

but should return: