-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#3571: fallback to 'sort=True' on categorical keys in groupby #3715
Conversation
…n groupby Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@@ -2894,7 +2894,7 @@ def value_counts( | |||
): | |||
if subset is None: | |||
subset = self._query_compiler.columns | |||
counted_values = self.groupby(by=subset, sort=False, dropna=dropna).size() | |||
counted_values = self.groupby(by=subset, dropna=dropna, observed=True).size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort=False -> sort=True
so the users won't see the warning thrown by groupbyobserved=False -> observed=True
because newly added tests forDataFrame.value_counts
showed that it's have to be set (see the example):
Example
>>> df = pandas.DataFrame({"a": [1, 1, 2, 2], "b": [2, 2, 3, 3]}, dtype="category")
>>> df
a b
0 1 2
1 1 2
2 2 3
3 2 3
>>> df.value_counts()
a b
1 2 2
2 3 2
dtype: int64
>>> df.groupby(["a", "b"]).size()
a b
1 2 2
3 0
2 2 0
3 2
dtype: int64
>>> df.groupby(["a", "b"], observed=True).size()
a b
1 2 2
2 3 2
dtype: int64
by = np.array(["a"] * 50000 + ["b"] * 10000 + ["c"] * 1000) | ||
random_state = np.random.RandomState(seed=42) | ||
random_state.shuffle(by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the bug is partitioning dependent, that's why I'm carrying the same data all around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify - you mean the shuffle is in place so that the bug gets tripped (bc if we didn't shuffle and the values weren't replicated across the partitions, we wouldn't hit the bug?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. For this data shuffled with this specific seed we know for sure that there is a problem caused by the incorrect order of categories in the reduce phase of groupby (see the issue explanation).
This problem might not be visible when the values are in a different order, so I'm using exact data from the #3571 report as a solid reproducer of the issue.
# The bug only occurs in the case of Categorical 'by', so we might want to check whether any of | ||
# the 'by' dtypes is Categorical before going into this branch, however triggering 'dtypes' | ||
# computation if they're not computed may take time, so we don't do it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modin_frame._dtypes
attribute is None
most of the time, its computation requires calling MapReduce and actual data materialization. I don't think we want to trigger this on every groupby(sort=False)
Codecov Report
@@ Coverage Diff @@
## master #3715 +/- ##
==========================================
+ Coverage 85.47% 88.67% +3.19%
==========================================
Files 197 197
Lines 16341 16345 +4
==========================================
+ Hits 13968 14494 +526
+ Misses 2373 1851 -522
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, besides that looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@dchigarev please fix the merge conflicts.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@RehanSD do you have any more comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Looks good to me! |
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
sort=False
parameter and categoricalby
produces incorrect aggregation results #3571docs/developer/architecture.rst
is up-to-dateSince this PR doesn't fix the actual problem but just hiding it, we should keep the #3571 open after the PR is merged.