-
Notifications
You must be signed in to change notification settings - Fork 670
FIX-#1294: fixed 'value_counts' implementation #2730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2730 +/- ##
==========================================
+ Coverage 85.33% 88.77% +3.44%
==========================================
Files 150 150
Lines 15955 15925 -30
==========================================
+ Hits 13615 14138 +523
+ Misses 2340 1787 -553
Continue to review full report at Codecov.
|
c77b537 to
f45ede8
Compare
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dchigarev! Let me know if this makes sense
|
@devin-petersohn implementation of Implementation of def value_counts(self, ...):
counted_values = self.groupby( # 1-st backend-API call: MapReduce + Broadcast
by=subset or self.columns, sort=not sort, dropna=dropna
).size()
if sort:
counted_values.sort_values( # 2-nd backend-API call: FullAxisFn
ascending=ascending
)
if normalize:
counted_values = ( # 3-th and 4-th API calls:
counted_values / counted_values.sum() # sum: Map
) # divide: Map + Broadcast
return counted_valuesAs we can see, there are 4 calls to the backend-API now, some of them trigger the creation of FullAxisPartitions, which means concatenate all partitions, and then split it again (expensive operation). In an ideal world, backend-API calls shouldn't be expensive, however all those things with repartitioning and broadcasting make them so. Implementations at the frontend look clean and helps to unload query compiler from unnecessary methods, but it also has the disadvantage of many backend-API calls. What is your opinion about this? |
|
I think it is at most 4, but the common case of the def value_counts(self, ...):
counted_values = self.groupby( # 1-st backend-API call: MapReduce + Broadcast
by=subset or self.columns, sort=sort and not ascending, dropna=dropna
).size()
if ascending:
counted_values = counted_values.sort_values( # 2-nd backend-API call: FullAxisFn
ascending=ascending
)
if normalize:
counted_values = ( # 3-th and 4-th API calls:
counted_values / counted_values.sum() # sum: Map
) # divide: Map + Broadcast
return counted_valuesLong term, this communication overhead will be resolved by the query optimizer. |
It can't, since >>> sr = pandas.Series([1, 2, 2, 3])
>>> sr.value_counts(sort=True)
2 2
1 1
3 1
dtype: int64
>>> sr.groupby(sr, sort=True).size()
1 1
2 2
3 1In a long term perspective, it's good to have this implementation at the front, but just mentioning that in comparison with 0.8.3 we will have a noticeable performance regression in |
|
@dchigarev I see, yes. So Do you have performance numbers if |
|
@devin-petersohn here is the performance with implementation at the API level data |
f2d4462 to
3516431
Compare
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev Do we want this on 0.11? I think it would be good to add it there.
modin/pandas/test/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this hide bugs related to index name?
Yes, I'd also like to have this in 0.11, I'll add this issue to the release tracker when it'll be created |
36189a6 to
143e763
Compare
|
@gshimansky @modin-project/modin-core all the comments have been addressed, PR is ready for review |
YarShev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| # TODO: uncomment the following lines when #3331 issue will be closed | ||
| # @prepend_to_notes( | ||
| # """ | ||
| # In comparison with pandas, Modin's ``value_counts`` returns Series with ``MultiIndex`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am starting to think that we need to have some additional "compatibility mode" option in Modin that would try to reproduce the closest possible to Pandas behavior even at an expense of performance. For example #3298 also introduces a known difference, but in that case it is very subtle. In this case an object of different class (Index instead of MultiIndex) may really break some user's application when they replace import pandas with import modin.pandas.
What do you think if adding some MODIN_MAXIMUM_PANDAS_COMPAT option to enable workarounds for such differences?
Possibly this shouldn't be discussed in this PRs comments and brought somewhere outside.
|
@gshimansky , could you file an issue to discuss your suggestion there? |
vnlitvinov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, but I have a few questions.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
1dfc27b
vnlitvinov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…o issue_1294 Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
|
@devin-petersohn could you please have a look? |
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor style thing and a question. Looks great!
| if sort: | ||
| counted_values.sort_values(ascending=ascending, inplace=True) | ||
| if normalize: | ||
| counted_values = counted_values / counted_values.sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight style preference here:
| counted_values = counted_values / counted_values.sum() | |
| return counted_values / counted_values.sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't agree with this because there potentially could be some extra logic of postprocessing the final result
modin/pandas/test/utils.py
Outdated
| # why this clumsy if-statement is used. | ||
| res.index = res.index.droplevel(0) | ||
| res.name = series.name | ||
| # GroupBy discards original index names, so restoring them here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we do this in the test? Can we do this in the code to make sure this will always pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed code regarding to the Series name restoring and suddenly all tests have passed.
As for the index names, we can't do this names restoring in the GroupBy code because its natural behaviour is to set index names with the 'by' labels. Since the logic in this function is just about preprocessing frame before its comparison we want to index names be untouched, that's why we need this names restoring here.
I've clarified this at the in-code comment
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev Looks good. Can you clear the conflicts?
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
|
@devin-petersohn conflicts were solved and CI is green |
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dchigarev, LGTM
|
Needs review from @modin-project/modin-omnisci |
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
This PR moves
value_countsimplementation to the API level, technicallyvalue_countsisgroupby().size(), so this is the way it was implemented at the front-end.Side-effects:
In addition to
groupbyitself, there are some more steps to cover thevalue_countsparameters, all of these steps can't be done via groupby, so they're applied to the counted values separately from groupby. These separate applying generates up to 3 extra backend-API calls, since there is no query optimizer (for pandas backend) that would pack multiple calls into a single one we're getting an extra overhead of potential resplitting the frame several times during a singlevalue_countscall (more info at this comment).This new behavior leads to a performance regression in certain cases (for pandas backend only), but as it was agreed, we're okay with this for now (we believe in a bright future with a decent query optimizer).
Performance changes:
Here are the results of
ASVbenchmarks, they were run on 24 cores with a "Big" dataset, #2730 branch against the current master.Run command:
I manually added
sort=Falsecase, becausesort_values()is currently defaulting to pandas and so we don't want to compare defaulting-to-pandas vs non-defaulting-to-pandas.ASV results
Expanded timings and comparison with pandas are posted here
Formal todo list:
flake8 modinblack --check modingit commit -s