-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement mode(dropna=False) #20779
Conversation
""" | ||
Returns the mode(s) of the Categorical. | ||
|
||
Always returns `Categorical` even if only one value. | ||
|
||
Parameters | ||
---------- | ||
dropna : boolean, default True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add versionadded (anytime you add the param)
pandas/tests/frame/test_analytics.py
Outdated
@@ -889,6 +890,40 @@ def test_mode(self): | |||
dtype=df["C"].dtype)}) | |||
tm.assert_frame_equal(df.mode(), exp) | |||
|
|||
def test_mode_dropna(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally you can parameterize on dropna=[True, False]
(to_timedelta(['1 days', 'nan', 'nan']), to_timedelta(['NaT'])), | ||
(to_timedelta(['1 days', 'nan']), to_timedelta(['nan', '1 days'])) | ||
]) | ||
def test_mode_dropna(self, values, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here (you prob can simply compute the expected value based on whether you drop or not in the test)
Codecov Report
@@ Coverage Diff @@
## master #20779 +/- ##
==========================================
+ Coverage 91.84% 91.85% +<.01%
==========================================
Files 153 153
Lines 49540 49546 +6
==========================================
+ Hits 45501 45509 +8
+ Misses 4039 4037 -2
Continue to review full report at Codecov.
|
90bc567
to
b523d23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions on the results where you have a string nan being returned. otherwise looks good.
pandas/core/frame.py
Outdated
@@ -7124,10 +7128,6 @@ def quantile(self, q=0.5, axis=0, numeric_only=True, | |||
a b | |||
0.1 1.3 3.7 | |||
0.5 2.5 55.0 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you remove on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that's a mistake. Thanks.
|
||
@pytest.mark.parametrize('dropna, expected1, expected2, expected3', [ | ||
(True, ['b'], ['bar'], ['nan']), | ||
(False, ['b'], ['bar', np.nan], ['nan']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the expected string nan here? that's really odd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because when Series(..., dtype=str)
is specified in the test itself then np.nan
gets converted to 'nan'
and treated like any other string. In this case the dropna
parameter has no effect. The same thing happens with value_counts()
:
In [2]: data = ['foo', 'bar', 'bar', np.nan, np.nan, np.nan]
In [3]: s = pd.Series(data, dtype=str)
In [4]: s.mode(dropna=True)
Out[4]:
0 nan
dtype: object
In [5]: s.mode(dropna=False)
Out[5]:
0 nan
dtype: object
In [6]: s.value_counts(dropna=True)
Out[6]:
nan 3
bar 2
foo 1
dtype: int64
In [7]: s.value_counts(dropna=False)
Out[7]:
nan 3
bar 2
foo 1
dtype: int64
@pytest.mark.parametrize('dropna, expected1, expected2', [ | ||
(True, ['1900-05-03', '2011-01-03', '2013-01-02'], | ||
['2011-01-03', '2013-01-02']), | ||
(False, [np.nan], ['nan', '2011-01-03', '2013-01-02']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I think that because we have Series(..., dtype='M8[ns]')
in the test itself the 'nan'
gets converted into a NaT
so it's not actually 'nan'
that is expected but NaT
. I can change the 'nan'
to NaT
in the parametrize part to make this clearer if necessary. There's the same issue in the timedelta test I think.
6b41f60
to
fc33475
Compare
tm.assert_series_equal(Series(c).mode(), exp) | ||
tm.assert_series_equal(s.mode(dropna), expected2) | ||
|
||
@pytest.mark.skipif(not compat.PY3, reason="only PY3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that in Python 3 a warning is raised if the resulting modes cannot be sorted but there is no warning in Python 2:
Python 3
In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan])
In [3]: s.mode(False).sort_values().reset_index(drop=True)
Out[3]:
/home/paul/pandas-reidy-p/pandas/core/algorithms.py:840: UserWarning: Unable
to sort modes: '<' not supported between instances of 'float' and 'str'
warn("Unable to sort modes: {error}".format(error=e))
0 foo
1 NaN
dtype: object
Python 2
In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan])
In [3]: s.mode(False).sort_values().reset_index(drop=True)
Out[3]:
0 foo
1 NaN
dtype: object
Relevant source code:
pandas/pandas/core/algorithms.py
Lines 833 to 838 in b3f07b2
f = getattr(htable, "mode_{dtype}".format(dtype=ndtype)) | |
result = f(values) | |
try: | |
result = np.sort(result) | |
except TypeError as e: | |
warn("Unable to sort modes: {error}".format(error=e)) |
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -527,6 +527,7 @@ Other Enhancements | |||
- :func:`to_hdf` and :func:`read_hdf` now accept an ``errors`` keyword argument to control encoding error handling (:issue:`20835`) | |||
- :func:`cut` has gained the ``duplicates='raise'|'drop'`` option to control whether to raise on duplicated edges (:issue:`20947`) | |||
- :func:`date_range`, :func:`timedelta_range`, and :func:`interval_range` now return a linearly spaced index if ``start``, ``stop``, and ``periods`` are specified, but ``freq`` is not. (:issue:`20808`, :issue:`20983`, :issue:`20976`) | |||
- :func:`Series.mode` and :func:`DataFrame.mode` now support the ``dropna`` parameter which can be used to specify whether NaN/NaT values should be considered (:issue:`17534`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can move to 0.24
pandas/core/algorithms.py
Outdated
dropna : boolean, default True | ||
Don't consider counts of NaN/NaT. | ||
|
||
.. versionadded:: 0.23.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change all refs
]) | ||
def test_mode_empty(self, dropna, expected): | ||
s = Series([], dtype=np.float64) | ||
tm.assert_series_equal(s.mode(dropna), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use result=
data_single = [1] * 5 + [2] * 3 | ||
|
||
exp_multi = [1, 3] | ||
data_multi = [1] * 5 + [2] * 3 + [3] * 5 | ||
|
||
for dt in np.typecodes['AllInteger'] + np.typecodes['Float']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize this
Hello @reidy-p! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 31, 2018 at 10:39 Hours UTC |
41e4227
to
ee21f17
Compare
exp = Series(exp, dtype=dt) | ||
tm.assert_series_equal(s.mode(), exp) | ||
s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan]) | ||
result = s.mode(dropna).sort_values().reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing the sort/reset? are we always sorting internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the sort_values
and reset_index
is unnecessary in this case so I removed it.
It is necessary in the test_mode_sortwarning
test:
def test_mode_sortwarning(self):
# Check for the warning that is raised when the mode
# results cannot be sorted
expected = Series(['foo', np.nan])
s = Series([1, 'foo', 'foo', np.nan, np.nan])
with tm.assert_produces_warning(UserWarning, check_stacklevel=False):
result = s.mode(dropna=False)
result = result.sort_values().reset_index(drop=True)
tm.assert_series_equal(result, expected)
Here the result = s.mode(dropna=False)
is sometimes equal to Series(['foo', np.nan])
and other times it's Series([np.nan, 'foo'])
so I use sort_values
and reset_index
to ensure that result is always Series(['foo', np.nan])
s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan]) | ||
result = s.mode(dropna).sort_values().reset_index(drop=True) | ||
expected = Series(expected2, dtype=object) | ||
tm.assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move the mode tests to a separate test class, maybe TestMode (same file, you may not need to inherit TestData I think). just makes this more clear.
@reidy-p looks really good. |
@@ -2068,26 +2072,6 @@ def test_min_max(self): | |||
assert np.isnan(_min) | |||
assert _max == 1 | |||
|
|||
def test_mode(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted this test because I think it's all covered in test_mode_category
thanks @reidy-p ! |
git diff upstream/master -u -- "*.py" | flake8 --diff
There are other mode tests in
tests/test_algos.py
but I'm not sure if I should test thedropna
parameter in this file too. The tests seem very similar to those intests/series/test_analytics.py
.