ENH: Implement mode(dropna=False) #20779

reidy-p · 2018-04-21T17:15:28Z

closes Feature request: DataFrame.mode(dropna=False) #17534
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

There are other mode tests in tests/test_algos.py but I'm not sure if I should test the dropna parameter in this file too. The tests seem very similar to those in tests/series/test_analytics.py.

jreback · 2018-04-21T21:53:55Z

pandas/core/arrays/categorical.py

        """
        Returns the mode(s) of the Categorical.

        Always returns `Categorical` even if only one value.

+        Parameters
+        ----------
+        dropna : boolean, default True


add versionadded (anytime you add the param)

jreback · 2018-04-21T21:54:43Z

pandas/tests/frame/test_analytics.py

@@ -889,6 +890,40 @@ def test_mode(self):
                                           dtype=df["C"].dtype)})
        tm.assert_frame_equal(df.mode(), exp)

+    def test_mode_dropna(self):


ideally you can parameterize on dropna=[True, False]

jreback · 2018-04-21T21:54:53Z

pandas/tests/series/test_analytics.py

+        (to_timedelta(['1 days', 'nan', 'nan']), to_timedelta(['NaT'])),
+        (to_timedelta(['1 days', 'nan']), to_timedelta(['nan', '1 days']))
+    ])
+    def test_mode_dropna(self, values, expected):


same here (you prob can simply compute the expected value based on whether you drop or not in the test)

codecov · 2018-04-21T21:55:24Z

Codecov Report

Merging #20779 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20779      +/-   ##
==========================================
+ Coverage   91.84%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49540    49546       +6     
==========================================
+ Hits        45501    45509       +8     
+ Misses       4039     4037       -2

Flag	Coverage Δ
#multiple	`90.25% <100%> (ø)`	⬆️
#single	`41.87% <22.22%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.12% <100%> (ø)`	⬆️
pandas/core/frame.py	`97.22% <100%> (ø)`	⬆️
pandas/core/algorithms.py	`94.81% <100%> (+0.31%)`	⬆️
pandas/core/arrays/categorical.py	`95.69% <100%> (+0.01%)`	⬆️
pandas/core/sparse/array.py	`91.38% <0%> (+0.06%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7647969...6f14357. Read the comment docs.

jreback

some questions on the results where you have a string nan being returned. otherwise looks good.

jreback · 2018-05-04T10:31:37Z

pandas/core/frame.py

@@ -7124,10 +7128,6 @@ def quantile(self, q=0.5, axis=0, numeric_only=True,
               a     b
        0.1  1.3   3.7
        0.5  2.5  55.0
-


did you remove on purpose?

No that's a mistake. Thanks.

jreback · 2018-05-04T10:33:31Z

pandas/tests/series/test_analytics.py

+
+    @pytest.mark.parametrize('dropna, expected1, expected2, expected3', [
+        (True, ['b'], ['bar'], ['nan']),
+        (False, ['b'], ['bar', np.nan], ['nan'])


why is the expected string nan here? that's really odd

I think it's because when Series(..., dtype=str) is specified in the test itself then np.nan gets converted to 'nan' and treated like any other string. In this case the dropna parameter has no effect. The same thing happens with value_counts():

In [2]: data = ['foo', 'bar', 'bar', np.nan, np.nan, np.nan] In [3]: s = pd.Series(data, dtype=str) In [4]: s.mode(dropna=True) Out[4]: 0 nan dtype: object In [5]: s.mode(dropna=False) Out[5]: 0 nan dtype: object In [6]: s.value_counts(dropna=True) Out[6]: nan 3 bar 2 foo 1 dtype: int64 In [7]: s.value_counts(dropna=False) Out[7]: nan 3 bar 2 foo 1 dtype: int64

jreback · 2018-05-04T10:33:43Z

pandas/tests/series/test_analytics.py

+    @pytest.mark.parametrize('dropna, expected1, expected2', [
+        (True, ['1900-05-03', '2011-01-03', '2013-01-02'],
+               ['2011-01-03', '2013-01-02']),
+        (False, [np.nan], ['nan', '2011-01-03', '2013-01-02']),


In this case I think that because we have Series(..., dtype='M8[ns]') in the test itself the 'nan' gets converted into a NaT so it's not actually 'nan' that is expected but NaT. I can change the 'nan' to NaT in the parametrize part to make this clearer if necessary. There's the same issue in the timedelta test I think.

reidy-p · 2018-05-08T21:45:28Z

pandas/tests/series/test_analytics.py

-        tm.assert_series_equal(Series(c).mode(), exp)
+        tm.assert_series_equal(s.mode(dropna), expected2)
+
+    @pytest.mark.skipif(not compat.PY3, reason="only PY3")


It seems that in Python 3 a warning is raised if the resulting modes cannot be sorted but there is no warning in Python 2:

Python 3

In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan]) In [3]: s.mode(False).sort_values().reset_index(drop=True) Out[3]: /home/paul/pandas-reidy-p/pandas/core/algorithms.py:840: UserWarning: Unable to sort modes: '<' not supported between instances of 'float' and 'str' warn("Unable to sort modes: {error}".format(error=e)) 0 foo 1 NaN dtype: object

Python 2

In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan]) In [3]: s.mode(False).sort_values().reset_index(drop=True) Out[3]: 0 foo 1 NaN dtype: object

Relevant source code:

pandas/pandas/core/algorithms.py

Lines 833 to 838 in b3f07b2

f = getattr(htable, "mode_{dtype}".format(dtype=ndtype))

result = f(values)

try:

result = np.sort(result)

except TypeError as e:

warn("Unable to sort modes: {error}".format(error=e))

jreback · 2018-05-17T10:17:25Z

doc/source/whatsnew/v0.23.0.txt

@@ -527,6 +527,7 @@ Other Enhancements
 - :func:`to_hdf` and :func:`read_hdf` now accept an ``errors`` keyword argument to control encoding error handling (:issue:`20835`)
 - :func:`cut` has gained the ``duplicates='raise'|'drop'`` option to control whether to raise on duplicated edges (:issue:`20947`)
 - :func:`date_range`, :func:`timedelta_range`, and :func:`interval_range` now return a linearly spaced index if ``start``, ``stop``, and ``periods`` are specified, but ``freq`` is not. (:issue:`20808`, :issue:`20983`, :issue:`20976`)
+- :func:`Series.mode` and :func:`DataFrame.mode` now support the ``dropna`` parameter which can be used to specify whether NaN/NaT values should be considered (:issue:`17534`)


can move to 0.24

jreback · 2018-05-17T10:17:38Z

pandas/core/algorithms.py

+    dropna : boolean, default True
+        Don't consider counts of NaN/NaT.
+
+        .. versionadded:: 0.23.0


change all refs

jreback · 2018-05-17T10:19:02Z

pandas/tests/series/test_analytics.py

+    ])
+    def test_mode_empty(self, dropna, expected):
+        s = Series([], dtype=np.float64)
+        tm.assert_series_equal(s.mode(dropna), expected)


use result=

jreback · 2018-05-17T10:19:15Z

pandas/tests/series/test_analytics.py

        data_single = [1] * 5 + [2] * 3
-
-        exp_multi = [1, 3]
        data_multi = [1] * 5 + [2] * 3 + [3] * 5

        for dt in np.typecodes['AllInteger'] + np.typecodes['Float']:


can you parameterize this

pep8speaks · 2018-05-17T21:29:14Z

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 31, 2018 at 10:39 Hours UTC

jreback · 2018-05-30T10:58:54Z

pandas/tests/series/test_analytics.py

-            exp = Series(exp, dtype=dt)
-            tm.assert_series_equal(s.mode(), exp)
+        s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan])
+        result = s.mode(dropna).sort_values().reset_index(drop=True)


why are you doing the sort/reset? are we always sorting internally?

Yeah the sort_values and reset_index is unnecessary in this case so I removed it.

It is necessary in the test_mode_sortwarning test:

def test_mode_sortwarning(self): # Check for the warning that is raised when the mode # results cannot be sorted expected = Series(['foo', np.nan]) s = Series([1, 'foo', 'foo', np.nan, np.nan]) with tm.assert_produces_warning(UserWarning, check_stacklevel=False): result = s.mode(dropna=False) result = result.sort_values().reset_index(drop=True) tm.assert_series_equal(result, expected)

Here the result = s.mode(dropna=False) is sometimes equal to Series(['foo', np.nan]) and other times it's Series([np.nan, 'foo']) so I use sort_values and reset_index to ensure that result is always Series(['foo', np.nan])

jreback · 2018-05-30T11:03:18Z

pandas/tests/series/test_analytics.py

+        s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan])
+        result = s.mode(dropna).sort_values().reset_index(drop=True)
+        expected = Series(expected2, dtype=object)
+        tm.assert_series_equal(result, expected)


can you move the mode tests to a separate test class, maybe TestMode (same file, you may not need to inherit TestData I think). just makes this more clear.

jreback · 2018-05-30T11:06:04Z

@reidy-p looks really good.

…eset_index

reidy-p · 2018-05-30T19:25:57Z

pandas/tests/series/test_analytics.py

@@ -2068,26 +2072,6 @@ def test_min_max(self):
        assert np.isnan(_min)
        assert _max == 1

-    def test_mode(self):


I deleted this test because I think it's all covered in test_mode_category

jreback · 2018-05-31T21:44:34Z

thanks @reidy-p !

reidy-p force-pushed the mode_dropna branch from f2f5208 to 7796dff Compare April 21, 2018 21:55

jreback requested changes Apr 21, 2018

View reviewed changes

jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 21, 2018

reidy-p force-pushed the mode_dropna branch 7 times, most recently from 90bc567 to b523d23 Compare May 2, 2018 19:55

reidy-p force-pushed the mode_dropna branch from b523d23 to 1081716 Compare May 3, 2018 19:58

jreback requested changes May 4, 2018

View reviewed changes

reidy-p force-pushed the mode_dropna branch 2 times, most recently from 6b41f60 to fc33475 Compare May 8, 2018 18:28

reidy-p commented May 8, 2018

View reviewed changes

reidy-p force-pushed the mode_dropna branch from 6da571d to 1bd0bf2 Compare May 13, 2018 21:04

jreback requested changes May 17, 2018

View reviewed changes

reidy-p force-pushed the mode_dropna branch from 1bd0bf2 to 339052f Compare May 17, 2018 21:29

reidy-p force-pushed the mode_dropna branch 3 times, most recently from 41e4227 to ee21f17 Compare May 17, 2018 21:45

reidy-p added 6 commits May 29, 2018 20:43

ENH: Implement mode(dropna=False)

3b50aec

clean up and parametrize mode tests and add versionadded

c4270a5

Test separately for warning and fix typos

bce740a

Only check for warning with python 3

1fc4893

Test for sort warning in DataFrame and clean up sort test for Series

0399822

change to 0.24 and parametrize series tests

7b59312

reidy-p force-pushed the mode_dropna branch from ee21f17 to 7b59312 Compare May 29, 2018 19:58

jreback requested changes May 30, 2018

View reviewed changes

jreback added this to the 0.24.0 milestone May 30, 2018

Put series mode tests in class and remove uncessary sort_values and r…

7b52c37

…eset_index

reidy-p commented May 30, 2018

View reviewed changes

Merge branch 'master' into PR_TOOL_MERGE_PR_20779

6f14357

jreback approved these changes May 31, 2018

View reviewed changes

jreback merged commit f1631be into pandas-dev:master May 31, 2018

reidy-p deleted the mode_dropna branch June 2, 2018 12:07

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

ENH: Implement mode(dropna=False) (pandas-dev#20779)

27b4476

jorisvandenbossche mentioned this pull request May 30, 2023

PDEP-11: Change default of dropna to False #53094

Closed

	f = getattr(htable, "mode_{dtype}".format(dtype=ndtype))
	result = f(values)
	try:
	result = np.sort(result)
	except TypeError as e:
	warn("Unable to sort modes: {error}".format(error=e))

Uh oh!

ENH: Implement mode(dropna=False) #20779

ENH: Implement mode(dropna=False) #20779

Uh oh!

Conversation

reidy-p commented Apr 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

reidy-p May 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pep8speaks commented May 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated on May 31, 2018 at 10:39 Hours UTC

Uh oh!

Choose a reason for hiding this comment

Uh oh!

reidy-p May 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented May 30, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented May 31, 2018

Uh oh!

Uh oh!

reidy-p commented Apr 21, 2018 •

edited

Loading

codecov bot commented Apr 21, 2018 •

edited

Loading

reidy-p May 4, 2018 •

edited

Loading

pep8speaks commented May 17, 2018 •

edited

Loading

reidy-p May 30, 2018 •

edited

Loading