Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement mode(dropna=False) #20779

Merged
merged 8 commits into from
May 31, 2018
Merged

Conversation

reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Apr 21, 2018

There are other mode tests in tests/test_algos.py but I'm not sure if I should test the dropna parameter in this file too. The tests seem very similar to those in tests/series/test_analytics.py.

"""
Returns the mode(s) of the Categorical.

Always returns `Categorical` even if only one value.

Parameters
----------
dropna : boolean, default True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add versionadded (anytime you add the param)

@@ -889,6 +890,40 @@ def test_mode(self):
dtype=df["C"].dtype)})
tm.assert_frame_equal(df.mode(), exp)

def test_mode_dropna(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally you can parameterize on dropna=[True, False]

(to_timedelta(['1 days', 'nan', 'nan']), to_timedelta(['NaT'])),
(to_timedelta(['1 days', 'nan']), to_timedelta(['nan', '1 days']))
])
def test_mode_dropna(self, values, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (you prob can simply compute the expected value based on whether you drop or not in the test)

@codecov
Copy link

codecov bot commented Apr 21, 2018

Codecov Report

Merging #20779 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20779      +/-   ##
==========================================
+ Coverage   91.84%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49540    49546       +6     
==========================================
+ Hits        45501    45509       +8     
+ Misses       4039     4037       -2
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.87% <22.22%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 94.12% <100%> (ø) ⬆️
pandas/core/frame.py 97.22% <100%> (ø) ⬆️
pandas/core/algorithms.py 94.81% <100%> (+0.31%) ⬆️
pandas/core/arrays/categorical.py 95.69% <100%> (+0.01%) ⬆️
pandas/core/sparse/array.py 91.38% <0%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7647969...6f14357. Read the comment docs.

@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 21, 2018
@reidy-p reidy-p force-pushed the mode_dropna branch 7 times, most recently from 90bc567 to b523d23 Compare May 2, 2018 19:55
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some questions on the results where you have a string nan being returned. otherwise looks good.

@@ -7124,10 +7128,6 @@ def quantile(self, q=0.5, axis=0, numeric_only=True,
a b
0.1 1.3 3.7
0.5 2.5 55.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you remove on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's a mistake. Thanks.


@pytest.mark.parametrize('dropna, expected1, expected2, expected3', [
(True, ['b'], ['bar'], ['nan']),
(False, ['b'], ['bar', np.nan], ['nan'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the expected string nan here? that's really odd

Copy link
Contributor Author

@reidy-p reidy-p May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because when Series(..., dtype=str) is specified in the test itself then np.nan gets converted to 'nan' and treated like any other string. In this case the dropna parameter has no effect. The same thing happens with value_counts():

In [2]: data = ['foo', 'bar', 'bar', np.nan, np.nan, np.nan]
In [3]: s = pd.Series(data, dtype=str)

In [4]: s.mode(dropna=True)
Out[4]:
0    nan                         
dtype: object

In [5]: s.mode(dropna=False)
Out[5]:
0    nan                 
dtype: object  

In [6]: s.value_counts(dropna=True)
Out[6]:
nan    3 
bar    2 
foo    1                                        
dtype: int64

In [7]: s.value_counts(dropna=False)
Out[7]:
nan    3 
bar    2 
foo    1                                        
dtype: int64

@pytest.mark.parametrize('dropna, expected1, expected2', [
(True, ['1900-05-03', '2011-01-03', '2013-01-02'],
['2011-01-03', '2013-01-02']),
(False, [np.nan], ['nan', '2011-01-03', '2013-01-02']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I think that because we have Series(..., dtype='M8[ns]') in the test itself the 'nan' gets converted into a NaT so it's not actually 'nan' that is expected but NaT. I can change the 'nan' to NaT in the parametrize part to make this clearer if necessary. There's the same issue in the timedelta test I think.

@reidy-p reidy-p force-pushed the mode_dropna branch 2 times, most recently from 6b41f60 to fc33475 Compare May 8, 2018 18:28
tm.assert_series_equal(Series(c).mode(), exp)
tm.assert_series_equal(s.mode(dropna), expected2)

@pytest.mark.skipif(not compat.PY3, reason="only PY3")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that in Python 3 a warning is raised if the resulting modes cannot be sorted but there is no warning in Python 2:

Python 3

In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan])                       
In [3]: s.mode(False).sort_values().reset_index(drop=True)
Out[3]:
/home/paul/pandas-reidy-p/pandas/core/algorithms.py:840: UserWarning: Unable                                    
 to sort modes: '<' not supported between instances of 'float' and 'str'  
  warn("Unable to sort modes: {error}".format(error=e))                                                             
0    foo
1    NaN
dtype: object  

Python 2

In [2]: s = pd.Series([1, 'foo', 'foo', np.nan, np.nan])                       
In [3]: s.mode(False).sort_values().reset_index(drop=True)
Out[3]:
0    foo
1    NaN
dtype: object  

Relevant source code:

f = getattr(htable, "mode_{dtype}".format(dtype=ndtype))
result = f(values)
try:
result = np.sort(result)
except TypeError as e:
warn("Unable to sort modes: {error}".format(error=e))

@@ -527,6 +527,7 @@ Other Enhancements
- :func:`to_hdf` and :func:`read_hdf` now accept an ``errors`` keyword argument to control encoding error handling (:issue:`20835`)
- :func:`cut` has gained the ``duplicates='raise'|'drop'`` option to control whether to raise on duplicated edges (:issue:`20947`)
- :func:`date_range`, :func:`timedelta_range`, and :func:`interval_range` now return a linearly spaced index if ``start``, ``stop``, and ``periods`` are specified, but ``freq`` is not. (:issue:`20808`, :issue:`20983`, :issue:`20976`)
- :func:`Series.mode` and :func:`DataFrame.mode` now support the ``dropna`` parameter which can be used to specify whether NaN/NaT values should be considered (:issue:`17534`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move to 0.24

dropna : boolean, default True
Don't consider counts of NaN/NaT.

.. versionadded:: 0.23.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change all refs

])
def test_mode_empty(self, dropna, expected):
s = Series([], dtype=np.float64)
tm.assert_series_equal(s.mode(dropna), expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use result=

data_single = [1] * 5 + [2] * 3

exp_multi = [1, 3]
data_multi = [1] * 5 + [2] * 3 + [3] * 5

for dt in np.typecodes['AllInteger'] + np.typecodes['Float']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this

@pep8speaks
Copy link

pep8speaks commented May 17, 2018

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 31, 2018 at 10:39 Hours UTC

@reidy-p reidy-p force-pushed the mode_dropna branch 3 times, most recently from 41e4227 to ee21f17 Compare May 17, 2018 21:45
exp = Series(exp, dtype=dt)
tm.assert_series_equal(s.mode(), exp)
s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan])
result = s.mode(dropna).sort_values().reset_index(drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing the sort/reset? are we always sorting internally?

Copy link
Contributor Author

@reidy-p reidy-p May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the sort_values and reset_index is unnecessary in this case so I removed it.

It is necessary in the test_mode_sortwarning test:

def test_mode_sortwarning(self):
# Check for the warning that is raised when the mode
# results cannot be sorted

expected = Series(['foo', np.nan])
s = Series([1, 'foo', 'foo', np.nan, np.nan])

with tm.assert_produces_warning(UserWarning, check_stacklevel=False):
    result = s.mode(dropna=False)
    result = result.sort_values().reset_index(drop=True)

tm.assert_series_equal(result, expected)

Here the result = s.mode(dropna=False) is sometimes equal to Series(['foo', np.nan]) and other times it's Series([np.nan, 'foo']) so I use sort_values and reset_index to ensure that result is always Series(['foo', np.nan])

s = Series([1, 'foo', 'foo', np.nan, np.nan, np.nan])
result = s.mode(dropna).sort_values().reset_index(drop=True)
expected = Series(expected2, dtype=object)
tm.assert_series_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the mode tests to a separate test class, maybe TestMode (same file, you may not need to inherit TestData I think). just makes this more clear.

@jreback jreback added this to the 0.24.0 milestone May 30, 2018
@jreback
Copy link
Contributor

jreback commented May 30, 2018

@reidy-p looks really good.

@@ -2068,26 +2072,6 @@ def test_min_max(self):
assert np.isnan(_min)
assert _max == 1

def test_mode(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted this test because I think it's all covered in test_mode_category

@jreback jreback merged commit f1631be into pandas-dev:master May 31, 2018
@jreback
Copy link
Contributor

jreback commented May 31, 2018

thanks @reidy-p !

@reidy-p reidy-p deleted the mode_dropna branch June 2, 2018 12:07
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: DataFrame.mode(dropna=False)
3 participants