BUG: drop_duplicates() doesn't work for object dtype series containing numpy nans #16632

Open
@ran404

Description

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan), 'text'])

# This doesn't blow up, thinks s1 and s2 are the same
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# This blows up
pdt.assert_series_equal(s1_unique, s2_unique)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-415-6908d74400cd> in <module>()
----> 1 pdt.assert_series_equal(s1_unique, s2_unique)

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1276         raise_assert_detail(obj, 'Series length are different',
   1277                             '{0}, {1}'.format(len(left), left.index),
-> 1278                             '{0}, {1}'.format(len(right), right.index))
   1279 
   1280     # index comparison

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1147         msg = msg + "\n[diff]: {diff}".format(diff=diff)
   1148 
-> 1149     raise AssertionError(msg)
   1150 
   1151 

AssertionError: Series are different

Series length are different
[left]:  2, Int64Index([0, 2], dtype='int64')
[right]: 3, Int64Index([0, 1, 2], dtype='int64')

Problem description

When dealing with mixed-dtype Series (which sometimes result from .T followed by a slice operation on a DataFrame), the behaviour of drop_duplicates() is very surprising: it does not treat repeated np.float64(np.nan) values as duplicates. I would expect the htable.duplicated_object(values) call to also work for mixed dtypes containing np.float64 NaN values.

drop_duplicates() does work for Python's builtin float NaN (np.nan is a plain Python float), however.
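One possible contributor to the difference, which this report does not confirm, is that np.nan is a single shared Python float object, whereas every np.float64(np.nan) call creates a new scalar that compares unequal to every other NaN. The sketch below uses only plain Python and NumPy (no pandas internals) to show that distinction; whether pandas' object hash table behaves the same way is an assumption.

```python
import numpy as np

# np.nan is one module-level Python float, so both names refer to the same object.
a, b = np.nan, np.nan
same_object = a is b                      # True

# Each np.float64(np.nan) call builds a fresh scalar object.
x, y = np.float64(np.nan), np.float64(np.nan)
distinct_objects = x is not y             # True
equal_by_value = bool(x == y)             # False, because NaN != NaN

# Python sets check identity before equality, so the shared np.nan
# collapses to one entry while the two distinct NaN scalars do not.
shared_nan_set_size = len({a, b})         # 1
distinct_nan_set_size = len({x, y})       # 2
```

Any deduplication that relies on object identity or == rather than an explicit missing-value check would be expected to show the same split.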

Expected Output

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan), 'text'])

# This passes: s1 and s2 are considered equal
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# The following assertions should all pass
assert len(s1_unique) == 2
assert len(s2_unique) == 2
pdt.assert_series_equal(s1_unique, s2_unique)
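Until drop_duplicates() behaves as expected, a possible workaround is to handle missing values explicitly instead of relying on how NaN objects compare. The helper below, drop_duplicates_nan_safe, is a hypothetical name introduced here and is not part of pandas; it is a minimal sketch that assumes "keep the first occurrence" semantics and uses dtype=object so the scalars are stored unchanged.

```python
import numpy as np
import pandas as pd


def drop_duplicates_nan_safe(s: pd.Series) -> pd.Series:
    """Drop duplicates, treating every missing value as the same value.

    Hypothetical helper (not a pandas API); keeps the first occurrence.
    """
    missing = s.isna().to_numpy()
    dup = np.zeros(len(s), dtype=bool)
    # Non-missing values: defer to the regular duplicated() logic.
    dup[~missing] = s[~missing].duplicated().to_numpy()
    # Missing values: mark every one after the first as a duplicate.
    missing_positions = np.flatnonzero(missing)
    dup[missing_positions[1:]] = True
    return s[~dup]


s1 = pd.Series([np.nan, np.nan, 'text'], dtype=object)
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan), 'text'], dtype=object)

s1_unique = drop_duplicates_nan_safe(s1)
s2_unique = drop_duplicates_nan_safe(s2)
```

Both results keep positions 0 and 2, regardless of which kind of NaN object the Series holds.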

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.2
pytest: 3.1.1
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0
