Wrong behaviour of None/np.nan in 'category' type columns #21239

Closed
@topolskib

Code Sample

import pandas as pd

df = pd.DataFrame({'a': ['asd', None, 12, 'asd', 'cde']}, dtype='category')
print(df['a'].apply(lambda x: x=='cde'))
0    False
1     True
2    False
3    False
4     True
Name: a, dtype: object

Problem description

None (or np.nan) is not handled correctly by the function passed to apply. From what I understand, apply first maps the categories to a list of new values (here, booleans indicating whether each category equals 'cde') and then uses each categorical code as an index into that list. However, df['a'].cat.codes shows that the code for None is -1, so the lookup returns the last element of the new list, which in this case is True.
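
To illustrate the suspected mechanism, here is a plain-Python sketch of the lookup described above. It is not the pandas implementation: the `categories` and `codes` lists are hypothetical stand-ins for `df['a'].cat.categories` and `df['a'].cat.codes`, and the category order is an assumption (the real order may differ, as long as 'cde' comes last).

```python
# Hypothetical stand-ins for df['a'].cat.categories and df['a'].cat.codes.
# Assumption: 'cde' is the last category; -1 marks the missing value.
categories = ['asd', 12, 'cde']
codes = [0, -1, 1, 0, 2]

# Apply the function once per category rather than once per row.
mapped = [c == 'cde' for c in categories]

# Naive lookup: a code of -1 wraps around to the last mapped value.
naive = [mapped[code] for code in codes]

# Guarded lookup: treat -1 as missing instead of using it as an index.
guarded = [None if code == -1 else mapped[code] for code in codes]

print(naive)    # [False, True, False, False, True]
print(guarded)  # [False, None, False, False, True]
```

The naive lookup reproduces the output shown in the code sample, while the guarded lookup matches the expected output below. It only shows the kind of check that would avoid the wrap-around, not how pandas should implement it.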

Expected Output

0    False
1     None
2    False
3    False
4     True
Name: a, dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.4.1
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.7.1
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.6
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
