DataFrame.nunique is incorrect for DataFrame with no columns #21959

Closed
@streamnsight

Description

(edit by @TomAugspurger)

Current output:

In [33]: pd.DataFrame(index=[0, 1]).nunique()
Out[33]:
Empty DataFrame
Columns: []
Index: [0, 1]

Expected output is an empty Series:

Out[34]: Series([], dtype: float64)

Not sure what the expected dtype of that Series should be... probably object.
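A minimal sketch of the check a regression test could make, assuming a pandas version in which this is fixed (on 0.23.3 the call returns an empty DataFrame, as shown above). The dtype is deliberately not asserted, since that question is still open here:

```python
import pandas as pd

# Frame with two rows and no columns, as in the example above.
df = pd.DataFrame(index=[0, 1])
result = df.nunique()

# Property a fix would need to satisfy: one entry per column,
# so a frame without columns reduces to an empty Series.
is_series = isinstance(result, pd.Series)
labels_match_columns = list(result.index) == list(df.columns)

print(type(result).__name__, len(result), labels_match_columns)
```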

original post below:


Code Sample, a copy-pastable example if possible

With pandas 0.20.3

import pandas as pd

# create a DataFrame with 3 rows
df = pd.DataFrame({'a': ['A','B','C']})

# lookup unique values for each column, excluding 'a'
unique = df.loc[:, (df.columns != 'a')].nunique()
# this results in an empty Series, the index is also empty
unique.index.tolist()
>>> []
# and
unique[unique == 1].index.tolist()
>>> []

With pandas 0.23.3

import pandas as pd

# create a DataFrame with 3 rows
df = pd.DataFrame({'a': ['A','B','C']})

# lookup unique values for each column, excluding 'a'
unique = df.loc[:, (df.columns != 'a')].nunique()
# this no longer results in an empty Series: the result carries the df's row index
unique.index.tolist()
>>> [0, 1, 2]
# and
unique[unique == 1].index.tolist()
>>> [0, 1, 2]

Note:

# if we don't have an empty df, the behavior of nunique() seems fine:
df = pd.DataFrame({'a': ['A','B','C'], 'b': [1,1,1]})
unique = df.loc[:, (df.columns != 'a')].nunique()

unique[unique == 1]
>>> b    1
>>> dtype: int64
# and
unique[unique == 1].index.tolist()
>>> ['b']

Problem description

The change in behavior is surprising and looks like a bug:
nunique() should return a Series indexed by the df's columns, but that is not the case here. Instead, the result picks up the row index of the df.

This is likely related to:

#21932
#21255

I am posting this because in my use case I pass the resulting list to drop columns, and I end up with column names that do not exist in the df.
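Until this is fixed, a workaround for that use case could look like the sketch below. `nunique_by_column` is a hypothetical helper, not a pandas API; it assumes unique column names and builds the Series from the columns directly, so a selection with no columns yields an empty Series with no labels:

```python
import pandas as pd


def nunique_by_column(df):
    """Count distinct values per column, keyed by column name.

    Hypothetical helper (not part of pandas); assumes unique column names.
    """
    counts = {col: df[col].nunique() for col in df.columns}
    # An explicit dtype keeps the empty case from depending on pandas defaults.
    return pd.Series(counts, dtype="int64")


df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': [1, 1, 1]})

# Same selection as in the report: every column except 'a'.
unique = nunique_by_column(df.loc[:, df.columns != 'a'])
constant_cols = unique[unique == 1].index.tolist()

# With only column 'a', the selection has no columns left.
only_a = pd.DataFrame({'a': ['A', 'B', 'C']})
empty = nunique_by_column(only_a.loc[:, only_a.columns != 'a'])
no_cols = empty[empty == 1].index.tolist()

# Dropping an empty list of columns leaves the frame unchanged.
remaining = only_a.drop(columns=no_cols)
```

Because the labels come from `df.columns`, the list passed to `drop` can only ever contain column names that exist in the frame.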

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: None
sphinx: 1.5.5
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Assignees

No one assigned

    Labels

    Indexing: Related to indexing on series/frames, not to indexes themselves
    Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
    Regression: Functionality that used to work in a prior pandas version
