CategoricalDtype can be lost if addressing several values #349

@hagenw

This is similar to #324, but the underlying problem seems to be a pandas issue.

Let's start with a filewise and a segmented table, both using a 'spk' scheme with the labels 'a' and 'b', and both containing two entries labeled as 'a'.

import audformat


db = audformat.Database('db')

db.schemes['spk'] = audformat.Scheme('str', labels=['a', 'b'])
index = audformat.filewise_index(['f1', 'f2'])
db['files'] = audformat.Table(index)
db['files']['spk'] = audformat.Column(scheme_id='spk')
db['files']['spk'].set(['a', 'a'])

db.schemes['label'] = audformat.Scheme('int')
index = audformat.segmented_index(['f1', 'f1'], [0, 1], [1, 2])
db['segments'] = audformat.Table(index)
db['segments']['spk'] = audformat.Column(scheme_id='spk')
db['segments']['spk'].set(['a', 'a'])

The following behaves as expected:

>>> df = db['files'].get()
>>> df.spk.cat.categories
Index(['a', 'b'], dtype='object')
>>> df.loc['f1', 'spk'] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df.iloc[0, 0] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df = db['segments'].get()
>>> df.spk.cat.categories
Index(['a', 'b'], dtype='object')
>>> df.loc[audformat.segmented_index(['f1'], [0], [1]), 'spk'] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df.iloc[0, 0] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first

However, assigning to several values at once still lets us set a forbidden label, and the column silently loses its CategoricalDtype:

>>> df = db['files'].get()
>>> df.loc[:, 'spk'] = 'c'
>>> df.spk.cat.categories
...
AttributeError: Can only use .cat accessor with a 'category' dtype
>>> df
     spk
file    
f1     c
f2     c
>>> df = db['segments'].get()
>>> df.loc[:, 'spk'] = 'c'
>>> df.spk.cat.categories
...
AttributeError: Can only use .cat accessor with a 'category' dtype
>>> df
                                     spk
file start           end                
f1   0 days 00:00:00 0 days 00:00:01   c
     0 days 00:00:01 0 days 00:00:02   c
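
To check whether the behaviour above comes from pandas rather than audformat, the same assignment can be probed on a plain DataFrame. This is only a sketch: the helper name `probe_whole_column_assignment` is made up for illustration, and the outcome is expected to depend on the installed pandas version, so no particular result is claimed here.

```python
import pandas as pd


def probe_whole_column_assignment(value="c"):
    """Hypothetical probe: report what `df.loc[:, 'spk'] = value` does.

    Returns 'raised' if pandas rejects the assignment,
    'kept' if the column is still categorical afterwards,
    and 'lost' if the categorical dtype was dropped.
    """
    df = pd.DataFrame(
        {"spk": pd.Categorical(["a", "a"], categories=["a", "b"])},
        index=pd.Index(["f1", "f2"], name="file"),
    )
    try:
        # Same whole-column assignment as in the report above
        df.loc[:, "spk"] = value
    except (TypeError, ValueError):
        return "raised"
    if isinstance(df["spk"].dtype, pd.CategoricalDtype):
        return "kept"
    return "lost"


print(pd.__version__, probe_whole_column_assignment())
```

A result of 'lost' would reproduce the problem described here without any audformat code involved.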

I'm not sure yet if this is considered a feature or a bug in pandas.

There is no upstream issue that matches directly, but there are related ones: pandas-dev/pandas#46820, pandas-dev/pandas#40080
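
Until this is clarified upstream, one possible user-side guard is to validate the value against the column's categories and assign a new Categorical with the original dtype. The following is a minimal sketch under that assumption; `set_column_checked` is a hypothetical helper and not part of audformat or pandas.

```python
import pandas as pd


def set_column_checked(df, column, value):
    """Hypothetical helper: set a whole column to `value`, keeping its dtype.

    Raises ValueError if the column is categorical and `value`
    is not one of its categories.
    """
    dtype = df[column].dtype
    if isinstance(dtype, pd.CategoricalDtype):
        if value not in dtype.categories:
            raise ValueError(
                f"'{value}' is not in categories {list(dtype.categories)}"
            )
        # Build the new values with the original dtype so it is preserved
        df[column] = pd.Categorical([value] * len(df), dtype=dtype)
    else:
        df[column] = value


df = pd.DataFrame(
    {"spk": pd.Categorical(["a", "a"], categories=["a", "b"])},
    index=pd.Index(["f1", "f2"], name="file"),
)
set_column_checked(df, "spk", "b")   # allowed label, dtype stays categorical
try:
    set_column_checked(df, "spk", "c")
except ValueError as error:
    print(error)
```

The same helper should work on a frame returned by `db['files'].get()`, since it only relies on the column having a categorical dtype.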
