BUG pd.get_dummies does not propagate `NaN` correctly #15923

beckermr · 2017-04-06T15:54:55Z

Code Sample, a copy-pastable example if possible

In [17]: s
Out[17]: 
0      a
1      b
2      c
3    NaN
dtype: object
In [18]: pd.get_dummies(s)
Out[18]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  0

Problem description

The current implementation does not propagate NaN correctly. NaN means the data is missing. So if you turn missing data to all zeros when one-hot encoding, you are asserting that the proper label is none of the labels you have. In reality, the proper label could be one of the labels you have, you just do not know the proper label.

I realize people use the dummy_na option here, but if that option is not passed, then the output should have the NaNs put in the right spot.

Expected Output

It should propagate the NaNs.

In [18]: pd.get_dummies(s)
Out[18]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  NaN  NaN  NaN

Output of `pd.show_versions()`

In [19]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: 2.46.1
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-04-06T16:11:38Z

dummy_na : bool, default False
    Add a column to indicate NaNs, if False NaNs are ignored.

In [1]: s = Series(['a', 'b', 'c', np.nan])

In [2]: pd.get_dummies(s)
Out[2]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  0

In [3]: pd.get_dummies(s, dummy_na=True)
Out[3]: 
   a  b  c  NaN
0  1  0  0    0
1  0  1  0    0
2  0  0  1    0
3  0  0  0    1

I think this is pretty clear.

beckermr · 2017-04-06T16:59:32Z

Nope. I am well aware of the dummy_na option, as I stated above in my description. This bug is when this option is not used.

TomAugspurger · 2017-04-06T17:05:35Z

This bug is when this option is not used.

I wouldn't call it a bug though, it's behaving as intended and documented (whether or not that intention is "correct").

Do you have an example use-case where you want the NaNs to propagate? I've only ever used get_dummies as a pre-processing step for fitting a model, so I've already handled NaNs, or want the dummy_na=True behavior.

beckermr · 2017-04-06T17:19:27Z

I am happy with calling it an API change rather than a bug! :)

I would say two things:

The way the API is defined for get_dummies is not logically consistent with how NaN is used in basically all other parts of the pandas API. Internal and logical consistency of the pandas API is certainly enough of a reason in and of itself to motivate the change. There does not have to be a use case to desire API consistency. In fact, it is usually the opposite. There generally there has to be a compelling use case to demand API inconsistency.
The example I am thinking of is working on modeling problems where you want to explicitly handle missing labels in some way that does not involve using an extra dummy column or setting them to all zeros. This could include multilabel classification with some missing labels where you would not discard examples with incomplete label sets.

Right now, to do this, you have to force pandas to create the dummy column and then do the replacement afterwards. This is, of course, not a big deal. However, as stated in 1) above, the purity of the API is a much better motivation for this change.

jreback · 2017-04-06T17:23:49Z

how does scikit-learn handle missing values in LabelEncoder? (e.g. what is the policy)

beckermr · 2017-04-06T17:30:53Z

@jreback IMO, this is a separate issue that is not terribly pertinent to pandas API purity.

In any case, the standard label encoder throws a TypeError. Scikit-learn does have facilities for handling missing labels. See the label propagation routines here. They use -1 to mark missing labels under the assumption that all labels have to be non-negative integers.

In principle, the LabelEncoder could do the same thing, but given that support for multi-label classification is somewhat case-by-case, I think that throwing an error is probably better.

beckermr · 2017-04-06T17:33:46Z

I don't know if scikit-learn has an official policy for missing labels per se. However, pandas does have a policy for missing things IIUIC. Put in np.nan.

TomAugspurger · 2017-04-06T17:35:03Z

Would expanding dummy_na to take True / False / 'keep' (or propagate) work? With the third option being your desired output? I don't think this is worth breaking API for, but your use-case seems worth supporting.

beckermr · 2017-04-06T17:37:28Z

Sure that would work! I do feel the need to point out that this change would fix, not break the API, but this is all semantics.

TomAugspurger · 2017-04-06T17:39:24Z

Ok, if you could submit a PR allowing dummy_na to take a third value. Thanks.

jreback · 2017-04-06T18:02:40Z

@beckermr sure its pertinent. We don't generally want to have a completely opposite policy to downstream (assuming it makes sense).

jorisvandenbossche · 2017-04-06T22:10:56Z

There generally there has to be a compelling use case to demand API inconsistency.

The fact that pd.get_dummies is used a lot as preprocessing before scikit-learn (or other modelling tools), and such tools typically don't want NaNs, is IMO a good reason for the API

beckermr · 2017-04-06T22:18:30Z

The current API makes too many assumptions. Right now, the current API asserts that if a label is missing, it is NONE of the labels you have. This cannot possibly be true in all cases.

Instead, if NaNs were always propagated by default (even if an additional missing label column is added) then users would be required to reason about why their labels are missing and how they should fill nulls.

For example, if labels are missing at random and the rest of the data is representative, then filling the missing one hot encoded rows with the fraction of each label would be a pretty good way to fill nulls.

Of course pandas should not do this either since it should not be making assumptions about the input data.

jreback · 2017-04-06T22:23:26Z

acutally this flag should be dropna=True for compat with the rest of pandas (we can of course discuss changing it).

In fact, you could make it dropna=None as a default, then it would be easy to change.

jreback · 2017-04-06T22:24:17Z

so a natural operation would be:

df = ......
pd.get_dummies(df[column], dropna=False).fillna(-1)

is quite idiomatic.

beckermr · 2017-04-06T22:27:03Z

Hmmm dropna? The rows for the missing examples are not dropped, they are just all set to zero right now.

The closest thing I know of is fillna with values or something.

Happy to change it ofc.

beckermr · 2017-04-06T23:08:27Z

~~I found some example of dropna in the code base finally. This is very good suggestion. Let's go with that.~~

beckermr · 2017-04-06T23:10:16Z

Unfortunately this would have the weird side effect that some people would expect dropna=True to actually drop rows, which is weird for this transformation.

I think this fact illustrates the way the current API is actually inconsistent.

Let me know what you all would like to do.

beckermr changed the title ~~pd.get_dummies does not propagate NaN correctly~~ BUG pd.get_dummies does not propagate NaN correctly Apr 6, 2017

jreback closed this as completed Apr 6, 2017

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Apr 6, 2017

jreback added this to the No action milestone Apr 6, 2017

TomAugspurger reopened this Apr 6, 2017

jreback changed the title ~~BUG pd.get_dummies does not propagate NaN correctly~~ BUG pd.get_dummies does not propagate NaN correctly Apr 6, 2017

beckermr mentioned this issue Apr 6, 2017

ENH add fill_value feature to pd.get_dummies #15926

Closed

4 tasks

jreback modified the milestones: Next Major Release, No action Apr 7, 2017

jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 7, 2017

jreback removed the Usage Question label Apr 7, 2017

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

DM-Berger mentioned this issue Oct 4, 2024

DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968

Closed

1 task

rhshadrach added the Needs Discussion Requires discussion from core team before further action label Oct 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG pd.get_dummies does not propagate `NaN` correctly #15923

BUG pd.get_dummies does not propagate `NaN` correctly #15923

beckermr commented Apr 6, 2017

INSTALLED VERSIONS

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

TomAugspurger commented Apr 6, 2017

beckermr commented Apr 6, 2017

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

beckermr commented Apr 6, 2017 •

edited

Loading

TomAugspurger commented Apr 6, 2017

beckermr commented Apr 6, 2017

TomAugspurger commented Apr 6, 2017

jreback commented Apr 6, 2017

jorisvandenbossche commented Apr 6, 2017

beckermr commented Apr 6, 2017

jreback commented Apr 6, 2017 •

edited

Loading

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

beckermr commented Apr 6, 2017 •

edited

Loading

beckermr commented Apr 6, 2017

BUG pd.get_dummies does not propagate NaN correctly #15923

BUG pd.get_dummies does not propagate NaN correctly #15923

Comments

beckermr commented Apr 6, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

TomAugspurger commented Apr 6, 2017

beckermr commented Apr 6, 2017

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

beckermr commented Apr 6, 2017 • edited Loading

TomAugspurger commented Apr 6, 2017

beckermr commented Apr 6, 2017

TomAugspurger commented Apr 6, 2017

jreback commented Apr 6, 2017

jorisvandenbossche commented Apr 6, 2017

beckermr commented Apr 6, 2017

jreback commented Apr 6, 2017 • edited Loading

jreback commented Apr 6, 2017

beckermr commented Apr 6, 2017

beckermr commented Apr 6, 2017 • edited Loading

beckermr commented Apr 6, 2017

BUG pd.get_dummies does not propagate `NaN` correctly #15923

BUG pd.get_dummies does not propagate `NaN` correctly #15923

Output of `pd.show_versions()`

beckermr commented Apr 6, 2017 •

edited

Loading

jreback commented Apr 6, 2017 •

edited

Loading

beckermr commented Apr 6, 2017 •

edited

Loading