Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG pd.get_dummies does not propagate NaN correctly #15923

Open
beckermr opened this issue Apr 6, 2017 · 18 comments
Open

BUG pd.get_dummies does not propagate NaN correctly #15923

beckermr opened this issue Apr 6, 2017 · 18 comments
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@beckermr
Copy link

beckermr commented Apr 6, 2017

Code Sample, a copy-pastable example if possible

In [17]: s
Out[17]: 
0      a
1      b
2      c
3    NaN
dtype: object
In [18]: pd.get_dummies(s)
Out[18]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  0

Problem description

The current implementation does not propagate NaN correctly. NaN means the data is missing. So if you turn missing data to all zeros when one-hot encoding, you are asserting that the proper label is none of the labels you have. In reality, the proper label could be one of the labels you have, you just do not know the proper label.

I realize people use the dummy_na option here, but if that option is not passed, then the output should have the NaNs put in the right spot.

Expected Output

It should propagate the NaNs.

In [18]: pd.get_dummies(s)
Out[18]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  NaN  NaN  NaN

Output of pd.show_versions()

In [19]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: 2.46.1
pandas_datareader: None

@beckermr beckermr changed the title pd.get_dummies does not propagate NaN correctly BUG pd.get_dummies does not propagate NaN correctly Apr 6, 2017
@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

dummy_na : bool, default False
    Add a column to indicate NaNs, if False NaNs are ignored.
In [1]: s = Series(['a', 'b', 'c', np.nan])

In [2]: pd.get_dummies(s)
Out[2]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  0

In [3]: pd.get_dummies(s, dummy_na=True)
Out[3]: 
   a  b  c  NaN
0  1  0  0    0
1  0  1  0    0
2  0  0  1    0
3  0  0  0    1

I think this is pretty clear.

@jreback jreback closed this as completed Apr 6, 2017
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Apr 6, 2017
@jreback jreback added this to the No action milestone Apr 6, 2017
@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

Nope. I am well aware of the dummy_na option, as I stated above in my description. This bug is when this option is not used.

@TomAugspurger
Copy link
Contributor

This bug is when this option is not used.

I wouldn't call it a bug though, it's behaving as intended and documented (whether or not that intention is "correct").

Do you have an example use-case where you want the NaNs to propagate? I've only ever used get_dummies as a pre-processing step for fitting a model, so I've already handled NaNs, or want the dummy_na=True behavior.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

I am happy with calling it an API change rather than a bug! :)

I would say two things:

  1. The way the API is defined for get_dummies is not logically consistent with how NaN is used in basically all other parts of the pandas API. Internal and logical consistency of the pandas API is certainly enough of a reason in and of itself to motivate the change. There does not have to be a use case to desire API consistency. In fact, it is usually the opposite. There generally there has to be a compelling use case to demand API inconsistency.

  2. The example I am thinking of is working on modeling problems where you want to explicitly handle missing labels in some way that does not involve using an extra dummy column or setting them to all zeros. This could include multilabel classification with some missing labels where you would not discard examples with incomplete label sets.

Right now, to do this, you have to force pandas to create the dummy column and then do the replacement afterwards. This is, of course, not a big deal. However, as stated in 1) above, the purity of the API is a much better motivation for this change.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

how does scikit-learn handle missing values in LabelEncoder? (e.g. what is the policy)

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

@jreback IMO, this is a separate issue that is not terribly pertinent to pandas API purity.

In any case, the standard label encoder throws a TypeError. Scikit-learn does have facilities for handling missing labels. See the label propagation routines here. They use -1 to mark missing labels under the assumption that all labels have to be non-negative integers.

In principle, the LabelEncoder could do the same thing, but given that support for multi-label classification is somewhat case-by-case, I think that throwing an error is probably better.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

I don't know if scikit-learn has an official policy for missing labels per se. However, pandas does have a policy for missing things IIUIC. Put in np.nan.

@TomAugspurger
Copy link
Contributor

Would expanding dummy_na to take True / False / 'keep' (or propagate) work? With the third option being your desired output? I don't think this is worth breaking API for, but your use-case seems worth supporting.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

Sure that would work! I do feel the need to point out that this change would fix, not break the API, but this is all semantics.

@TomAugspurger TomAugspurger reopened this Apr 6, 2017
@TomAugspurger
Copy link
Contributor

Ok, if you could submit a PR allowing dummy_na to take a third value. Thanks.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

@beckermr sure its pertinent. We don't generally want to have a completely opposite policy to downstream (assuming it makes sense).

@jreback jreback changed the title BUG pd.get_dummies does not propagate NaN correctly BUG pd.get_dummies does not propagate NaN correctly Apr 6, 2017
@jorisvandenbossche
Copy link
Member

There generally there has to be a compelling use case to demand API inconsistency.

The fact that pd.get_dummies is used a lot as preprocessing before scikit-learn (or other modelling tools), and such tools typically don't want NaNs, is IMO a good reason for the API

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

The current API makes too many assumptions. Right now, the current API asserts that if a label is missing, it is NONE of the labels you have. This cannot possibly be true in all cases.

Instead, if NaNs were always propagated by default (even if an additional missing label column is added) then users would be required to reason about why their labels are missing and how they should fill nulls.

For example, if labels are missing at random and the rest of the data is representative, then filling the missing one hot encoded rows with the fraction of each label would be a pretty good way to fill nulls.

Of course pandas should not do this either since it should not be making assumptions about the input data.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

acutally this flag should be dropna=True for compat with the rest of pandas (we can of course discuss changing it).

In fact, you could make it dropna=None as a default, then it would be easy to change.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

so a natural operation would be:

df = ......
pd.get_dummies(df[column], dropna=False).fillna(-1)

is quite idiomatic.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

Hmmm dropna? The rows for the missing examples are not dropped, they are just all set to zero right now.

The closest thing I know of is fillna with values or something.

Happy to change it ofc.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

I found some example of dropna in the code base finally. This is very good suggestion. Let's go with that.

@beckermr
Copy link
Author

beckermr commented Apr 6, 2017

Unfortunately this would have the weird side effect that some people would expect dropna=True to actually drop rows, which is weird for this transformation.

I think this fact illustrates the way the current API is actually inconsistent.

Let me know what you all would like to do.

@jreback jreback modified the milestones: Next Major Release, No action Apr 7, 2017
@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 7, 2017
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Oct 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants