BUG: pd.get_dummies does not propagate NaN correctly #15923
Comments
I think this is pretty clear.
Nope. I am well aware of the dummy_na option, as I stated above in my description. This bug is about what happens when that option is not used.
I wouldn't call it a bug though, it's behaving as intended and documented (whether or not that intention is "correct"). Do you have an example use-case where you want the NaNs to propagate? I've only ever used
I am happy with calling it an API change rather than a bug! :) I would say two things:
Right now, to do this, you have to force pandas to create the dummy column and then do the replacement afterwards. This is, of course, not a big deal. However, as stated in 1) above, the purity of the API is a much better motivation for this change.
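One way to do that after-the-fact replacement with the existing API (a sketch only; the exact code the commenter had in mind is not shown in the thread, and the input series and variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# Build the dummy columns, then overwrite the rows whose label is missing.
dummies = pd.get_dummies(s).astype(float)  # today the NaN row comes out as all zeros
dummies[s.isnull()] = np.nan               # propagate the missingness by hand
print(dummies)
#      a    b
# 0  1.0  0.0
# 1  0.0  1.0
# 2  NaN  NaN
```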
how does scikit-learn handle missing values in
@jreback IMO, this is a separate issue that is not terribly pertinent to pandas API purity. In any case, the standard label encoder throws an error. In principle, the
I don't know if scikit-learn has an official policy for missing labels per se. However, pandas does have a policy for missing things IIUC: put NaN in, get NaN out.
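For example (a generic illustration of that propagation behaviour, not taken from the thread):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise operations keep the missing entry missing rather than
# silently turning it into a concrete value.
print(s + 1)                        # 2.0, NaN, 4.0
print(s.map({1.0: "x", 3.0: "y"}))  # "x", NaN, "y"
```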
Would expanding
Sure that would work! I do feel the need to point out that this change would fix, not break, the API, but this is all semantics.
Ok, if you could submit a PR allowing
@beckermr sure it's pertinent. We don't generally want to have a completely opposite policy to downstream (assuming it makes sense).
The fact that
The current API makes too many assumptions. Right now, the current API asserts that if a label is missing, it is NONE of the labels you have. This cannot possibly be true in all cases. Instead, if NaNs were always propagated by default (even if an additional missing-label column is added) then users would be required to reason about why their labels are missing and how they should fill nulls. For example, if labels are missing at random and the rest of the data is representative, then filling the missing one-hot-encoded rows with the fraction of each label would be a pretty good way to fill nulls. Of course pandas should not do this either, since it should not be making assumptions about the input data.
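A sketch of that frequency-based fill, assuming the dummies already have NaN propagated (e.g. via the replacement shown earlier); the input series and names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan])

dummies = pd.get_dummies(s).astype(float)
dummies[s.isnull()] = np.nan                # propagate the missing label first

# Fill each column of a missing row with the observed fraction of that label.
fractions = s.value_counts(normalize=True)  # a: 0.667, b: 0.333
filled = dummies.fillna(fractions)
print(filled)
#           a         b
# 0  1.000000  0.000000
# 1  0.000000  1.000000
# 2  1.000000  0.000000
# 3  0.666667  0.333333
```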
actually this flag should be dropna. In fact, you could make it
so a natural operation would be:
is quite idiomatic.
Hmmm dropna? The rows for the missing examples are not dropped, they are just all set to zero right now. The closest thing I know of is fillna with values or something. Happy to change it ofc.
Unfortunately this would have the weird side effect that some people would expect dropna=True to actually drop rows, which is odd for this transformation. I think this fact illustrates the way the current API is actually inconsistent. Let me know what you all would like to do.
Code Sample, a copy-pastable example if possible
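A minimal reproduction of the behaviour described below (the input series is illustrative, not necessarily the reporter's original example; output as in pandas 0.19.2):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# The row whose label is missing is currently encoded as all zeros.
print(pd.get_dummies(s))
#    a  b
# 0  1  0
# 1  0  1
# 2  0  0
```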
Problem description
The current implementation does not propagate NaN correctly. NaN means the data is missing. So if you turn missing data to all zeros when one-hot encoding, you are asserting that the proper label is none of the labels you have. In reality, the proper label could be one of the labels you have; you just do not know the proper label.

I realize people use the dummy_na option here, but if that option is not passed, then the output should have the NaNs put in the right spot.

Expected Output
It should propagate the NaNs.
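For the illustrative input above, that would mean something like the following instead of an all-zero row (assuming the missingness is propagated across every dummy column):

```
     a    b
0  1.0  0.0
1  0.0  1.0
2  NaN  NaN
```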
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: 2.46.1
pandas_datareader: None