Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wide_to_long with NaNs in i argument #31976

Open
MalteMax opened this issue Feb 14, 2020 · 1 comment
Open

wide_to_long with NaNs in i argument #31976

MalteMax opened this issue Feb 14, 2020 · 1 comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@MalteMax
Copy link

Code and output:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id' : ['a', np.NaN, 'c'],
                   'A_1999' : [1, 2, 3],
                   'A_2000' : [5, 6, 7]})

df1
    id  A_1999  A_2000
0    a       1       5
1  NaN       2       6
2    c       3       7

df2 = pd.wide_to_long(df1,
                      stubnames = ['A'],
                      sep = '_',
                      i = 'id',
                      j = 'year').reset_index()

df2
  id  year  A
0  a  1999  1
1  a  1999  2
2  c  1999  3
3  a  2000  5
4  a  2000  6
5  c  2000  7

Problem description

df1 has one NaN in the id column. When applying wide_to_long to df1, wide_to_long seems to fill the NaNs from df1 with the preceding value from the df1 dataframe (namely, a).

Why might this be a problem: instances of id == 'a' in df1 are not the same as in df2 (because the NaNs from df1 have been replaced by a in df2):

df1.loc[df1['id'] == 'a',]

  id  A_1999  A_2000
0  a       1       5

versus:

df2.loc[df2['id'] == 'a',]

  id  year  A
0  a  1999  1
1  a  1999  2
3  a  2000  5
4  a  2000  6

Expected Output

I would expect wide_to_long to either

  1. throw an error that the column that is passed to its i argument contains NaNs, or
  2. preserve the NaN case in the second dataframe (not sure whether this is desirable), i.e.:
    id  year  A
0    a  1999  1
1  NaN  1999  2
2    c  1999  3
3    a  2000  5
4  NaN  2000  6
5    c  2000  7

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.4.1
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.43.1

@MarcoGorelli MarcoGorelli added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Feb 18, 2020
@mroeschke mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 28, 2021
@ljluestc
Copy link


import numpy as np
import pandas as pd

# Create the original DataFrame with NaN
df1 = pd.DataFrame({'id': ['a', np.NaN, 'c'],
                    'A_1999': [1, 2, 3],
                    'A_2000': [5, 6, 7]})

# Store the original NaN ids for reference
nan_ids = df1[df1['id'].isna()]

# Apply wide_to_long transformation
df2 = pd.wide_to_long(df1,
                      stubnames=['A'],
                      sep='_',
                      i='id',
                      j='year').reset_index()

# Restore NaNs in the id column of df2 where they were originally in df1
for index, row in nan_ids.iterrows():
    # Find the corresponding rows in df2 to set id back to NaN
    df2.loc[(df2['year'] == 1999) & (df2['A'] == row['A_1999']), 'id'] = np.NaN
    df2.loc[(df2['year'] == 2000) & (df2['A'] == row['A_2000']), 'id'] = np.NaN

# Display the transformed DataFrame
print(df2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

4 participants