wide_to_long with NaNs in i argument #31976

MalteMax · 2020-02-14T15:49:45Z

Code and output:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id' : ['a', np.NaN, 'c'],
                   'A_1999' : [1, 2, 3],
                   'A_2000' : [5, 6, 7]})

df1
    id  A_1999  A_2000
0    a       1       5
1  NaN       2       6
2    c       3       7

df2 = pd.wide_to_long(df1,
                      stubnames = ['A'],
                      sep = '_',
                      i = 'id',
                      j = 'year').reset_index()

df2
  id  year  A
0  a  1999  1
1  a  1999  2
2  c  1999  3
3  a  2000  5
4  a  2000  6
5  c  2000  7

Problem description

df1 has one NaN in the id column. When applying wide_to_long to df1, wide_to_long seems to fill the NaNs from df1 with the preceding value from the df1 dataframe (namely, a).

Why might this be a problem: instances of id == 'a' in df1 are not the same as in df2 (because the NaNs from df1 have been replaced by a in df2):

df1.loc[df1['id'] == 'a',]

  id  A_1999  A_2000
0  a       1       5

versus:

df2.loc[df2['id'] == 'a',]

  id  year  A
0  a  1999  1
1  a  1999  2
3  a  2000  5
4  a  2000  6

Expected Output

I would expect wide_to_long to either

throw an error that the column that is passed to its i argument contains NaNs, or
preserve the NaN case in the second dataframe (not sure whether this is desirable), i.e.:

    id  year  A
0    a  1999  1
1  NaN  1999  2
2    c  1999  3
3    a  2000  5
4  NaN  2000  6
5    c  2000  7

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.4.1
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.43.1

The text was updated successfully, but these errors were encountered:

ljluestc · 2024-10-19T22:25:15Z


import numpy as np
import pandas as pd

# Create the original DataFrame with NaN
df1 = pd.DataFrame({'id': ['a', np.NaN, 'c'],
                    'A_1999': [1, 2, 3],
                    'A_2000': [5, 6, 7]})

# Store the original NaN ids for reference
nan_ids = df1[df1['id'].isna()]

# Apply wide_to_long transformation
df2 = pd.wide_to_long(df1,
                      stubnames=['A'],
                      sep='_',
                      i='id',
                      j='year').reset_index()

# Restore NaNs in the id column of df2 where they were originally in df1
for index, row in nan_ids.iterrows():
    # Find the corresponding rows in df2 to set id back to NaN
    df2.loc[(df2['year'] == 1999) & (df2['A'] == row['A_1999']), 'id'] = np.NaN
    df2.loc[(df2['year'] == 2000) & (df2['A'] == row['A_2000']), 'id'] = np.NaN

# Display the transformed DataFrame
print(df2)

MarcoGorelli added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Feb 18, 2020

mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wide_to_long with NaNs in i argument #31976

wide_to_long with NaNs in i argument #31976

MalteMax commented Feb 14, 2020

INSTALLED VERSIONS

ljluestc commented Oct 19, 2024

wide_to_long with NaNs in i argument #31976

wide_to_long with NaNs in i argument #31976

Comments

MalteMax commented Feb 14, 2020

Code and output:

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

ljluestc commented Oct 19, 2024

Output of `pd.show_versions()`