DataFrame.loc multiple columns replace #30439

Open
fansichao opened this issue Dec 24, 2019 · 15 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@fansichao

Python 3.6.8
pandas==0.25.3

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.DataFrame([
    {'a':'a1','b':'b1','c':'c1','d':'d1','e':'e1'},
    {'a':'a2','b':'b2','c':'c2','d':'d2','e':'e2'}
    ])
import copy
df2 = copy.deepcopy(df)

 

print(df)
#     a   b   c   d   e
# 0  a1  b1  c1  d1  e1
# 1  a2  b2  c2  d2  e2


df.loc[df['a']=='a2', ['c']] = df['e']
print(df)
#     a   b   c   d   e
# 0  a1  b1  c1  d1  e1
# 1  a2  b2  e2  d2  e2

# replacing multiple columns via .loc gives unexpected NaN values
df.loc[df['a']=='a2', ['b','c']] = df[['d','e']]
print(df)
#     a    b    c   d   e
# 0  a1   b1   c1  d1  e1
# 1  a2  NaN  NaN  d2  e2
@Liam3851
Contributor

Liam3851 commented Dec 31, 2019

You're using label-based indexing with .loc. Pandas therefore does not infer that you want to replace column b with column d and column c with column e; that would be positional logic.
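The label alignment described here can be reproduced outside of `.loc`. The following is a minimal sketch, assuming the two-row frame from the original report; `rhs` and `aligned` are illustrative names that do not appear in the thread. It is an analogy for the assignment, not a trace of the pandas internals: reindexing the `d`/`e` frame to the target labels `b`/`c` finds no matching columns, so every cell comes back missing.

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
    {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
])

rhs = df[['d', 'e']]

# Align by label: ask the right-hand side for columns named 'b' and 'c'.
# It has neither, so the result contains only missing values.
aligned = rhs.reindex(columns=['b', 'c'])

print(aligned.isna().all().all())  # True
```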

Either you can use .iloc for this use case, or else rename the columns before setting:

In [6]: df.loc[df['a']=='a2', ['b','c']] = df[['d','e']].rename(columns={'d':'b', 'e':'c'})

In [7]: df
Out[7]:
    a   b   c   d   e
0  a1  b1  c1  d1  e1
1  a2  d2  e2  d2  e2

Alternatively, you can assign the raw values using .to_numpy(); see the warning box at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics. In this case you need to do the alignment on the right-hand side yourself:

In [9]: df.loc[df['a']=='a2', ['b','c']] = df.loc[df['a'] == 'a2', ['d','e']].to_numpy()

In [10]: df
Out[10]:
    a   b   c   d   e
0  a1  b1  c1  d1  e1
1  a2  d2  e2  d2  e2
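The `.iloc` route mentioned above is not spelled out in the comment. One possible version is sketched below, assuming the same two-row frame. The name `rows` and the hard-coded column positions are illustrative choices, and the right-hand side is converted with `.to_numpy()` so that the example does not depend on how a given pandas version treats a DataFrame value under `.iloc`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
    {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
])

# Integer positions of the rows where column 'a' equals 'a2'.
rows = np.flatnonzero((df['a'] == 'a2').to_numpy())

# Columns b, c sit at positions 1, 2; columns d, e at positions 3, 4.
# Stripping the labels with .to_numpy() makes the assignment purely positional.
df.iloc[rows, [1, 2]] = df.iloc[rows, [3, 4]].to_numpy()

print(df)
```

Hard-coded positions break if the column order changes; `df.columns.get_indexer(['b', 'c'])` is one way to look them up by name.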

@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 25, 2020
@phofl
Member

phofl commented Nov 23, 2020

I agree with @Liam3851. The weird thing is that df.loc[df['a']=='a2', ['c']] = df['e'] works. This should also set NaN. But it will probably be fixed in the future with the split_path fix, @jbrockmendel? This currently runs through _setitem_single_block.

@jbrockmendel
Member

This should also set NaN.

@phofl can you elaborate on what behavior you expect?

I think I'd expect df.loc[df['a']=='a2', ['c']] = df['e'] to raise ValueError because of a shape mismatch.

@phofl
Member

phofl commented Nov 24, 2020

I would have expected that _align_series sets the value to NaN, similarly to

indexer = df['a'] == 'a2'
rhs = df[['e']]
df.loc[indexer, ['c']] = rhs

Why would you expect a shape mismatch?

@jbrockmendel
Member

Why would you expect a shape mismatch?

Well, df.loc[indexer, ["c"]] has shape (1, 1) and rhs has shape (2, 1) (just (2,) if we use df["e"] instead of df[["e"]]). To avoid a shape mismatch, I'd expect to see df.loc[indexer, ["c"]] = rhs.loc[indexer].
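The shapes quoted in this comment can be checked directly. This is a small sketch on the frame from the original report; the variable names ending in `_shape` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
    {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
])

indexer = df['a'] == 'a2'

lhs_shape = df.loc[indexer, ['c']].shape           # one matching row, one column
frame_rhs_shape = df[['e']].shape                  # every row, one column
series_rhs_shape = df['e'].shape                   # every row, one-dimensional
filtered_rhs_shape = df[['e']].loc[indexer].shape  # filtered like the left side

print(lhs_shape, frame_rhs_shape, series_rhs_shape, filtered_rhs_shape)
# (1, 1) (2, 1) (2,) (1, 1)
```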

@phofl
Member

phofl commented Nov 24, 2020

Ah, I see. My understanding of __setitem__ for loc was that it would take care of the filtered rows (e.g. is this what _align_series and _align_dataframe are doing?), so that the rhs is filtered for the same rows as the lhs. Or is this an accident that should change in the future?

@jbrockmendel
Member

It's entirely plausible that I'm wrong on this.

If, instead of df.loc[indexer, ["c"]] = df["e"], we did df.loc[indexer, ["c"]] = df["e"]._values, that does raise like I would expect. Would you expect the filtering to behave differently there?
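A hedged sketch of the raising case follows. It swaps the private `._values` attribute used in the comment for the public `.to_numpy()`, on the assumption that both hand `.loc` a bare NumPy array here. The thread reports that this assignment raises; the exact message may differ between pandas versions, so the example only records that a `ValueError` occurred.

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
    {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
])

indexer = df['a'] == 'a2'
raised = False

try:
    # A bare array carries no index, so it cannot be aligned to the single
    # selected row; two values are offered for one cell.
    df.loc[indexer, ['c']] = df['e'].to_numpy()
except ValueError as exc:
    raised = True
    print(type(exc).__name__, exc)
```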

@phofl
Member

phofl commented Nov 24, 2020

In the case of _values, none of the align functions is called, because we get a NumPy array.

I quite like the feature that you do not have to specify the same loc condition on the rhs, yet it gets filtered nevertheless. You could probably construct examples where this is a disadvantage instead of an advantage. But I think we should not break this code without warning if we decide to do this. I don't know if this was intended by the implementation or is only a side effect.

@Liam3851
Contributor

@jbrockmendel Isn't the difference that in df.loc[indexer, ["c"]] = df["e"] the rhs has an index that is compatible with the lhs (and so an align operation works to make the rhs compatible with the lhs)? In the case of df.loc[indexer, ["c"]] = df["e"]._values you have stripped the values out, losing the index, and thus making alignment impossible.

@phofl I think we need to differentiate:
df.loc[indexer, ["c"]] = df["e"]

from
df.loc[indexer, ["c"]] = df[["e"]]

In the first one, you are assigning a Series to a DataFrame. The Series is realigned to the df's index and then broadcast across all selected columns of the DataFrame.

For example, this is totally fine and assigns the values of column e to columns a and b:
df.loc[indexer, ["a", "b"]] = df["e"]

In the second, you are assigning a DataFrame to a DataFrame. Because the column labels do not match, the values are assigned as NaN.
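The two cases distinguished above can be put side by side. This is a minimal sketch under the behavior reported in this thread, which may change if the semantics are ever revised; `make_frame`, `by_series` and `by_frame` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def make_frame():
    # Hypothetical helper: rebuilds the frame from the original report.
    return pd.DataFrame([
        {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
        {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
    ])

# Series on the right: aligned on the row index, then broadcast over columns.
by_series = make_frame()
mask = by_series['a'] == 'a2'
by_series.loc[mask, ['a', 'b']] = by_series['e']

# DataFrame on the right: aligned on rows and on column labels.
# The label 'e' does not match 'c', so the selected cell becomes missing.
by_frame = make_frame()
mask = by_frame['a'] == 'a2'
by_frame.loc[mask, ['c']] = by_frame[['e']]

print(by_series)
print(by_frame)
```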

@phofl
Member

phofl commented Nov 24, 2020

Yep, you are right. I would expect

df.loc[indexer, "c"] = df["e"]

to work.

df.loc[indexer, ["c"]] = df["e"]

seems weird.

@Liam3851
Contributor

I agree

df.loc[df['a']=='a2', ['c']] = df['e']

looks a bit weird (you'd probably spell it the first way you gave above), but do you agree that

df.loc[df['a'] == 'a2', ['a', 'b']] = df['e']

makes sense as a broadcast (broadcasting the Series across the DataFrame)? If so then

df.loc[df['a']=='a2', ['c']] = df['e']

more or less has to be supported (or at least it would be weird not to support it).

@jbrockmendel
Member

I still find it counter-intuitive that both df.loc[indexer, ["c"]] = rhs and df.loc[indexer, ["c"]] = rhs.loc[indexer] would behave identically. We should figure out whether this is intentional or not, and @phofl is right: if it isn't intentional, it needs to be deprecated rather than removed immediately.
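The equivalence in question can be checked with a short sketch, assuming `rhs` is the Series `df["e"]` as in the earlier comments and that alignment behaves as reported in this thread. `make_frame`, `full` and `filtered` are hypothetical names used only here.

```python
import pandas as pd

def make_frame():
    # Hypothetical helper: rebuilds the frame from the original report.
    return pd.DataFrame([
        {'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1', 'e': 'e1'},
        {'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2', 'e': 'e2'},
    ])

# Unfiltered right-hand side: .loc aligns it to the selected rows.
full = make_frame()
indexer = full['a'] == 'a2'
full.loc[indexer, ['c']] = full['e']

# Right-hand side filtered explicitly with the same condition.
filtered = make_frame()
indexer = filtered['a'] == 'a2'
filtered.loc[indexer, ['c']] = filtered['e'].loc[indexer]

print(full.equals(filtered))
```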

@phofl
Member

phofl commented Nov 25, 2020

Yep, I would agree that it is counter-intuitive. I just thought this was a nice feature, if it is not a bug :)

@jbrockmendel
Member

@jorisvandenbossche can you weigh in on what the intended behavior is here? (No hurry.)

@phofl
Member

phofl commented Nov 29, 2020

related to #10440
