BUG: Setting column on empty DataFrame with loc / avoiding SettingWithCopyWarning for potentially empty DataFrames/copies/views #41891

Closed
3 tasks done
klieret opened this issue Jun 9, 2021 · 6 comments · Fixed by #56614
Labels
Bug; Copy / view semantics; Indexing (Related to indexing on series/frames, not to indexes themselves); Needs Discussion (Requires discussion from core team before further action); Warnings (Warnings that appear or should be added to pandas)

Comments

@klieret
Contributor

klieret commented Jun 9, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd


df = pd.DataFrame()

# Option 1: works on empty dataframes (adding an empty column)
# but shows the SettingWithCopyWarning for non-empty views/copies
df["a"] = 1

# Option 2: works without warning for views/copies but raises ValueError on empty dataframe
df.loc[:, "a"] = 1

Problem description

Let's consider a function add_column that adds a column.

  • If we use df[column] = value (Option 1), then the function will emit the SettingWithCopyWarning whenever it is called on a copy/view (even if we don't care about propagating the change to the original dataframe).
  • The recommended workaround for this warning is to use df.loc[:, column] = value (Option 2). However, this raises a ValueError as soon as the dataframe is empty, i.e. doesn't contain any rows.

This then requires ugly solutions like the following:

def add_column(df):
    if df.empty:
        # Still want to make sure to add the column to avoid KeyErrors later
        df["column"] = 1  # doesn't show SettingWithCopyWarning
        return
    df.loc[:, "column"] = 1

whenever we might be dealing with dataframes or their copies/views that are possibly empty.
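One way to avoid the branch, sketched here under the assumption that returning a new frame instead of mutating the argument is acceptable, is `DataFrame.assign`. The helper name `add_column_assign` is hypothetical and is not something proposed in this thread.

```python
import pandas as pd


def add_column_assign(df, column="column", value=1):
    """Return a new frame with `column` set to `value`.

    Hypothetical helper: `assign` builds a new DataFrame, so the input
    (which may be a slice of another frame) is never written to.
    """
    return df.assign(**{column: value})


# Empty input: the column is still added, with zero rows.
empty = add_column_assign(pd.DataFrame())

# Input that is the result of a selection on another frame.
base = pd.DataFrame({"x": [1, 2, 3]})
subset = add_column_assign(base[base["x"] > 1])
```

The trade-off is that callers have to use the return value, e.g. `df = add_column_assign(df)`, because the original object is left unchanged.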

INSTALLED VERSIONS

commit : 2cb9652
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-64-generic
Version : #58-Ubuntu SMP Fri Jul 10 19:33:51 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.18.1
pytz : 2019.2
dateutil : 2.7.3
pip : 20.3.3
setuptools : 41.1.0
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 3.0.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 5.8.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.17.0
xlrd : 2.0.1
xlwt : None
numba : 0.51.2

@klieret klieret added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2021
@klieret klieret changed the title BUG: Setting column on empty DataFrame with loc / avoiding SettingWithCopyWarning for potentially empty DataFrames BUG: Setting column on empty DataFrame with loc / avoiding SettingWithCopyWarning for potentially empty DataFrames/copies/views Jun 9, 2021
@phofl
Member

phofl commented Jun 9, 2021

What do you want to achieve with df["a"] = 1?

@klieret
Contributor Author

klieret commented Jun 9, 2021

While setting everything to 1 might not sound interesting, oftentimes one wants to set a default value to a column, e.g. np.nan. This is about the general question of how to assign a value to a new column such that

  1. If the dataframe is empty, the column is still added (to avoid KeyErrors later on, if we e.g. do df["b"] = df["a"]+1 afterwards). This is very important if the code contains many queries/selection steps such that empty dataframes might occur at virtually any point.
  2. If we have a "normal", non-empty dataframe the value is assigned to the column of course
  3. If the dataframe is a view/copy, we assign the value to the column of the view/copy without the SettingWithCopyWarning. This is important for writing practical functions that can also take e.g. the result of a query
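Point 1 can be illustrated with a small sketch. It only uses the plain `df[column] = value` assignment from the opening post; the variable names are chosen for illustration.

```python
import pandas as pd

# Empty frame that nevertheless carries column "a" (no rows, one column).
df = pd.DataFrame()
df["a"] = 1

# The follow-up computation works and simply produces another empty column.
df["b"] = df["a"] + 1

# Without the column, the same expression fails with a KeyError.
bare = pd.DataFrame()
try:
    bare["b"] = bare["a"] + 1
    missing_column_raised = False
except KeyError:
    missing_column_raised = True
```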

add_column in the opening post is an implementation of such behavior, but it's a lot of boilerplate for such a basic task.

Let me know if I'm missing anything.

No matter the context though, I think that the ValueError for the empty dataframe is also very unexpected from a user perspective.

@klieret
Contributor Author

klieret commented Jun 10, 2021

Note: I've updated my last comment a bit. Indeed, the problem doesn't occur when assigning Series or arrays, because they would have to be empty themselves. The issue with scalar values is still an annoying one, though, and it causes many bugs, because empty dataframes are usually an edge case.

The ValueError is raised here:

pandas/pandas/core/indexing.py

Lines 1648 to 1659 in 499ef8c

# add the new item, and set the value
# must have all defined axes if we have a scalar
# or a list-like on the non-info axes if we have a
# list-like
if not len(self.obj):
    if not is_list_like_indexer(value):
        raise ValueError(
            "cannot set a frame with no "
            "defined index and a scalar"
        )
    self.obj[key] = value
    return
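Read in isolation, the quoted branch raises only when the object has zero rows and the value is not list-like. The function below restates just those two visible conditions. It uses the public `pandas.api.types.is_list_like` as an approximation (pandas itself calls the internal `is_list_like_indexer` here), it ignores the surrounding code that decides whether this branch is reached at all, and the name `hits_scalar_on_empty_check` is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_list_like


def hits_scalar_on_empty_check(obj, value):
    """Restate the two conditions visible in the quoted snippet.

    Approximation only: zero rows (len(obj) == 0) combined with a
    value that is not list-like, i.e. a scalar.
    """
    return len(obj) == 0 and not is_list_like(value)


scalar_on_empty = hits_scalar_on_empty_check(pd.DataFrame(), 1)
list_on_empty = hits_scalar_on_empty_check(pd.DataFrame(), [])
scalar_on_filled = hits_scalar_on_empty_check(pd.DataFrame({"x": [1]}), 1)
```

This also matches the earlier observation that list-like values are unaffected: they skip the `raise` and fall through to `self.obj[key] = value`.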

I will see if I can come up with a possible solution myself, but do let me know if there are reasons that this shouldn't be fixed. I see that this behaviour is currently explicitly tested, but I wonder why:

msg = "cannot set a frame with no defined index and a scalar"
with pytest.raises(ValueError, match=msg):
    df.loc[:, 1] = 1

@klieret
Contributor Author

klieret commented Jun 10, 2021

The check was added in 7bbeb79 in PR #5227 addressing issue #5226 ("New appending behavior doesn't work on an empty DataFrame")

@klieret
Contributor Author

klieret commented Jun 10, 2021

Let me add @jreback to the conversation who added the check :)

@mroeschke mroeschke added Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action Warnings Warnings that appear or should be added to pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2021
@char101

char101 commented Dec 14, 2022

This also makes updating columns more complicated. For example, I want to update an OHLC dataframe so that rows where volume = 0 take the close value, since the open, high, and low values might be 0.

df = pd.DataFrame({
  'open':   [ 100,  0, 200],
  'high':   [ 105,  0, 200],
  'low':    [  95,  0, 200],
  'close':  [  95, 95, 200],
  'volume': [1000,  0, 500],
})
# This used to work; it now raises "ValueError: shape mismatch" if the loc result is empty.
df.loc[df['volume'] == 0, ['open', 'high', 'low']] = df['close']
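A possible workaround that does not depend on how `loc` handles an empty selection is to update each column with `Series.mask`. This is a sketch, not a fix for the `loc` behaviour reported above; the helper name `fill_from_close` is hypothetical, and the column names are taken from the example.

```python
import pandas as pd


def fill_from_close(df, columns=("open", "high", "low")):
    """Return a copy where zero-volume rows take the close value."""
    out = df.copy()
    no_volume = out["volume"] == 0
    for col in columns:
        # Series.mask replaces entries where the condition is True and
        # leaves the column unchanged when the condition is all False.
        out[col] = out[col].mask(no_volume, out["close"])
    return out


ohlc = pd.DataFrame({
    "open":   [100, 0, 200],
    "high":   [105, 0, 200],
    "low":    [95, 0, 200],
    "close":  [95, 95, 200],
    "volume": [1000, 0, 500],
})
fixed = fill_from_close(ohlc)

# Frame in which no row has zero volume: nothing should change.
traded = ohlc[ohlc["volume"] > 0]
unchanged = fill_from_close(traded)
```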

4 participants