Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoW warning mode: add warning for single block setitem #55838

Merged

Conversation

jorisvandenbossche
Copy link
Member

Follow-up on #55428

@jorisvandenbossche jorisvandenbossche force-pushed the cow-warnings-setitem-single-block branch from da2a9bd to b570356 Compare November 6, 2023 15:38
Comment on lines 4540 to 4544
def _get_item_cache(self, item: Hashable) -> Series:
"""Return the cached item, item represents a label indexer."""
if using_copy_on_write():
if using_copy_on_write() or warn_copy_on_write():
loc = self.columns.get_loc(item)
return self._ixs(loc, axis=1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to make a decision here about what to do with the item cache in the warning mode.

In the end, the current version of the PR updates the warning mode to also not populate and use the item cache, like for Cow mode.

This will cause some behaviour changes when enabling the warning mode. But it also avoids a bunch of otherwise false-positive warnings. So it's a trade-off.

Examples that would otherwise give false positives:

df =  ..
df["col"].mean()
# now mutating `df` would trigger CoW / raise warning because the df["col"] series is cached

or whenever we access columns under the hood in a method, like df2 = df.astype({..}), even when df2 fully consists of copied data because there was an actual cast, df will have referenced because of the item cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 7, 2023 10:40
@@ -12392,7 +12392,7 @@ def _inplace_method(self, other, op) -> Self:
"""
warn = True
if not PYPY and warn_copy_on_write():
if sys.getrefcount(self) <= 5:
if sys.getrefcount(self) <= 4:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you arrive at the conclusion to change this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From testing in practice, but my assumption is that it is the consequence of removing the item cache, which means one reference less (so this change sounds logical in hindsight)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah that makes a lot of sense

Thx

df = DataFrame({"a": [1, 2], "b": 1})
df = df.set_index("a", drop=False)
expected = df.index.copy(deep=True)
df.iloc[0, 0] = 100
with tm.assert_cow_warning(warn_copy_on_write):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a false positive to me, shouldn't set_index copy the data in non-Cow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't set_index copy the data in non-Cow?

You would assume so, but no .. (that's kind of a horrible bug I would say):

In [15]: df = DataFrame({"a": [1, 2], "b": 1})

In [16]: df = df.set_index("a", drop=False)

In [17]: df
Out[17]: 
   a  b
a      
1  1  1
2  2  1

In [18]: df.loc[1, 'a'] = 100

In [19]: df
Out[19]: 
       a  b
a          
100  100  1
2      2  1

See #42934 (it seems I actually have an old draft PR for this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes this is terrible and unexpected.

df = DataFrame(
[[1, 2], [3, 4], [5, 6]], columns=date_range("2020-01-01", "2020-01-02")
)
df2 = df.shift(periods=1, axis=1)

assert np.shares_memory(get_array(df2, "2020-01-02"), get_array(df, "2020-01-01"))
df.iloc[0, 0] = 0
with tm.assert_cow_warning(warn_copy_on_write):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is shift a view in non-cow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for axis=1 (shifting rows would never be possible as a view), and apparently yes:

In [20]:     df = DataFrame(
    ...:         [[1, 2], [3, 4], [5, 6]], columns=date_range("2020-01-01", "2020-01-02")
    ...:     )
    ...:     df2 = df.shift(periods=1, axis=1)

In [21]: df
Out[21]: 
   2020-01-01  2020-01-02
0           1           2
1           3           4
2           5           6

In [22]: df2
Out[22]: 
   2020-01-01  2020-01-02
0         NaN           1
1         NaN           3
2         NaN           5

In [23]: df2.iloc[0, 1] = 100

In [24]: df
Out[24]: 
   2020-01-01  2020-01-02
0         100           2
1           3           4
2           5           6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unexpected, thx

@@ -60,7 +60,9 @@ def test_cache_updating(using_copy_on_write):
df.loc[0]["z"].iloc[0] = 1.0
assert df.loc[(0, 0), "z"] == df_original.loc[0, "z"]
else:
df.loc[0]["z"].iloc[0] = 1.0
# TODO(CoW-warn) should raise custom warning message about chaining?
with tm.assert_cow_warning(warn_copy_on_write):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just enable the ChainedAssignmentError? Would probably have to adjust the message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just enable the ChainedAssignmentError? Would probably have to adjust the message

That's a good point. I am tackling warnings for chained setitem in #55522, and there I was now using a "FutureWarning: ChainedAssignmentError ..." (so using it as the first word of the message, to make it stand out). I could also use the actual warning class, if we are fine that it then is no FutureWarning (given this is not a "typical" deprecation, but one to explicitly opt in to, that's probably fine).

@phofl
Copy link
Member

phofl commented Nov 10, 2023

Merging, you can enable the chained assignment case in a follow up if we think that's a good idea

@phofl phofl merged commit 09ed69e into pandas-dev:main Nov 10, 2023
@phofl
Copy link
Member

phofl commented Nov 10, 2023

thx @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants