-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CoW warning mode: add warning for single block setitem #55838
CoW warning mode: add warning for single block setitem #55838
Conversation
da2a9bd
to
b570356
Compare
def _get_item_cache(self, item: Hashable) -> Series: | ||
"""Return the cached item, item represents a label indexer.""" | ||
if using_copy_on_write(): | ||
if using_copy_on_write() or warn_copy_on_write(): | ||
loc = self.columns.get_loc(item) | ||
return self._ixs(loc, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to make a decision here about what to do with the item cache in the warning mode.
In the end, the current version of the PR updates the warning mode to also not populate and use the item cache, like for Cow mode.
This will cause some behaviour changes when enabling the warning mode. But it also avoids a bunch of otherwise false-positive warnings. So it's a trade-off.
Examples that would otherwise give false positives:
df = ..
df["col"].mean()
# now mutating `df` would trigger CoW / raise warning because the df["col"] series is cached
or whenever we access columns under the hood in a method, like df2 = df.astype({..})
, even when df2 fully consists of copied data because there was an actual cast, df
will have referenced because of the item cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me
@@ -12392,7 +12392,7 @@ def _inplace_method(self, other, op) -> Self: | |||
""" | |||
warn = True | |||
if not PYPY and warn_copy_on_write(): | |||
if sys.getrefcount(self) <= 5: | |||
if sys.getrefcount(self) <= 4: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive at the conclusion to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From testing in practice, but my assumption is that it is the consequence of removing the item cache, which means one reference less (so this change sounds logical in hindsight)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah that makes a lot of sense
Thx
df = DataFrame({"a": [1, 2], "b": 1}) | ||
df = df.set_index("a", drop=False) | ||
expected = df.index.copy(deep=True) | ||
df.iloc[0, 0] = 100 | ||
with tm.assert_cow_warning(warn_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a false positive to me, shouldn't set_index copy the data in non-Cow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't set_index copy the data in non-Cow?
You would assume so, but no .. (that's kind of a horrible bug I would say):
In [15]: df = DataFrame({"a": [1, 2], "b": 1})
In [16]: df = df.set_index("a", drop=False)
In [17]: df
Out[17]:
a b
a
1 1 1
2 2 1
In [18]: df.loc[1, 'a'] = 100
In [19]: df
Out[19]:
a b
a
100 100 1
2 2 1
See #42934 (it seems I actually have an old draft PR for this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes this is terrible and unexpected.
df = DataFrame( | ||
[[1, 2], [3, 4], [5, 6]], columns=date_range("2020-01-01", "2020-01-02") | ||
) | ||
df2 = df.shift(periods=1, axis=1) | ||
|
||
assert np.shares_memory(get_array(df2, "2020-01-02"), get_array(df, "2020-01-01")) | ||
df.iloc[0, 0] = 0 | ||
with tm.assert_cow_warning(warn_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is shift a view in non-cow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for axis=1 (shifting rows would never be possible as a view), and apparently yes:
In [20]: df = DataFrame(
...: [[1, 2], [3, 4], [5, 6]], columns=date_range("2020-01-01", "2020-01-02")
...: )
...: df2 = df.shift(periods=1, axis=1)
In [21]: df
Out[21]:
2020-01-01 2020-01-02
0 1 2
1 3 4
2 5 6
In [22]: df2
Out[22]:
2020-01-01 2020-01-02
0 NaN 1
1 NaN 3
2 NaN 5
In [23]: df2.iloc[0, 1] = 100
In [24]: df
Out[24]:
2020-01-01 2020-01-02
0 100 2
1 3 4
2 5 6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexpected, thx
@@ -60,7 +60,9 @@ def test_cache_updating(using_copy_on_write): | |||
df.loc[0]["z"].iloc[0] = 1.0 | |||
assert df.loc[(0, 0), "z"] == df_original.loc[0, "z"] | |||
else: | |||
df.loc[0]["z"].iloc[0] = 1.0 | |||
# TODO(CoW-warn) should raise custom warning message about chaining? | |||
with tm.assert_cow_warning(warn_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just enable the ChainedAssignmentError? Would probably have to adjust the message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just enable the ChainedAssignmentError? Would probably have to adjust the message
That's a good point. I am tackling warnings for chained setitem in #55522, and there I was now using a "FutureWarning: ChainedAssignmentError ..." (so using it as the first word of the message, to make it stand out). I could also use the actual warning class, if we are fine that it then is no FutureWarning (given this is not a "typical" deprecation, but one to explicitly opt in to, that's probably fine).
Merging, you can enable the chained assignment case in a follow up if we think that's a good idea |
Follow-up on #55428