Skip to content

REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from 1 commit
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
handle typing
  • Loading branch information
jorisvandenbossche committed May 20, 2022
commit a2aa8aa8517a1970224e2073eda708f01988afe0
10 changes: 6 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3924,15 +3924,17 @@ def _set_value(
Sets whether or not index/col interpreted as indexers
"""
try:
# setitem_inplace will do validation that may raise TypeError,
# setitem will do validation that may raise TypeError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this comment refer to column_setitem instead of just setitem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also looks like the new method doesn't actually do the validation this comment refers to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an existing comment (originally a few lines below), but so I suppose this comment was actually already not up to date anymore.
So before this PR the comment was about setitem_inplace, and that also doesn't do any validation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setitem_inplace calls np_can_hold_element, which raises on failure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I see. That is doing validation that can raise the LossySetitemError. So the new column_setitem calls setitem, which will also call np_can_hold_element, but there catching the LossySetitemError and coercing to the target dtype if needed.
That is something that the loc/iloc fallback below otherwise will also do, so I suppose this change is OK (but the comment is then indeed no longer correct, and we also don't need to catch LossySetitemError here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we also don't need to catch LossySetitemError here

yah, possibly also TypeError and ValueError

# ValueError, or LossySetitemError
# breakpoint()
if takeable:
self._mgr.column_setitem(col, index, value)
# error: Argument 2 to "column_setitem" of "BlockManager" has
# incompatible type "Union[Hashable, Sequence[Hashable]]";
# expected "Union[int, slice, ndarray[Any, Any]]"
self._mgr.column_setitem(col, index, value) # type: ignore[arg-type]
else:
icol = self.columns.get_loc(col)
index = self.index.get_loc(index)
self._mgr.column_setitem(icol, index, value)
self._mgr.column_setitem(icol, index, value) # type: ignore[arg-type]
self._clear_item_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

API-wise i think the always-inplace method is a lot nicer than the less predictable one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

The main use case of column_setitem is in _iLocIndexer._setitem_single_column, which is used for setting with loc/iloc. And for that use case, we need this to be not inplace (i.e. having the dtype coercing behaviour), since that is what we need for loc/iloc.

The case here is only for at/iat setting.

I could add a separate inplace version or add an inplace keyword to column_setitem that could be used here. That would keep the current logic more intact, but since we fallback to loc/iloc anyway when the inplace setitem fails, I am not sure it would actually be very useful.

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

Even something that is explicitly inplace from a usage perspective will need to take care of CoW in #46958

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a version with an inplace keyword in the last commit (453eaba). I lean more towards "not worth it", but either way is fine for me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel do you have a preference on keeping this change with the inplace keyword for column_setitem or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems we still need to catch TypeError, for example for a MultiIndex case where self.columns.get_loc(col) might not necessarily result in an integer.

So only removed LossySetitemError for now (and added a test case for setting with at with a MultiIndex, as that didn't yet seem to be covered in the indexing tests)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can you add a comment about why TypeError is needed. what about ValueError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a comment about TypeError.
I am not fully sure about the ValueError. Is it possible for a setitem operation to raise a ValueError? (it seems that validation methods (like _validate_setitem_value) will mostly raise TypeErrors?)

Now, this catching of a ValueError alraedy was here before, so I am hesitant to remove it without looking further in detail at it. I would prefer to leave that for another PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are some cases where setitem actually raises a ValueError, eg when setting with an array-like that is not of length 1 (in this case of scalar indexer at or iat).
Now, the fallback to loc/iloc will then most likely also raise a ValueError (so catching it might not necessarily add that much). But at least in some cases it seems to change the error message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:

>>> df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="Int64"), 'b': [0.1, 0.2, 0.3]})
>>> df.iat[0, 0] = np.array([0, 1])
...
ValueError: Must have equal len keys and value when setting with an iterable

Without catching/reraising in iloc, the error would be "ValueError: setting an array element with a sequence", which is slightly less informative.

I suppose we should actually see to make this handling consistent in the different code paths (so it directly raises the more informative error message in the first place), but that's out of scope for this PR.


except (KeyError, TypeError, ValueError, LossySetitemError):
Expand Down