PERF: Remove _item_cache #50547

Closed
@lithomas1

Description

Discussion copied over from #49450

From the OP of #49450 (which discusses turning on the _item_cache for CoW):

Context:

Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a column, we cache the resulting Series in df._item_cache; the next time we access that column, we first check whether it already exists in the cache and, if so, return it directly. I suppose this was done to make repeated column access faster (although I think the Series construction overhead for this fast path has also improved). But it also has some behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> df['b'] = 10 # set existing column -> clears the item cache
>>> s3 = df["a"]
>>> s1 is s2
True
>>> s1 is s3
False
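
To make the identity behavior above easier to reason about, here is a minimal pure-Python sketch of the caching pattern being described. It is not pandas code: `ToyFrame` and its internals are hypothetical names for illustration, plain lists stand in for Series objects, and the only things taken from the description are the `_item_cache` attribute name and the rule that setting a column clears the cache.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame with an item cache (not pandas code)."""

    def __init__(self, data):
        # Column name -> list of values (assumed storage, for illustration only)
        self._data = {key: list(values) for key, values in data.items()}
        self._item_cache = {}

    def __getitem__(self, key):
        # Fast path: hand back the previously built column object, if any
        if key in self._item_cache:
            return self._item_cache[key]
        # Slow path: build a new object (stands in for Series construction)
        column = list(self._data[key])
        self._item_cache[key] = column
        return column

    def __setitem__(self, key, value):
        # Broadcast a scalar to the current number of rows
        n_rows = len(next(iter(self._data.values()), []))
        if not isinstance(value, list):
            value = [value] * n_rows
        self._data[key] = value
        # Setting a column clears the whole cache, as in the example above
        self._item_cache.clear()


df = ToyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s1 = df["a"]
s2 = df["a"]
df["b"] = 10  # clears the item cache
s3 = df["a"]
print(s1 is s2)  # True
print(s1 is s3)  # False
```

In this sketch, whether two accesses return the same object depends entirely on whether a column was set in between, which is the context dependence the issue points out.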

This caching can also have other side effects, though. In investigating #29411, I found that methods like memory_usage (from a quick glance at frame.py, round and duplicated may also be affected) that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.
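
The side effect can be illustrated with a small self-contained sketch. Again, this is a toy model rather than pandas internals: `CachingFrame` is a hypothetical class, its `items()` and `memory_usage()` are simplified stand-ins, and the assumption being modeled is only that `items()` goes through the same cached column access as `df["a"]`.

```python
import sys


class CachingFrame:
    """Hypothetical frame whose items() goes through the item cache (not pandas code)."""

    def __init__(self, data):
        self._data = {key: list(values) for key, values in data.items()}
        self._item_cache = {}

    def __getitem__(self, key):
        # Build the column object on first access and keep it alive in the cache
        if key not in self._item_cache:
            self._item_cache[key] = list(self._data[key])
        return self._item_cache[key]

    def items(self):
        # Iterating columns reuses __getitem__, so every column gets cached
        for key in self._data:
            yield key, self[key]

    def memory_usage(self):
        # A read-only report that still populates the cache as a side effect
        return {key: sys.getsizeof(column) for key, column in self.items()}


wide = CachingFrame({f"col{i}": range(1000) for i in range(50)})
print(len(wide._item_cache))  # 0
usage = wide.memory_usage()
print(len(wide._item_cache))  # 50
```

A call that looks purely informational leaves one cached object per column behind, so the cost grows with the number of columns.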

Removing the cache might be tricky, though: as Joris noted, it would be a behavior change.
We should discuss here how we want to go about doing this (does it need a deprecation?).

Labels: Needs Discussion (requires discussion from core team before further action), Performance (memory or execution speed performance)
