DISC/PERF: Laziness

Motivation: while profiling https://github.com/pola-rs/tpch/tree/main/pandas_queries, I saw a ton of time spent in `frame[mask]`, but what is actually used is more like `frame[mask][small_subset_of_columns]`.  If we did the `frame[mask]` part lazily, we could skip masking the columns that are thrown away afterwards, which should be a big performance improvement in cases like this.
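To make the pattern concrete, here is a rough sketch with made-up data (the frame, column names and sizes are all assumptions, not taken from the tpch queries).  Selecting the columns before masking is the hand-written equivalent of what a lazy `frame[mask]` could in principle do automatically.

```python
import numpy as np
import pandas as pd

# Made-up data: 20 float columns, of which only 2 are needed afterwards.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.standard_normal((1000, 20)),
    columns=[f"c{i}" for i in range(20)],
)
mask = frame["c0"] > 0
small_subset_of_columns = ["c1", "c2"]

# Eager pattern from the motivation: masks all 20 columns, then drops 18.
eager = frame[mask][small_subset_of_columns]

# Manual reordering: select the columns first, so only 2 columns are masked.
reordered = frame[small_subset_of_columns][mask]
```

Both spellings give the same result; the difference is only in how much data gets copied along the way.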

Discussion Question 1: How common are cases like this?  What other cases could benefit from laziness?

Discussion Question 2: What would it take to implement laziness?  Would this be worth it?

Most of this is brainstorming on these topics.  No strong opinion yet.

-----
Cases that benefit
1) `df[mask]` mentioned above
2) operator fusion on arithmetic: `(a + b*c - d) / e` can likely be optimized similarly to how numexpr optimizes expressions in `pd.eval`.
3) ... [fill in as they are suggested]
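
For case 2, a small illustration of what already exists today: `DataFrame.eval` takes the whole expression as a string, so the numexpr engine (when numexpr is installed; otherwise pandas falls back to the python engine) can evaluate it without materializing each intermediate result.  The data below is made up, and this only shows the explicit opt-in API, not a lazy implementation.

```python
import numpy as np
import pandas as pd

# Made-up data; shifted away from zero so the division is well behaved.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((1000, 5)) + 5.0,
    columns=["a", "b", "c", "d", "e"],
)

# Eager: each operator allocates a full-length intermediate Series.
eager_result = (df["a"] + df["b"] * df["c"] - df["d"]) / df["e"]

# Explicit opt-in: the whole expression is handed to the eval engine at once.
fused_result = df.eval("(a + b * c - d) / e")
```

A lazy design would presumably build the equivalent of that expression string from ordinary operator calls and only evaluate it when the values are needed.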

Implementation (triple tentative)
Could make `Block.values` a `cache_readonly` instead of an ArrayLike so it only gets evaluated when needed.  Oftentimes we need .shape or .dtype before we need the actual values, and in some cases we can determine those ahead of time in O(1) time.  This reminds me of issues/PRs I frequently see on the modin tracker.  Maybe @yarshev or @anmyachev can comment on the tradeoffs here?
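
A toy sketch of that idea, using `functools.cached_property` from the standard library as a stand-in for `cache_readonly`.  `LazyMaskedValues` is a hypothetical class invented for illustration and is not how pandas' `Block` is written.  Note that for a boolean mask the dtype is known in O(1), but the shape still needs a count over the mask (cheap compared to copying the data, though not O(1)).

```python
from functools import cached_property

import numpy as np


class LazyMaskedValues:
    """Hypothetical stand-in for a Block whose masked values are deferred."""

    def __init__(self, parent: np.ndarray, mask: np.ndarray) -> None:
        self._parent = parent
        self._mask = mask
        self.evaluated = False  # only here to make the laziness observable

    @property
    def dtype(self) -> np.dtype:
        # Known without touching the data: masking does not change the dtype.
        return self._parent.dtype

    @cached_property
    def shape(self) -> tuple:
        # Needs a pass over the mask, but no copy of the parent data.
        return (int(np.count_nonzero(self._mask)),) + self._parent.shape[1:]

    @cached_property
    def values(self) -> np.ndarray:
        # The expensive copy happens on first access only, then is cached.
        self.evaluated = True
        return self._parent[self._mask]


parent = np.arange(12.0).reshape(6, 2)
mask = np.array([True, False, True, False, False, True])
lazy = LazyMaskedValues(parent, mask)

dtype_before = lazy.dtype
shape_before = lazy.shape
evaluated_before = lazy.evaluated  # still False: no values materialized yet
first = lazy.values                # triggers the actual masking
second = lazy.values               # served from the cache
```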

Alternatively could add a new layer either between DataFrame/Manager or between Manager/Block.  Extra layers can hurt perf-wise, but this might be the right move w/r/t separation of concerns.

One potential footgun would be something like `df.where(mask, other)`: if `other` is mutable, we could get incorrect behavior if we delayed the evaluation and `other` was mutated in the meantime.
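
A minimal reproduction of that footgun, simulating the delayed evaluation with a plain function that holds a reference to `other` (the `deferred` helper is hypothetical and only stands in for whatever a lazy implementation would do):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
mask = df > 2
other = pd.DataFrame({"x": [10, 20, 30, 40]})

# Eager: `other` is read right now.
eager = df.where(mask, other)


def deferred() -> pd.DataFrame:
    # Hypothetical lazy version: `other` is only read when this is called.
    return df.where(mask, other)


# The user mutates `other` after "calling" where but before evaluation.
other["x"] = -1

late = deferred()

eager_x = eager["x"].tolist()  # [10, 20, 3, 4]
late_x = late["x"].tolist()    # [-1, -1, 3, 4]
```

Avoiding this would mean either copying `other` up front (which gives back some of the savings) or tracking mutation of the inputs somehow.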

cc @TomAugspurger any insights?


DISC/PERF: Laziness #52980
