DISC/PERF: Laziness #52980

@jbrockmendel

Motivation: profiling https://github.com/pola-rs/tpch/tree/main/pandas_queries I saw a ton of time spent in frame[mask], but what is actually used is more like frame[mask][small_subset_of_columns]. If we did the frame[mask] part lazily, we could skip filtering the columns that are never used, which would be a huge performance improvement in cases like this.
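To make the pattern concrete, here is a minimal sketch with made-up data (the frame, mask, and column names are illustrative, not taken from the tpch queries). It only shows that selecting the columns before masking gives the same result as masking first; a lazy frame[mask] could in principle do this reordering for the user.

```python
import numpy as np
import pandas as pd

# Illustrative data: 20 float columns, of which only 2 are used downstream.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.standard_normal((1000, 20)),
    columns=[f"c{i}" for i in range(20)],
)
mask = frame["c0"] > 0
small_subset_of_columns = ["c1", "c2"]

# Eager pattern from the profile: filter all 20 columns, then keep 2.
eager = frame[mask][small_subset_of_columns]

# Hand-written reordering: pick the 2 columns first, then filter only those.
reordered = frame[small_subset_of_columns][mask]

assert eager.equals(reordered)
```

How much time the reordering saves depends on the frame width and dtypes, so it would need to be measured on the actual queries.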

Discussion Question 1: How common are cases like this? What other cases could benefit from laziness?

Discussion Question 2: What would it take to implement laziness? Would this be worth it?

Most of this is brainstorming on these topics. No strong opinion yet.


Cases that benefit

  1. df[mask] mentioned above
  2. Operator fusion on arithmetic: (a + b*c - d) / e can likely be optimized similarly to how numexpr optimizes expressions in pd.eval.
  3. ... [fill in as they are suggested]
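For reference on case 2, this is roughly what the existing pd.eval route looks like today. The Series a through e are made-up inputs; pd.eval only uses numexpr when it is installed and otherwise falls back to the plain Python engine, so any speedup here is an assumption to verify by timing.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
a, b, c, d = (pd.Series(rng.standard_normal(n)) for _ in range(4))
e = pd.Series(rng.uniform(1.0, 2.0, n))  # kept away from zero

# Eager evaluation: each operator allocates an intermediate Series.
eager = (a + b * c - d) / e

# pd.eval receives the whole expression at once, so the numexpr engine
# (if available) can evaluate it without the intermediates.
fused = pd.eval(
    "(a + b * c - d) / e",
    local_dict={"a": a, "b": b, "c": c, "d": d, "e": e},
)

np.testing.assert_allclose(fused, eager)
```

A lazy implementation would aim for the same effect without asking the user to write the expression as a string.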

Implementation (triple tentative)
Could make Block.values a cache_readonly instead of an ArrayLike so it only gets evaluated when needed. Oftentimes we need .shape or .dtype before we need the actual values, and in some cases we can determine those ahead of time in O(1) time. This reminds me of issues/PRs I frequently see on the modin tracker, maybe @YarShev or @anmyachev can comment on the tradeoffs here?
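A rough sketch of the cached-values idea, outside of pandas. LazyMaskedValues is a hypothetical class, not a pandas API; functools.cached_property stands in for pandas' internal cache_readonly, and the data is a plain 2D ndarray rather than the layout Block actually uses.

```python
from functools import cached_property

import numpy as np


class LazyMaskedValues:
    """Hypothetical stand-in for a block whose values are a deferred row filter."""

    def __init__(self, parent: np.ndarray, mask: np.ndarray) -> None:
        self._parent = parent
        self._mask = mask
        self.materialized = False  # bookkeeping for the demo only

    @property
    def dtype(self) -> np.dtype:
        # Known in O(1) without touching the data.
        return self._parent.dtype

    @cached_property
    def shape(self) -> tuple:
        # Needs one pass over the mask, but not over the data.
        return (int(self._mask.sum()),) + self._parent.shape[1:]

    @cached_property
    def values(self) -> np.ndarray:
        # The actual filtering happens on first access and is then cached.
        self.materialized = True
        return self._parent[self._mask]


parent = np.arange(12.0).reshape(6, 2)
mask = np.array([True, False, True, False, False, True])
lazy = LazyMaskedValues(parent, mask)

assert lazy.dtype == np.float64
assert lazy.shape == (3, 2)
assert not lazy.materialized  # dtype/shape did not trigger the filter

vals = lazy.values
assert lazy.materialized
```

Note that for a boolean mask the shape is not O(1), since the mask has to be counted; dtype is. Whether that still pays off is part of the question.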

Alternatively could add a new layer either between DataFrame/Manager or between Manager/Block. Extra layers can hurt perf-wise, but this might be the right move w/r/t separation of concerns.

One potential footgun would be something like df.where(mask, other): if other is mutable, we could get incorrect behavior if we delayed the evaluation and other was mutated in the meantime.
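A small sketch of that footgun. defer_where is a hypothetical helper that simulates delayed evaluation with a closure; nothing like it exists in pandas. It also shows the obvious mitigation (copying other at deferral time), which costs memory and so eats into the benefit.

```python
import numpy as np
import pandas as pd


def defer_where(frame, cond, other, copy_other):
    """Hypothetical deferred df.where: returns a thunk that is evaluated later."""
    if copy_other:
        other = other.copy()  # snapshot `other` at deferral time
    return lambda: frame.where(cond, other)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
mask = df > 1.5
other = np.array([[10.0], [20.0], [30.0]])

unsafe = defer_where(df, mask, other, copy_other=False)
safe = defer_where(df, mask, other, copy_other=True)

# The caller mutates `other` before the deferred work runs.
other[:] = -1.0

unsafe_result = unsafe()["x"].tolist()  # sees the mutated values
safe_result = safe()["x"].tolist()      # sees the values at deferral time

assert unsafe_result == [-1.0, 2.0, 3.0]
assert safe_result == [10.0, 2.0, 3.0]
```

The same concern applies to df and mask themselves if they can be mutated in place before evaluation.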

cc @TomAugspurger any insights?

Labels: Needs Discussion (requires discussion from core team before further action), Performance (memory or execution speed performance)
