Description
Motivation: profiling https://github.com/pola-rs/tpch/tree/main/pandas_queries I saw a ton of time spent in `frame[mask]`, but what is actually used is more like `frame[mask][small_subset_of_columns]`. If we did the `frame[mask]` part lazily, we'd get a huge performance improvement in cases like this.
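To make the pattern concrete, here is a minimal sketch of the eager form next to a hand-reordered equivalent that selects columns first. The frame, mask, and column names are made up for illustration; the reordering is roughly what a lazy `frame[mask]` could do automatically, not a description of anything pandas does today.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 float columns, of which only 2 are needed downstream.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.standard_normal((1000, 20)),
    columns=[f"c{i}" for i in range(20)],
)
mask = frame["c0"] > 0
cols = ["c1", "c2"]

# Eager pattern from the profile: filter every column, then keep two of them.
eager = frame[mask][cols]

# Manual reordering: pick the two columns first, then filter only those.
narrow = frame[cols][mask]
```

Both expressions produce the same result; the second just avoids copying the 18 columns that get thrown away.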
Discussion Question 1: How common are cases like this? What other cases could benefit from laziness?
Discussion Question 2: What would it take to implement laziness? Would this be worth it?
Most of this is brainstorming on these topics. No strong opinion yet.
Cases that benefit
- `df[mask]`, mentioned above
- operator fusing on arithmetic: `(a + b*c - d) / e` can likely be optimized similar to how numexpr optimizes in `pd.eval`.
- ... [fill in as they are suggested]
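For reference, the arithmetic case can already be fused explicitly today via `DataFrame.eval`, which is built on `pd.eval` and uses numexpr when it is installed (otherwise it falls back to plain Python evaluation). A small sketch with made-up columns `a` through `e`:

```python
import numpy as np
import pandas as pd

# Hypothetical columns; values drawn from [1, 2) so the division is well defined.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1.0, 2.0, size=(100_000, 5)), columns=list("abcde"))

# Eager form: each operator allocates a full-length intermediate result.
eager = (df["a"] + df["b"] * df["c"] - df["d"]) / df["e"]

# Explicitly fused form: the whole expression is handed to the eval engine at once.
fused = df.eval("(a + b*c - d) / e")
```

The lazy proposal would amount to getting the second behavior from the first spelling.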
Implementation (triple tentative)
Could make `Block.values` a `cache_readonly` instead of an ArrayLike so it only gets evaluated when needed. Oftentimes we need `.shape` or `.dtype` before we need the actual values, and in some cases we can determine those ahead of time in O(1) time. This reminds me of issues/PRs I frequently see on the modin tracker, maybe @YarShev or @anmyachev can comment on the tradeoffs here?
Alternatively could add a new layer either between DataFrame/Manager or between Manager/Block. Extra layers can hurt perf-wise, but this might be the right move w/r/t separation of concerns.
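A toy sketch of the first option, to show the shape of the idea. `LazyMaskedBlock` and `take_columns` are hypothetical names, not pandas internals, and `functools.cached_property` stands in for pandas' internal `cache_readonly`. The array is laid out as (n_columns, n_rows), the way a 2D Block stores its values.

```python
from functools import cached_property

import numpy as np


class LazyMaskedBlock:
    """Hypothetical block holding a pending row mask instead of masked values."""

    def __init__(self, parent_values, mask):
        self._parent = parent_values  # 2D array, shape (n_columns, n_rows)
        self._mask = mask             # 1D boolean array over rows
        self.materialized = False     # bookkeeping for the demo only

    @property
    def dtype(self):
        # Known in O(1) without touching the data.
        return self._parent.dtype

    @property
    def shape(self):
        # Needs a count over the mask, but no copy of the values.
        return (self._parent.shape[0], int(self._mask.sum()))

    @cached_property
    def values(self):
        # The expensive masked copy happens here, once, on first access.
        self.materialized = True
        return self._parent[:, self._mask]

    def take_columns(self, positions):
        # Narrow to a few columns while keeping the mask pending.
        return LazyMaskedBlock(self._parent[positions], self._mask)


block = LazyMaskedBlock(
    np.arange(12.0).reshape(3, 4),
    np.array([True, False, True, False]),
)
subset = block.take_columns([0])
subset_values = subset.values  # only the one requested column is ever masked
```

In this sketch the full block never materializes its values; whether the real Block/Manager code can tolerate values that are not available up front is exactly the open question.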
One potential footgun would be something like `df.where(mask, other)`: if `other` is mutable, we could get incorrect behavior if we delayed the evaluation and then `other` was mutated.
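A small illustration of the footgun, simulating delayed evaluation with a plain closure (this is not how a real lazy implementation would look). Copying `other` when the operation is deferred is one possible mitigation, shown here as an assumption rather than a recommendation, since the copy gives back some of the savings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
mask = df > 1
other = np.array([[10.0], [20.0], [30.0]])

# Eager: `other` is read immediately, so later mutation cannot affect the result.
eager = df.where(mask, other)

# Simulated laziness: the work is deferred and still references `other`.
deferred = lambda: df.where(mask, other)

# Possible mitigation: snapshot `other` at the time the operation is deferred.
snapshot = other.copy()
deferred_safe = lambda: df.where(mask, snapshot)

# The caller mutates `other` before the deferred result is needed.
other[0, 0] = -1.0

late = deferred()            # picks up the mutated value
late_safe = deferred_safe()  # matches the eager result
```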
cc @TomAugspurger any insights?