-
Notifications
You must be signed in to change notification settings - Fork 166
feat(expr-ir): Support group_by
#3143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: oh-nodes
Are you sure you want to change the base?
Conversation
Mapping things out a bit, no compliant yet
There's a few gaps, but overall surprised how much was reusable 🥳
lol didn't realise it was just describing python dict behavior
Everything here seems to be working already? 😱 May as well show it off ```py >>> df.group_by("a", nwp.nth(2, 8)).agg(nwp.mean("d", "e", "g").name.suffix("_mean")) NotImplementedError: TODO: `GroupBy.agg` needs a `CompliantGroupBy` to dispatch to: keys: (a=col('a'), c=col('c'), i=col('i')) aggs: (d_mean=col('d').mean(), e_mean=col('e').mean(), g_mean=col('g').mean()) result_schema: FrozenSchema([ ('a', String), ('c', Int64), ('i', Unknown), ('d_mean', Unknown), ('e_mean', Unknown), ('g_mean', Unknown), ]) ```
Quite different to current version(s)
Gonna need space for the mini translator
woops Making it a separate node rather than having a flag https://github.com/pola-rs/polars/blob/cdd247aaba8db3332be0bd031e0f31bc3fc33f77/crates/polars-plan/src/dsl/mod.rs#L872-L889
`pyarrow` has the same behavior as `polars`
Wasn't expecting so much to be working already 🥳 🥳 🥳
Just pushing this as tests are working. Useful changes to follow: - Column renaming stuff will be avoidable - we just use `ArrowAggExpr.output_name` - Awkward stuff `first`, `last`, `_ensure_single_thread` can be avoided - `use_threads` was always available on `Declaration.to_table` - Whether we need to use can just be an `__ior__`
`ArrowDataFrame.drop_nulls` is shorter and waaaaaaay more efficient than `main`
Didn't show up as an issue until trying to drop multiple keys Surprised pyarrow doesnt support this?
class temp: # noqa: N801 | ||
"""Temporary mini namespace for temporary utils.""" | ||
|
||
_MAX_ITERATIONS: ClassVar[int] = 100 | ||
|
||
@classmethod | ||
def column_name( | ||
cls, | ||
source: _StoresColumns | Iterable[str], | ||
/, | ||
*, | ||
prefix: str = "nw", | ||
n_chars: int = 16, | ||
) -> str: | ||
"""Generate a single, unique column name that is not present in `source`.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Document + test all the temp
stuff
Rushed this a bit since I needed temp.column_names
.
Would be nice to have something to point to about why this:
nw.temp.column_name(df, prefix="count_")
is preferable to this:
subset = df.columns
nw.generate_temporary_column_name(n_bytes=8, columns=subset, prefix="count_")
RE: #3147
narwhals/_plan/dataframe.py
Outdated
# NOTE: Want to be able to call `with_columns` at compliant level - and still get the right schema | ||
# - Currently it acts like select in `group_by` | ||
# - Doing some gymnastics to workaround for now | ||
def with_columns(self, *exprs: OneOrIterable[IntoExpr], **named_exprs: Any) -> Self: | ||
named_irs, _ = self._project(exprs, named_exprs, ExprContext.WITH_COLUMNS) | ||
return self._from_compliant(self._compliant.with_columns(named_irs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Don't merge back into (#2572) without addressing this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly addressed in:
As in it'll do for now, but I really wish we could have something more like
Added these rules before I realised it described `dict`
Still need to decide what to do with named tuple
…ojection` Can accomodate - any narwhals/compliant-level frame - any reasonable transformation on a schema - the result of `freeze_schema` can be passed *back* into itself for free
- Currently just replaces the narwhals-level bits - Next part is making use of them for compliant-level
Tricky to do this in a way that doesn't end up duplicating lots of logic/layers
- Enough to pass the remaining tests - Hoping to reduce the complexity in `__iter__` next
Bug isnt present on `main`, probably need to redo the resolve stuff to fix
What type of PR is this? (check all applicable)
Related issues
Expr
IR #2572Tasks
tests/frame/group_by_test
test_double_same_aggregation
test_group_by_len_1_column
group_by(drop_null_keys=True)
drop_null_keys
withExpr
Expr
keystest_group_by_expr
case threadGroupBy.__iter__
ArrowGroupBy.__iter__
__iter__
Expr.unique
group_by testsfirst
,last
group_by testscol
)df.group_by(**named_by)
ArrowAggExpr
ArrowGroupBy
by_named_irs
may be reusable for everything but polars