feat(expr-ir): Support `group_by` #3143

dangotbanned · 2025-09-18T18:51:34Z

What type of PR is this? (check all applicable)

✨ Feature

Related issues

Child of feat(RFC): A richer Expr IR #2572

Tasks

Mapping things out a bit, no compliant yet

There's a few gaps, but overall surprised how much was reusable 🥳

lol didn't realise it was just describing python dict behavior

Everything here seems to be working already? 😱 May as well show it off

```py
>>> df.group_by("a", nwp.nth(2, 8)).agg(nwp.mean("d", "e", "g").name.suffix("_mean"))
NotImplementedError: TODO: `GroupBy.agg` needs a `CompliantGroupBy` to dispatch to:
keys: (a=col('a'), c=col('c'), i=col('i'))
aggs: (d_mean=col('d').mean(), e_mean=col('e').mean(), g_mean=col('g').mean())
result_schema: FrozenSchema([
    ('a', String),
    ('c', Int64),
    ('i', Unknown),
    ('d_mean', Unknown),
    ('e_mean', Unknown),
    ('g_mean', Unknown),
])
```
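For anyone reading the output above cold, here's a rough plain-Python sketch of how the key and aggregation names line up. It is not the narwhals implementation: the column order `a`..`i` is an assumption (chosen because it matches index 2 being `c` and index 8 being `i` in the output), and the helpers `nth` and `suffix` are hypothetical stand-ins.

```python
from __future__ import annotations

# Assumed column order, consistent with the output above (index 2 -> "c", index 8 -> "i").
columns = list("abcdefghi")


def nth(names: list[str], *indices: int) -> list[str]:
    """Select column names by position (hypothetical helper)."""
    return [names[i] for i in indices]


def suffix(names: list[str], tail: str) -> list[str]:
    """Append `tail` to every name (hypothetical helper)."""
    return [f"{name}{tail}" for name in names]


# Keys come first in the result, followed by the renamed aggregations.
keys = ["a", *nth(columns, 2, 8)]
agg_names = suffix(["d", "e", "g"], "_mean")
result_names = [*keys, *agg_names]
print(result_names)
```

The resulting order matches the field order in `result_schema` above: keys first, then aggregations.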

Quite different to current version(s)

oops https://github.com/narwhals-dev/narwhals/actions/runs/17838467107/job/50721552166?pr=3143

Gonna need space for the mini translator

Borrowing some ideas from #2528, #2680

woops Making it a separate node rather than having a flag https://github.com/pola-rs/polars/blob/cdd247aaba8db3332be0bd031e0f31bc3fc33f77/crates/polars-plan/src/dsl/mod.rs#L872-L889

https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/narwhals/_arrow/group_by.py#L132-L141

#2660 (comment)

`pyarrow` has the same behavior as `polars`

Wasn't expecting so much to be working already 🥳 🥳 🥳

https://github.com/narwhals-dev/narwhals/actions/runs/17869832358/job/50820920705?pr=3143

Just pushing this as tests are working. Useful changes to follow:
- Column renaming stuff will be avoidable
  - we just use `ArrowAggExpr.output_name`
- Awkward stuff `first`, `last`, `_ensure_single_thread` can be avoided
  - `use_threads` was always available on `Declaration.to_table`
  - Whether we need to use can just be an `__ior__`

From #2528 https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/tests/frame/group_by_test.py#L686-L781

`ArrowDataFrame.drop_nulls` is shorter and waaaaaaay more efficient than `main`

Didn't show up as an issue until trying to drop multiple keys. Surprised pyarrow doesn't support this?
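A rough sketch of the general idea, folding any number of per-key null masks into one with `functools.reduce` and `operator.or_`. Plain Python ints are used as bitmasks here as a stand-in for boolean arrays; the data and the `null_mask` helper are made up for illustration, and this doesn't claim to reproduce the actual pyarrow issue.

```python
from functools import reduce
import operator

rows = [
    {"a": "x", "b": 1, "c": None},
    {"a": None, "b": 2, "c": 3.0},
    {"a": "y", "b": 3, "c": 4.0},
    {"a": "z", "b": None, "c": 5.0},
]
keys = ["a", "b", "c"]


def null_mask(key: str) -> int:
    """Bit i is set when row i has a null in `key` (hypothetical helper)."""
    return sum(1 << i for i, row in enumerate(rows) if row[key] is None)


# `reduce` applies the binary `|` pairwise, so it works for any number of keys.
any_null = reduce(operator.or_, (null_mask(key) for key in keys))

# Keep only the rows where no key is null.
kept = [row for i, row in enumerate(rows) if not (any_null >> i) & 1]
print(kept)
```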

narwhals/_plan/_expansion.py

dangotbanned · 2025-09-23T13:43:40Z

narwhals/_plan/common.py

+class temp:  # noqa: N801
+    """Temporary mini namespace for temporary utils."""
+
+    _MAX_ITERATIONS: ClassVar[int] = 100
+
+    @classmethod
+    def column_name(
+        cls,
+        source: _StoresColumns | Iterable[str],
+        /,
+        *,
+        prefix: str = "nw",
+        n_chars: int = 16,
+    ) -> str:
+        """Generate a single, unique column name that is not present in `source`."""


TODO: Document + test all the temp stuff

Rushed this a bit since I needed temp.column_names.

Would be nice to have something to point to about why this:

nw.temp.column_name(df, prefix="count_")

is preferable to this:

subset = df.columns
nw.generate_temporary_column_name(n_bytes=8, columns=subset, prefix="count_")

RE: #3147
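Until the real docs exist, here's a minimal sketch of what a helper with that signature might do. This is an assumption-heavy illustration, not the code in the PR: it only handles an iterable of names (not `_StoresColumns`), treats `n_chars` as the total length of the generated name, and uses `secrets.token_hex` for the random part.

```python
from __future__ import annotations

import secrets
from collections.abc import Iterable

MAX_ITERATIONS = 100  # mirrors `_MAX_ITERATIONS` from the diff


def column_name(
    source: Iterable[str], /, *, prefix: str = "nw", n_chars: int = 16
) -> str:
    """Return a random name starting with `prefix` that is not present in `source`."""
    existing = set(source)
    # Assumption: `n_chars` is the total length, so the prefix eats into it.
    n_random = n_chars - len(prefix)
    if n_random <= 0:
        msg = f"`prefix` must be shorter than `n_chars` ({n_chars})"
        raise ValueError(msg)
    for _ in range(MAX_ITERATIONS):
        # `token_hex(n)` yields 2 * n hex characters; trim to the requested length.
        candidate = prefix + secrets.token_hex((n_random + 1) // 2)[:n_random]
        if candidate not in existing:
            return candidate
    msg = f"Could not generate a unique column name in {MAX_ITERATIONS} attempts"
    raise RuntimeError(msg)


name = column_name(["a", "b", "count_"], prefix="count_")
print(name)
```

One thing the sketch makes visible: the caller passes the source once and doesn't have to think about byte counts, which is roughly the ergonomic argument for the first spelling.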

dangotbanned · 2025-09-23T13:49:06Z

narwhals/_plan/dataframe.py

+    # NOTE: Want to be able to call `with_columns` at compliant level - and still get the right schema
+    # - Currently it acts like select in `group_by`
+    # - Doing some gymnastics to workaround for now
    def with_columns(self, *exprs: OneOrIterable[IntoExpr], **named_exprs: Any) -> Self:
        named_irs, _ = self._project(exprs, named_exprs, ExprContext.WITH_COLUMNS)
        return self._from_compliant(self._compliant.with_columns(named_irs))


Important
Don't merge back into (#2572) without addressing this

Mostly addressed in:

refactor: Refining schema projections

As in it'll do for now, but I really wish we could have something more like

https://github.com/pola-rs/polars/blob/9b430ebcffe49355578ba567b6706f3934dd0b03/crates/polars-plan/src/dsl/plan.rs

Added these rules before I realised it described `dict`
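For reference, this is the `dict` behavior in question, shown with the `|` operator (Python 3.9+). The dtype names are just placeholder strings; whether `FrozenSchema.merge` matches this in every detail isn't shown here.

```python
left = {"a": "String", "c": "Int64", "d": "Float64"}
right = {"c": "Float64", "e": "Boolean"}

merged = left | right

# "c" keeps its original position but takes the right-hand value;
# "e" is new, so it is appended at the end.
print(list(merged.items()))
```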

narwhals/_plan/group_by.py

#3143 (comment)

Still need to decide what to do with named tuple

…ojection` Can accommodate:
- any narwhals/compliant-level frame
- any reasonable transformation on a schema
- the result of `freeze_schema` can be passed *back* into itself for free
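One way the "passed back into itself for free" property could work is an identity short-circuit, sketched below. Everything here is hypothetical: `Frozen` and `freeze` are stand-ins, not the actual `FrozenSchema` / `freeze_schema`, and the real code may achieve this differently.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator, Mapping


class Frozen(Mapping):
    """Hypothetical immutable name -> dtype mapping (stand-in for `FrozenSchema`)."""

    def __init__(self, items: Iterable[tuple[str, str]]) -> None:
        self._lookup = dict(items)

    def __getitem__(self, name: str) -> str:
        return self._lookup[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._lookup)

    def __len__(self) -> int:
        return len(self._lookup)


def freeze(into: Frozen | Mapping[str, str] | Iterable[tuple[str, str]]) -> Frozen:
    """Normalize several input shapes into a `Frozen`."""
    if isinstance(into, Frozen):
        # Already frozen: return as-is, no copy.
        return into
    if isinstance(into, Mapping):
        return Frozen(into.items())
    return Frozen(into)


once = freeze({"a": "String", "c": "Int64"})
twice = freeze(once)
print(twice is once)
```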

- Currently just replaces the narwhals-level bits
- Next part is making use of them for compliant-level

https://github.com/narwhals-dev/narwhals/actions/runs/18021842115/job/51280865880

Tricky to do this in a way that doesn't end up duplicating lots of logic/layers

- Enough to pass the remaining tests
- Hoping to reduce the complexity in `__iter__` next

Bug isn't present on `main`, probably need to redo the resolve stuff to fix

Didn't realize on main we store two `ArrowDataFrame`s:
https://github.com/narwhals-dev/narwhals/blob/63c5022e347cef3f821f725350cd9d39e6e476c6/narwhals/_arrow/group_by.py#L139
https://github.com/narwhals-dev/narwhals/blob/63c5022e347cef3f821f725350cd9d39e6e476c6/narwhals/_arrow/group_by.py#L158

dangotbanned added 10 commits September 15, 2025 18:19

feat(expr-ir): Getting started on GroupBy

4d33b68

Mapping things out a bit, no compliant yet

feat(DRAFT): mock up resolve_group_by

3718690

There's a few gaps, but overall surprised how much was reusable 🥳

fix: re-sync GroupByKeys

f70c021

feat: Make rewrite_projections(keys) optional

3828ea4

feat: Add FrozenSchema.merge

feb7661

lol didn't realise it was just describing python dict behavior

feat(DRAFT): Start spec-ing CompliantGroupBy

7a811b6

Quite different to current version(s)

feat(DRAFT): Implement some of ArrowGroupBy

6d3c0a9

feat(DRAFT): Fill out more of GroupBy.agg

e71d092

Merge branch 'oh-nodes' into expr-ir/group-by

9179ec3

dangotbanned added the enhancement New feature or request label Sep 18, 2025

dangotbanned mentioned this pull request Sep 18, 2025

feat(RFC): A richer Expr IR #2572

Draft

60 tasks

fix: avoid typing_extensions import

8aaf9a9

oops https://github.com/narwhals-dev/narwhals/actions/runs/17838467107/job/50721552166?pr=3143

dangotbanned added the internal label Sep 18, 2025

dangotbanned added 16 commits September 18, 2025 20:40

refactor: Move ArrowGroupBy

d9b918f

Gonna need space for the mini translator

feat(DRAFT): Simple cases working?

767261c

Borrowing some ideas from #2528, #2680

feat(expr-ir): Add missing Expr.len

648d5d9

woops Making it a separate node rather than having a flag https://github.com/pola-rs/polars/blob/cdd247aaba8db3332be0bd031e0f31bc3fc33f77/crates/polars-plan/src/dsl/mod.rs#L872-L889

feat(expr-ir): Support nw.len()

2682b10

https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/narwhals/_arrow/group_by.py#L132-L141

feat(expr-ir): support auto-implode

e1c3145

#2660 (comment)

feat(DRAFT): Support nw.col("a").unique() in group_by

5d36607

`pyarrow` has the same behavior as `polars`

test: Port over tests/frame/group_by_test

1aa2464

Wasn't expecting so much to be working already 🥳 🥳 🥳

cov

45a816f

https://github.com/narwhals-dev/narwhals/actions/runs/17869832358/job/50820920705?pr=3143

chore: Update todo

8f2ad50

chore: Add todo for drop_null_keys=True

4650456

fix: Avoid shadowed output aggregation names

ce18f51

feat(expr-ir): Rewrite, fix ordered aggregations

16148b2

test: Port over first, last group_by tests

ce86f8f

From #2528 https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/tests/frame/group_by_test.py#L686-L781

test: Add failing drop_null_keys, __iter__ tests

581e511

feat(expr-ir): Support group_by(drop_null_keys=True)

4b77500

`ArrowDataFrame.drop_nulls` is shorter and waaaaaaay more efficient than `main`

dangotbanned added 6 commits September 22, 2025 16:21

feat: Add temp column name utils

8b622d4

refactor: Replace temp naming stuff

cad1a47

test: Add Expr.unique group_by tests

57c3e6d

fix: Use operator.or_ instead of pyarrow.compute.or_

87cd4a8

Didn't show up as an issue until trying to drop multiple keys. Surprised pyarrow doesn't support this?

test: Steal some of the polars test suite 😉

d49bcce

test: df.group_by(**named_by)

9d72311

dangotbanned added the tests label Sep 23, 2025

dangotbanned added 2 commits September 23, 2025 13:23

revert: Don't introduce unused type var

479eee6

chore: Remove completed todo

4391a6f

dangotbanned commented Sep 23, 2025

narwhals/_plan/_expansion.py (outdated)

dangotbanned commented Sep 23, 2025

docs: Trim Schema.merge

668db86

Added these rules before I realised it described `dict`

dangotbanned commented Sep 23, 2025

narwhals/_plan/group_by.py (outdated)

dangotbanned added 5 commits September 23, 2025 14:15

chore: Clean up unused in arrow.group_by

ad4babd

test: Add test_group_by_exclude_keys

a940e05

refactor: Tweak prepare_excluded

3a0617d

#3143 (comment)

refactor: Refining schema projections

28601df

refactor: Clean up resolve_group_by a bit

6122245

Still need to decide what to do with named tuple

dangotbanned mentioned this pull request Sep 25, 2025

Allow any length-preserving Expr in group_by(*keys) #3151

Open

dangotbanned added 10 commits September 25, 2025 15:02

perf: Skip synthesized FrozenSchema.__init__

8d1220d

feat: Accept more in freeze_schema, IntoFrozenSchema, `prepare_pr…

6dcfa4f

…ojection` Can accommodate:
- any narwhals/compliant-level frame
- any reasonable transformation on a schema
- the result of `freeze_schema` can be passed *back* into itself for free

refactor(DRAFT): Add Grouper/Resolver concepts

d7cf2d6

- Currently just replaces the narwhals-level bits
- Next part is making use of them for compliant-level

refactor: Move loads of stuff up from arrow

3d0670b

😠😠😠

a234cc7

https://github.com/narwhals-dev/narwhals/actions/runs/18021842115/job/51280865880

refactor: Define group_by_agg

9d17006

Tricky to do this in a way that doesn't end up duplicating lots of logic/layers

feat(DRAFT): Almost direct port of ArrowGroupBy.__iter__

8eb5db0

- Enough to pass the remaining tests
- Hoping to reduce the complexity in `__iter__` next

test: Add (failing) test_group_by_expr_iter

890732e

Bug isn't present on `main`, probably need to redo the resolve stuff to fix

chore: Tidy up some comments/notes/docs

5b1fa00
