Skip to content

Conversation

@FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Jul 16, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@FBruzzesi FBruzzesi added internal pandas-like Issue is related to pandas-like backends labels Jul 16, 2025
@FBruzzesi FBruzzesi added pyarrow Issue is related to pyarrow backend polars Issue is related to polars backend labels Jul 17, 2025
@FBruzzesi FBruzzesi changed the title refactor(WIP): Simplify pandas-like Series.hist refactor: Simplify pandas-like Series.hist Jul 17, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review July 17, 2025 21:30
@FBruzzesi
Copy link
Member Author

Ok I am tagging it as ready for review, but I am cheating on type hints.

Will drop some inline comments for the daily struggle with them 😭

@FBruzzesi FBruzzesi changed the title refactor: Simplify pandas-like Series.hist refactor: Simplify compliant Series.hist Jul 17, 2025
Comment on lines 1032 to 1037
elif pc.sum(
pc.invert(pc.or_(pc.is_nan(self.native), pc.is_null(self.native))).cast(
pa.uint8()
pa.uint64()
),
min_count=0,
) == pa.scalar(0, type=pa.uint64()):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I tried to put pc.sum(...) into its own method, but mypy kept complaining that the return type should have been pa.Scalar[...] where ... was Uint8Type which does not exist? at least I could not find it in the docs, nor in pa namespace, not pa.dtypes
  2. A different cheat would have been to return after moving it to a python int via .as_py()
  3. The point above would also make the comparison simpler - I don't know how it was not complaining before, but now comparing uint8 with uint64 was raising another issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to put pc.sum(...) into its own method, but mypy kept complaining that the return type should have been pa.Scalar[...] where ... was Uint8Type which does not exist? at least I could not find it in the docs, nor in pa namespace, not pa.dtypes

This was corrected in a more recent version of pyarrow-stubs (zen-xu/pyarrow-stubs#230)

I've been putting off upgrading our pinned version, as:

bin_right: Sequence[int | float | pa.Scalar[Any]] | np.typing.ArrayLike

data_count = pc.sum(
data: dict[str, Any]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know where to start with this. My best guess was dict[str, Sequence[int | float] | _1DArray | ArrayAny] but still not a happy ending from _hist_from_bins below

upper += 0.5

width = (upper - lower) / bin_count
bins = pl.int_range(0, bin_count + 1, eager=True) * width + lower # type: ignore[assignment]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous version was casting to list, which is a bit of a missed opportunity to keep everything native

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about reimplementing the whole of the polars one using expressions?

I think this is the most complicated Series method and I'd guess could benefit the most performance-wise 🫣

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really! As the main issue with polars is "just" managing all the different versions before and after the call to .hist(...), I didn't invest much time on it 🙈

"""Prepare bins based on backend version compatibility.
polars <1.15 does not adjust the bins when they have equivalent min/max
polars <1.5 with bin_count=...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused by this line comment, but I kept it as it was

@dangotbanned dangotbanned self-requested a review July 18, 2025 09:17
- If this were only used in the `pandas` case, it is just more code
- But we can reuse large chunks of it and extract others into new public api
@dangotbanned
Copy link
Member

dangotbanned commented Jul 21, 2025

Hey @FBruzzesi 👋

I've been experimenting with PandasLikeSeries.hist and I think there could be something to this?

refactor(suggestion): Try out generalizing everything

So we should be able to add:

  • nw.linear_space
    • Both pandas and pyarrow are already using numpy for this
  • nw.Series.cut
    • Assuming that pyarrow is utilizing numpy to do this currently
  • nw.zeros (and by extension, nw.repeat)
    • pandas uses numpy
    • pyarrow can do both natively

And there's a lot in there that I think could either move up into:

  • nw.Series.hist (for the empty cases)
  • Some shared compliant class (for the general "algorithm" 😄)

But most importantly, this is only a suggestion that came out of finally understanding hist 😍

@FBruzzesi
Copy link
Member Author

Hey @dangotbanned thanks for suggesting #2839 (comment)

From a first look, I really like it. I have a few open questions also considering the Discord exchange we had about hist:

  • Do you think this is better/more extensible if we were to eventually include also Expr.hist? Consider that the return type is a struct in such case.
  • Do you feel like that the current lack of features, such as nw.linear_space and others, are a blocker?
  • On the other hand, if we are not going to develop Expr.hist, would this be an overkill?
  • How should we decide what to keep at the narwhals/compliant level and what to move to this class?

@dangotbanned
Copy link
Member

dangotbanned commented Jul 23, 2025

#2839 (comment)

Thanks for all the great questions 😍!

  1. Do you think this is better/more extensible if we were to eventually include also Expr.hist? Consider that the return type is a struct in such case.

My eyes are about fall out - but for just this part - I was thinking the shortest path to Expr.hist would be something like this?

At least for pyarrow it is fairly easy, I'd guess pandas would need to defer to the same thing as well?

import polars as pl

import narwhals as nw

data = {"a": [1, 3, 8, 8, 2, 1, 3]}
bins: list[float] = [1, 2, 3]
df = pl.DataFrame(data)
nw_ser = nw.from_native(df).get_column("a")
nw_hist = nw_ser.hist(bins, include_breakpoint=True)

>>> nw_hist
┌──────────────────────┐
|  Narwhals DataFrame  |
|----------------------|
|shape: (2, 2)         |
|┌────────────┬───────┐|
|breakpointcount|
|------|
|f64u32|
|╞════════════╪═══════╡|
|2.03|
|3.02|
|└────────────┴───────┘|
└──────────────────────┘


pl_hist_expr = df.select(pl.col("a").hist(bins, include_breakpoint=True))

nw_pl_hist_struct = nw.from_native(
    nw_hist.to_polars().to_struct("a"), series_only=True
).to_frame()

nw_pa_hist_struct = (
    nw.from_native(nw_hist.to_arrow().to_struct_array(), series_only=True)
    .alias("a")
    .to_frame()
)

>>> pl_hist_expr, nw_pl_hist_struct, nw_pa_hist_struct
(shape: (2, 1)
 ┌───────────┐
 │ a         │
 │ ---       │
 │ struct[2] │
 ╞═══════════╡
 │ {2.0,3}   │
 │ {3.0,2}   │
 └───────────┘,
 ┌──────────────────┐
 |Narwhals DataFrame|
 |------------------|
 |  shape: (2, 1)   |
 |  ┌───────────┐   |
 |a|
 |---|
 |struct[2] │   |
 |  ╞═══════════╡   |
 |  │ {2.0,3}   │   |
 |  │ {3.0,2}   │   |
 |  └───────────┘   |
 └──────────────────┘,
 ┌────────────────────────────────────────────┐
 |             Narwhals DataFrame             |
 |--------------------------------------------|
 |pyarrow.Table                               |
 |a: struct<breakpoint: double, count: uint32>|
 |  child 0, breakpoint: double               |
 |  child 1, count: uint32                    |
 |----                                        |
 |a: [                                        |
 |  -- is_valid: all not null                 |
 |  -- child 0 type: double                   |
 |[2,3]                                       |
 |  -- child 1 type: uint32                   |
 |[3,2]]                                      |
 └────────────────────────────────────────────┘)

@dangotbanned
Copy link
Member

Hey @dangotbanned thanks for suggesting #2839 (comment)

From a first look, I really like it. I have a few open questions also considering the Discord exchange we had about hist:

Thanks @FBruzzesi!

Okay easiest one out of the way first:

  • Do you feel like that the current lack of features, such as nw.linear_space and others, are a blocker?

To merging this PR? Absolutely not! 😄

I have a few thoughts on the polars bits that look suspiciously pandas-esque, but if you're happy with the PR - then I think we've got plenty of improvements to merge already!


If you meant, are things like nw.linear_space a blocker for adopting narwhals?
Probably not, but I've been noticing np.linspace being used downstream (code search)

I think it would be a great candidate for exposing in narwhals to avoid the numpy dependency when possible, and provide the better polars performance that's on table 🙂

@dangotbanned
Copy link
Member

dangotbanned commented Jul 24, 2025

Expr.hist

  • Do you think this is better/more extensible if we were to eventually include also Expr.hist? Consider that the return type is a struct in such case.
  • On the other hand, if we are not going to develop Expr.hist, would this be an overkill?

I guess I'm viewing adding Expr.hist as unrelated to these changes, so I'll explain my thought process a bit more 😅

The path I see to that is:

pandas, pyarrow

They can do something like (#2839 (comment)), which leaves their Series implementation unchanged.

I'm cutting a lot of corners to get here (including the public API), but this is a quick-and-dirty one for ArrowExpr

Note

There are no changes needed in either ArrowExpr or ArrowSeries

Arrow version diff

diff --git a/narwhals/_arrow/dataframe.py b/narwhals/_arrow/dataframe.py
index 120d11f3f..a160d4311 100644
--- a/narwhals/_arrow/dataframe.py
+++ b/narwhals/_arrow/dataframe.py
@@ -466,6 +466,11 @@ class ArrowDataFrame(
             return {ser.name: ser for ser in it}
         return {ser.name: ser.to_list() for ser in it}
 
+    def to_struct(self, name: str = "") -> ArrowSeries:
+        return ArrowSeries.from_native(
+            self.native.to_struct_array(), context=self, name=name
+        )
+
     def with_row_index(self, name: str, order_by: Sequence[str] | None) -> Self:
         plx = self.__narwhals_namespace__()
         if order_by is None:
diff --git a/narwhals/_compliant/dataframe.py b/narwhals/_compliant/dataframe.py
index d93b11b71..583871514 100644
--- a/narwhals/_compliant/dataframe.py
+++ b/narwhals/_compliant/dataframe.py
@@ -486,3 +486,5 @@ class EagerDataFrame(
                 assert_never(rows)
 
         return compliant
+
+    def to_struct(self, name: str = "") -> EagerSeriesT: ...
diff --git a/narwhals/_compliant/expr.py b/narwhals/_compliant/expr.py
index 8e6e8b5fb..0c78def12 100644
--- a/narwhals/_compliant/expr.py
+++ b/narwhals/_compliant/expr.py
@@ -511,6 +511,28 @@ class EagerExpr(
     def cast(self, dtype: IntoDType) -> Self:
         return self._reuse_series("cast", dtype=dtype)
 
+    def hist_from_bins(self, bins: list[float], *, include_breakpoint: bool) -> Self:
+        method_name = "hist_from_bins"
+
+        def fn(df: EagerDataFrameT) -> Sequence[EagerSeriesT]:
+            series_in = self(df)
+            if len(series_in) > 1:
+                raise NotImplementedError
+            df_result = series_in[0].hist_from_bins(
+                bins=bins, include_breakpoint=include_breakpoint
+            )
+            aliases = self._evaluate_aliases(df)
+            return [df_result.to_struct(aliases[0])]
+
+        return self._from_callable(
+            fn,
+            depth=self._depth + 1,
+            function_name=f"{self._function_name}->{method_name}",
+            evaluate_output_names=self._evaluate_output_names,
+            alias_output_names=self._alias_output_names,
+            context=self,
+        )
+
     def __eq__(self, other: Self | Any) -> Self:  # type: ignore[override]
         return self._reuse_series("__eq__", other=other)
 
diff --git a/narwhals/_compliant/series.py b/narwhals/_compliant/series.py
index 22f2f5f5a..4324fb503 100644
--- a/narwhals/_compliant/series.py
+++ b/narwhals/_compliant/series.py
@@ -34,7 +34,7 @@ if TYPE_CHECKING:
     import pyarrow as pa
     from typing_extensions import Self
 
-    from narwhals._compliant.dataframe import CompliantDataFrame
+    from narwhals._compliant.dataframe import CompliantDataFrame, EagerDataFrame
     from narwhals._compliant.expr import CompliantExpr, EagerExpr
     from narwhals._compliant.namespace import CompliantNamespace, EagerNamespace
     from narwhals._utils import Implementation, Version, _LimitedContext
@@ -350,6 +350,13 @@ class EagerSeries(CompliantSeries[NativeSeriesT], Protocol[NativeSeriesT]):
         else:
             assert_never(item)
 
+    @unstable
+    def hist_from_bins(
+        self, bins: list[float], *, include_breakpoint: bool
+    ) -> EagerDataFrame[Self, Any, Any, NativeSeriesT]:
+        """`Series.hist(bins=..., bin_count=None)`."""
+        ...
+
     @property
     def str(self) -> EagerSeriesStringNamespace[Self, NativeSeriesT]: ...
     @property
diff --git a/narwhals/expr.py b/narwhals/expr.py
index 595fde8c1..0f278ffaf 100644
--- a/narwhals/expr.py
+++ b/narwhals/expr.py
@@ -2563,6 +2563,15 @@ class Expr:
         """
         return self._with_elementwise(lambda plx: self._to_compliant_expr(plx).sqrt())
 
+    def hist_from_bins(
+        self, bins: list[float], *, include_breakpoint: bool = True
+    ) -> Self:
+        return self._with_orderable_filtration(
+            lambda plx: self._to_compliant_expr(plx).hist_from_bins(  # type: ignore[attr-defined]
+                bins=bins, include_breakpoint=include_breakpoint
+            )
+        )
+
     @property
     def str(self) -> ExprStringNamespace[Self]:
         return ExprStringNamespace(self)

And that looks like:

import narwhals as nw
import pyarrow as pa

data: dict[str, Any] = {"a": [1, 3, 8, 8, 2, 1, 3]}
bins: list[float] = [1, 2, 3]

df = pa.table(data)
nw_df = nw.from_native(df)
>>> nw_df.select(nw.col("a").hist_from_bins(bins)) # ignore the name for now!
┌──────────────────────────────────────────┐
|            Narwhals DataFrame            |
|------------------------------------------|
|pyarrow.Table                             |
|a: struct<breakpoint: int64, count: int64>|
|  child 0, breakpoint: int64              |
|  child 1, count: int64                   |
|----                                      |
|a: [                                      |
|  -- is_valid: all not null               |
|  -- child 0 type: int64                  |
|[2,3]                                     |
|  -- child 1 type: int64                  |
|[3,2]]                                    |
└──────────────────────────────────────────┘

polars

For this, I would rip out the version branching from PolarsSeries.hist, and use it for PolarsExpr.hist.

PolarsSeries.hist would work in the opposite direction to the eager-only ones.
Pretty much the same idea as (#2598 (comment)). We should be doing more of that anyway 🙂

@dangotbanned
Copy link
Member

How should we decide what to keep at the narwhals/compliant level and what to move to this class?

(refactor(suggestion): Try out generalizing everything) collects everything into that class to make the diff easier 😅

I think these parts are the same across backends, so to narwhals-level

def with_bins(self, bins: list[float], /) -> Self:
if len(bins) <= 1:
self._data = self.data_empty()
elif self.is_empty_series():

def with_bin_count(self, bin_count: int, /) -> Self:
if bin_count == 0:
self._data = self.data_empty()
elif self.is_empty_series():

Anywhere I left a note like this, could become that method:

# NOTE: Based on `pl.Expr.cut`
def _cut(

# NOTE: Roughly `pl.linear_space`, but missing the `pd.Series` wrapping
def _linear_space(

Things like this are dependent on the above

# NOTE: *Could* be handled at narwhals-level, **iff** we add `nw.repeat`, `nw.linear_space`
# See https://github.com/narwhals-dev/narwhals/pull/2839#discussion_r2215630696
def series_empty(self, arg: int | list[float], /) -> _HistData:

I think a good example of splitting things across layers to implement is the work we mostly Marco did on slicing.

Its all about trying to spot the cases where we're repeating ourselves 😄

DataFrame.__getitem__

from narwhals.series import Series
msg = (
f"Unexpected type for `DataFrame.__getitem__`, got: {type(item)}.\n\n"
"Hints:\n"
"- use `df.item` to select a single item.\n"
"- Use `df[indices, :]` to select rows positionally.\n"
"- Use `df.filter(mask)` to filter rows based on a boolean mask."
)
if isinstance(item, tuple):
if len(item) > 2:
tuple_msg = (
"Tuples cannot be passed to DataFrame.__getitem__ directly.\n\n"
"Hint: instead of `df[indices]`, did you mean `df[indices, :]`?"
)
raise TypeError(tuple_msg)
rows = None if not item or is_slice_none(item[0]) else item[0]
columns = None if len(item) < 2 or is_slice_none(item[1]) else item[1]
if rows is None and columns is None:
return self
elif is_index_selector(item):
rows = item
columns = None
elif is_sequence_like(item) or isinstance(item, (slice, str)):
rows = None
columns = item
else:
raise TypeError(msg)
if isinstance(rows, str):
raise TypeError(msg)
compliant = self._compliant_frame
if isinstance(columns, (int, str)):
if isinstance(rows, int):
return self.item(rows, columns)
col_name = columns if isinstance(columns, str) else self.columns[columns]
series = self.get_column(col_name)
return series[rows] if rows is not None else series
if isinstance(rows, Series):
rows = rows._compliant_series
if isinstance(columns, Series):
columns = columns._compliant_series
if rows is None:
return self._with_compliant(compliant[:, columns])
if columns is None:
return self._with_compliant(compliant[rows, :])
return self._with_compliant(compliant[rows, columns])

EagerDataFrame.__getitem__

def _gather(self, rows: SizedMultiIndexSelector[NativeSeriesT]) -> Self: ...
def _gather_slice(self, rows: _SliceIndex | range) -> Self: ...
def _select_multi_index(
self, columns: SizedMultiIndexSelector[NativeSeriesT]
) -> Self: ...
def _select_multi_name(
self, columns: SizedMultiNameSelector[NativeSeriesT]
) -> Self: ...
def _select_slice_index(self, columns: _SliceIndex | range) -> Self: ...
def _select_slice_name(self, columns: _SliceName) -> Self: ...
def __getitem__( # noqa: C901, PLR0912
self,
item: tuple[
SingleIndexSelector | MultiIndexSelector[EagerSeriesT],
MultiColSelector[EagerSeriesT],
],
) -> Self:
rows, columns = item
compliant = self
if not is_slice_none(columns):
if isinstance(columns, Sized) and len(columns) == 0:
return compliant.select()
if is_index_selector(columns):
if is_slice_index(columns) or is_range(columns):
compliant = compliant._select_slice_index(columns)
elif is_compliant_series(columns):
compliant = self._select_multi_index(columns.native)
else:
compliant = compliant._select_multi_index(columns)
elif isinstance(columns, slice):
compliant = compliant._select_slice_name(columns)
elif is_compliant_series(columns):
compliant = self._select_multi_name(columns.native)
elif is_sequence_like(columns):
compliant = self._select_multi_name(columns)
else:
assert_never(columns)
if not is_slice_none(rows):
if isinstance(rows, int):
compliant = compliant._gather([rows])
elif isinstance(rows, (slice, range)):
compliant = compliant._gather_slice(rows)
elif is_compliant_series(rows):
compliant = compliant._gather(rows.native)
elif is_sized_multi_index_selector(rows):
compliant = compliant._gather(rows)
else:
assert_never(rows)
return compliant

Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks x7 @FBruzzesi!

I've got two very minor bits on the polars side.

As mentioned in (#2839 (comment)) - I'm happy for everything else to be follow-ups

count narwhal

dangotbanned added a commit that referenced this pull request Jul 25, 2025
I wanna move this back into `pyarrow` anyway

#2839 (comment)
* refactor: add compliant level parent class

* fix(typing): remove unused ignores

* fix(typing): Use `pa.Int64Array`

* fix(typing): `Generic` -> `Protocol`

#2882 (comment)

* fix(typing): Resolve most invariance issues

#2882 (comment)

* chore(typing): Ignore `linspace` for now

I wanna move this back into `pyarrow` anyway

#2839 (comment)

* docs(typing): Explain why `cast` bins

* move histdata into type-checking block

* chore(typing): `CompliantSeries` -> `EagerSeries`

* chore(typing): `Any` -> `EagerDataFrameAny`

* docs(typing): note constructor issue

this one gets me every time

* rm '_' prefix, stay native in _linear_space

---------

Co-authored-by: dangotbanned <125183946+dangotbanned@users.noreply.github.com>
Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing everything in (#2839 (review)) @FBruzzesi

hungry narwhal

@FBruzzesi FBruzzesi merged commit 79373f9 into main Jul 26, 2025
33 checks passed
@FBruzzesi FBruzzesi deleted the refactor/simplify-hist branch July 26, 2025 15:26
@dangotbanned dangotbanned mentioned this pull request Sep 10, 2025
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

internal pandas-like Issue is related to pandas-like backends polars Issue is related to polars backend pyarrow Issue is related to pyarrow backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simplify *Series.hist

4 participants