
fix: Ensure Polars wrapped in Narwhals can be passed to Joblib #2451


Merged · 8 commits merged on Apr 29, 2025

Conversation

MarcoGorelli (Member)

closes #2450

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Comment on lines 73 to 103
# DataFrame methods where PolarsDataFrame just defers to Polars.DataFrame
# directly.
INHERITED_METHODS = frozenset(
    [
        "clone",
        "drop_nulls",
        "estimated_size",
        "explode",
        "filter",
        "gather_every",
        "head",
        "is_unique",
        "item",
        "iter_rows",
        "join_asof",
        "rename",
        "row",
        "rows",
        "sample",
        "select",
        "sort",
        "tail",
        "to_arrow",
        "to_pandas",
        "unique",
        "with_columns",
        "write_csv",
        "write_parquet",
    ]
)

@dangotbanned (Member) · Apr 29, 2025

Thanks @MarcoGorelli, I've got two suggestions.

  1. Wouldn't we also need this for all the other Polars* classes that use __getattr__ as a proxy?
    I'm thinking a search for : Method[ would highlight which ones - but also the (Series|Expr)*Namespace classes
  2. If we're defining all the cases up front, have you thought about using a descriptor instead?
    __getattr__ makes sense at the moment because we're theoretically allowing things outside of the narwhals API to pass through.
    In practice, I understand that might not be possible from the user-side - but it could mean we can just rely on python to raise the AttributeError instead of checking and raising ourselves πŸ€”

MarcoGorelli (Member, Author)

Thanks @dangotbanned !

  1. I don't think this is a concern, as users work with DataFrame / LazyFrame / Series structures. Nobody would make a function which accepts series.dt as an input :)
  2. I'm not familiar with descriptors, but open to suggestions

@dangotbanned (Member) · Apr 29, 2025

> Thanks @dangotbanned !
>
> 1. I don't think this is a concern, as users work with DataFrame / LazyFrame / Series structures. Nobody would make a function which accepts series.dt as an input :)

Sounds good 👍

> 2. I'm not familiar with descriptors, but open to suggestions

If you're looking to include this PR in a bugfix - we can punt this to another issue?

Some useful homework though πŸ˜‰

In short, I think we can reduce the performance overhead of always looking up (and failing) in order to reach __getattr__ - since you've defined the full set of methods that are permitted.
So the idea would be to move a cost that is currently paid per-instance and per-method into one that is per-type for the fixed set of methods 😏
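To make the per-instance vs per-type distinction concrete, here is a minimal sketch (the Native, ViaGetattr, and ViaClassAttr names are hypothetical stand-ins, not narwhals code): with __getattr__, every access to a proxied method first fails the normal instance-then-MRO lookup before the fallback runs, whereas installing the method on the class once means the lookup succeeds directly in the type's __dict__.

```python
class Native:
    """Hypothetical stand-in for the wrapped polars object."""
    def __init__(self, data):
        self.data = data

    def filter(self, pred):
        return [x for x in self.data if pred(x)]


class ViaGetattr:
    """Cost paid per attribute access: normal lookup must fail first."""
    def __init__(self, native):
        self._native = native

    def __getattr__(self, name):  # only runs after __getattribute__ fails
        return getattr(self._native, name)


class ViaClassAttr:
    """Cost paid once per type: the method lives in the class __dict__."""
    def __init__(self, native):
        self._native = native


def _filter(self, pred):
    return self._native.filter(pred)


ViaClassAttr.filter = _filter  # installed once, at class-definition time

print(ViaGetattr(Native([1, 2, 3])).filter(lambda x: x > 1))    # [2, 3]
print(ViaClassAttr(Native([1, 2, 3])).filter(lambda x: x > 1))  # [2, 3]
```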

Pinging @camriddell to check if what I'm saying seems legit πŸ˜„

MarcoGorelli (Member, Author)

sure, thanks!

camriddell (Member)

After looking at this, might I propose using a decorator to dynamically add these methods to the class, doing away with the dynamic lookup/attribute fetching via __getattr__ altogether?

The pattern here typically looks like

def define_attributes(attributes):
    def decorator(cls):
        for attr in attributes:
            setattr(cls, attr, ...) # insert dynamically defined function here
        return cls
    return decorator
         
@define_attributes(['min', 'max'])
class T:
    pass

t = T()
print(f'{t.min = }') # t.min = Ellipsis
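Filling in the Ellipsis, a forwarding version of that decorator might look like the following sketch (the _native attribute name and the example classes are assumptions, not the actual narwhals implementation). Note the _name=attr default argument, which sidesteps the classic late-binding closure bug inside the loop:

```python
def forward_to_native(attributes):
    """Install one forwarding method per name onto the decorated class."""
    def decorator(cls):
        for attr in attributes:
            def method(self, *args, _name=attr, **kwargs):
                # delegate to the same-named method on the wrapped object;
                # `_name=attr` binds the loop variable at definition time
                return getattr(self._native, _name)(*args, **kwargs)
            method.__name__ = attr
            setattr(cls, attr, method)
        return cls
    return decorator


@forward_to_native(["upper", "lower"])
class WrappedText:
    def __init__(self, native):
        self._native = native


print(WrappedText("Hello").upper())  # HELLO
print(WrappedText("Hello").lower())  # hello
```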

Whereas the descriptor pattern may look like:

class PassThrough:
    def __set_name__(self, owner, name):
        self._name = name
        return name

    def __get__(self, instance, owner):
        # func = getattr(..., self._name) # get the underlying function

        if instance is None:
            return ... # special return for class look ups
        return ...     # return for instance look ups


class T:
    min = PassThrough()

t = T()
print(f'{t.min = }')

The main tradeoffs between these two:

  • The decorator pattern is fairly declarative (passing attribute names as a list[str]), and the logic may be easier to follow
  • The descriptor pattern is very declarative, but repetitive and a bit "magic"

The nice thing about either approach is that it enhances discoverability (finding available attributes via dir [without overriding __dir__]).


@dangotbanned, as you pointed out, __getattr__ is invoked as a final effort when __getattribute__ fails; note that __getattribute__ will check the instance __dict__, then the class __dict__, then the __dict__s of the rest of the classes along the MRO (this was off the top of my head, so may be slightly inaccurate).
Though I am not concerned about speed here: the difference between calling through __getattr__ vs not will likely be a couple hundred nanoseconds, which is negligible compared to the computation time.

If performance is a consideration, we could probably use the defined method strings as a whitelist in __getattribute__ (as opposed to a blacklist in __getattr__).

class T:
    def __getattribute__(self, attribute):
        if attribute in INHERITED_METHODS:
            return ... # some dynamic func
        return object.__getattribute__(self, attribute) # pass the rest along
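One caveat worth noting with that route: __getattribute__ intercepts every attribute access, including the wrapper's own internal state, so the body has to reach internals via object.__getattribute__. A self-contained sketch under those assumptions (the names are illustrative, not narwhals code):

```python
# Hypothetical whitelist of method names to forward to the wrapped object.
INHERITED_METHODS = frozenset({"upper", "lower"})


class Wrapped:
    def __init__(self, native):
        self._native = native

    def __getattribute__(self, attribute):
        if attribute in INHERITED_METHODS:
            # a plain `self._native` here would re-enter this method;
            # object.__getattribute__ bypasses our override
            native = object.__getattribute__(self, "_native")
            return getattr(native, attribute)
        return object.__getattribute__(self, attribute)  # pass the rest along


print(Wrapped("abc").upper())  # ABC
```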

@dangotbanned (Member) · May 1, 2025

Thanks for accepting the summons and giving such a detailed overview @camriddell πŸ˜„

For a language with a guiding principle of

> There should be one-- and preferably only one --obvious way to do it.

we certainly have a few options!

Performance

I'm planning to spin this off into an issue, as I'd like to figure out a good set of benchmarks first and then try out the various solutions.

They seem to be:

  • __getattr__ (current)
  • Decorator
  • Descriptor
  • __getattribute__

> Though I am not concerned about speed here, the difference between calling through __getattr__ vs not will likely be a couple hundred nanoseconds? This is negligible compared to the computation time.

I agree that the overhead of a single call is likely negligible. I'm more concerned with it being overhead that is added across so much of the core API.
If we take PolarsDataFrame as an example, it has 23 methods that currently hit the __getattr__ path.

I'd argue that at least one of these 5 methods (particularly the first 3) is likely to appear in most queries.
Whatever cost there is (if any) can very quickly add up once you throw method chaining into the mix:

    filter: Method[Self]
    select: Method[Self]
    with_columns: Method[Self]
    sort: Method[Self]
    rename: Method[Self]

Note

To be clear, I'm saying the overhead seems to be an unknown that I'd like to quantify, since it seems reasonable to me that there could be some.
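A rough shape for such a benchmark might be the following timeit sketch (the classes are hypothetical stand-ins for the narwhals ones, and absolute numbers will vary by machine, hence no expected output):

```python
import timeit


class Native:
    def select(self):
        return "selected"


class ViaGetattr:
    def __init__(self):
        self._native = Native()

    def __getattr__(self, name):  # fallback path: normal lookup fails first
        return getattr(self._native, name)


class ViaClassMethod:
    def __init__(self):
        self._native = Native()

    def select(self):  # found directly in the class __dict__
        return self._native.select()


for label, obj in [("__getattr__", ViaGetattr()), ("class method", ViaClassMethod())]:
    # the lambda re-does the attribute lookup on every call, which is
    # exactly the cost being measured
    elapsed = timeit.timeit(lambda: obj.select(), number=100_000)
    print(f"{label}: {elapsed:.3f}s")
```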

Decorator vs Descriptor

> The main tradeoffs between these two:
>
> • The decorator pattern is fairly declarative (passing attribute names as a list[str]), and the logic may be easier to follow
> • The descriptor pattern is very declarative, but repetitive and a bit "magic"
>
> The nice thing about either approach is that it enhances discoverability (finding available attributes via dir [without overriding __dir__]).

Fun fact: in altair we use a decorator to add multiple descriptors!
Although I wouldn't recommend it here πŸ˜…

altair property setter methods

This all pre-dates when I started working on the project, but I did manage to simplify it somewhat recently in (vega/altair#3659)

https://github.com/vega/altair/blob/ea3a6e206b31407714e3619458b3cb9d9c4d7e92/altair/vegalite/v5/schema/channels.py#L269-L271

https://github.com/vega/altair/blob/ea3a6e206b31407714e3619458b3cb9d9c4d7e92/altair/vegalite/v5/schema/channels.py#L493-L749

https://github.com/vega/altair/blob/ea3a6e206b31407714e3619458b3cb9d9c4d7e92/altair/utils/schemapi.py#L1677-L1682

https://github.com/vega/altair/blob/ea3a6e206b31407714e3619458b3cb9d9c4d7e92/altair/utils/schemapi.py#L1628-L1674

The reason I was so quick to jump to using a descriptor for this case though was that for typing, we already need to add a lot of repetitive boilerplate like this:

Typing boilerplate

Method: TypeAlias = "Callable[..., R]"
"""Generic alias representing all methods implemented via `__getattr__`.
Where `R` is the return type.
"""

class PolarsDataFrame:
    clone: Method[Self]
    collect: Method[CompliantDataFrame[Any, Any, Any]]
    drop_nulls: Method[Self]
    estimated_size: Method[int | float]
    explode: Method[Self]
    filter: Method[Self]
    gather_every: Method[Self]
    item: Method[Any]
    iter_rows: Method[Iterator[tuple[Any, ...]] | Iterator[Mapping[str, Any]]]
    is_unique: Method[PolarsSeries]
    join_asof: Method[Self]
    rename: Method[Self]
    row: Method[tuple[Any, ...]]
    rows: Method[Sequence[tuple[Any, ...]] | Sequence[Mapping[str, Any]]]
    sample: Method[Self]
    select: Method[Self]
    sort: Method[Self]
    to_arrow: Method[pa.Table]
    to_pandas: Method[pd.DataFrame]
    unique: Method[Self]
    with_columns: Method[Self]
    # NOTE: `write_csv` requires an `@overload` for `str | None`
    # Can't do that here 😟
    write_csv: Method[Any]
    write_parquet: Method[None]

We'd still need that if we used a decorator, plus the new set defining the names

INHERITED_METHODS

# DataFrame methods where PolarsDataFrame just defers to Polars.DataFrame directly.
INHERITED_METHODS = frozenset(
    [
        "clone",
        "drop_nulls",
        "estimated_size",
        "explode",
        "filter",
        "gather_every",
        "head",
        "is_unique",
        "item",
        "iter_rows",
        "join_asof",
        "rename",
        "row",
        "rows",
        "sample",
        "select",
        "sort",
        "tail",
        "to_arrow",
        "to_pandas",
        "unique",
        "with_columns",
        "write_csv",
        "write_parquet",
    ]
)

But if we were to use a (generic) descriptor, I think we'd be able to write something like this, which does the job of both 🤯:

class PolarsDataFrame:
    clone = Proxy["Self"]()
    collect = Proxy["CompliantDataFrame[Any, Any, Any]"]()
    drop_nulls = Proxy["Self"]()
    estimated_size = Proxy["int | float"]()
    explode = Proxy["Self"]()
    filter = Proxy["Self"]()

Note

Using string forward references here, since they were previously not in a runtime scope
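A runnable sketch of what such a generic Proxy might look like (very much an assumption about the design, shown here on a toy wrapped string rather than a polars frame; the _native attribute name is also an assumption):

```python
from typing import Any, Callable, Generic, TypeVar

R = TypeVar("R")


class Proxy(Generic[R]):
    """Generic forwarding descriptor: the type parameter documents the
    return type for static checkers, while __get__ forwards at runtime."""

    def __set_name__(self, owner: type, name: str) -> None:
        self._name = name

    def __get__(self, instance: Any, owner: type) -> Callable[..., R]:
        if instance is None:
            return self  # type: ignore[return-value]  # class-level access
        # `_native` is an assumed attribute holding the wrapped object
        return getattr(instance._native, self._name)


class WrappedStr:
    upper = Proxy[str]()
    lower = Proxy[str]()

    def __init__(self, native: str) -> None:
        self._native = native


print(WrappedStr("Hi").upper())  # HI
```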

@MarcoGorelli MarcoGorelli merged commit 8d17c46 into narwhals-dev:main Apr 29, 2025
30 checks passed

Successfully merging this pull request may close these issues.

[Bug]: AttributeError when using narwhalified Polars DataFrame with Joblib Parallel Processing