fix: Ensure Polars wrapped in Narwhals can be passed to Joblib #2451
Conversation
narwhals/_polars/dataframe.py (Outdated)
```python
# DataFrame methods where PolarsDataFrame just defers to Polars.DataFrame
# directly.
INHERITED_METHODS = frozenset(
    [
        "clone",
        "drop_nulls",
        "estimated_size",
        "explode",
        "filter",
        "gather_every",
        "head",
        "is_unique",
        "item",
        "iter_rows",
        "join_asof",
        "rename",
        "row",
        "rows",
        "sample",
        "select",
        "sort",
        "tail",
        "to_arrow",
        "to_pandas",
        "unique",
        "with_columns",
        "write_csv",
        "write_parquet",
    ]
)
```
Thanks @MarcoGorelli, I've got two suggestions.
- Wouldn't we also need this for all the other `Polars*` classes that use `__getattr__` as a proxy? I'm thinking a search for `Method[` would highlight which ones - but also the `(Series|Expr)*Namespace` classes
- If we're defining all the cases up front, have you thought about using a descriptor instead? `__getattr__` makes sense at the moment because we're theoretically allowing things outside of the `narwhals` API to pass through. In practice, I understand that might not be possible from the user-side - but it could mean we can just rely on Python to raise the `AttributeError` instead of checking and raising ourselves
Thanks @dangotbanned!
- I don't think this is a concern, as users work with `DataFrame` / `LazyFrame` / `Series` structures. Nobody would make a function which accepts `series.dt` as an input :)
- I'm not familiar with descriptors, but open to suggestions
> Thanks @dangotbanned!
>
> - I don't think this is a concern, as users work with `DataFrame` / `LazyFrame` / `Series` structures. Nobody would make a function which accepts `series.dt` as an input :)

Sounds good!

> - I'm not familiar with descriptors, but open to suggestions

If you're looking to include this PR in a bugfix - we can punt this to another issue? Some useful homework though!

In short, I think we can reduce the performance overhead of always looking up (and failing) in order to reach `__getattr__` - since you've defined the full set of methods that are permitted.

So the idea would be to move a cost that is currently paid per-instance and per-method into one that is per-type for the fixed set of methods.
Pinging @camriddell to check if what I'm saying seems legit!
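For context, the per-instance path being described looks roughly like the sketch below. This is a simplified illustration rather than the actual narwhals code; the `_native_frame` attribute and the forwarding closure are assumptions made for the example:

```python
class ProxyViaGetattr:
    """Illustrative wrapper: every proxied method misses normal lookup and
    falls back to __getattr__ on each call."""

    def __init__(self, native_frame):
        # `_native_frame` is an assumed attribute name for this sketch.
        self._native_frame = native_frame

    def __getattr__(self, name):
        # Reached only after lookup on the instance/class fails,
        # i.e. this cost is paid on every call of every proxied method.
        def call(*args, **kwargs):
            return getattr(self._native_frame, name)(*args, **kwargs)

        return call
```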
sure, thanks!
After looking at this, might I propose the use of a decorator to dynamically add these methods to the class, and then do away with the dynamic look-up/attribute fetching via `__getattr__` altogether?
The pattern here typically looks like
```python
def define_attributes(attributes):
    def decorator(cls):
        for attr in attributes:
            setattr(cls, attr, ...)  # insert dynamically defined function here
        return cls
    return decorator


@define_attributes(['min', 'max'])
class T:
    pass


t = T()
print(f'{t.min = }')  # t.min = Ellipsis
```
Whereas the descriptor pattern may look like:
```python
class PassThrough:
    def __set_name__(self, owner, name):
        self._name = name
        return name

    def __get__(self, instance, owner):
        # func = getattr(..., self._name)  # get the underlying function
        if instance is None:
            return ...  # special return for class look ups
        return ...  # return for instance look ups


class T:
    min = PassThrough()


t = T()
print(f'{t.min = }')
```
The main tradeoffs between these two:
- The decorator pattern is fairly declarative (passing attribute names as a `list[str]`) and the logic may be easier to follow
- The descriptor pattern is very declarative; but repetitive and a bit "magic"

The nice thing about either approach is that it enhances discoverability (finding available attributes via `dir` [without overriding `__dir__`]).
@dangotbanned, as you pointed out, `__getattr__` is invoked as a final effort when `__getattribute__` fails, noting that `__getattribute__` will check the instance `__dict__`, then the class `__dict__`, then the rest of the MRO classes' `__dict__`s (this was off the top of my head, so may be slightly inaccurate).

Though I am not concerned about speed here, the difference between calling through `__getattr__` vs not will likely be a couple hundred nanoseconds? This is negligible compared to the computation time.
If performance is a consideration, we could probably use the defined method strings as a whitelist in `__getattribute__` (as opposed to a blacklist in `__getattr__`).
```python
class T:
    def __getattribute__(self, attribute):
        if attribute in INHERITED_METHODS:
            return ...  # some dynamic func
        return object.__getattribute__(self, attribute)  # pass the rest along
```
Thanks for accepting the summons and giving such a detailed overview @camriddell!

For a language with a guiding principle of

> There should be one-- and preferably only one --obvious way to do it.

We certainly have a few options!
Performance
I'm planning to spin this off into an issue, as I'd like to figure out a good set of benchmarks first and then try out the various solutions.
They seem to be:
- `__getattr__` (current)
- Decorator
- Descriptor
- `__getattribute__`
> Though I am not concerned about speed here, the difference between calling through `__getattr__` vs not will likely be a couple hundred nanoseconds? This is negligible compared to the computation time.
I agree that the overhead for a single call is likely negligible. I'm more concerned with it being overhead that is added in so much of the core API.
If we take `PolarsDataFrame` as an example, it has 23 methods that currently hit the `__getattr__` path.
I'd argue that at least one of the 5 methods listed below (particularly the first 3) is likely to appear in most queries.
Whatever cost there is (if any) can very quickly add up when you throw method chaining into the mix:
From narwhals/narwhals/_polars/dataframe.py at b1c7e8e:
- Line 110: `filter: Method[Self]`
- Line 120: `select: Method[Self]`
- Line 125: `with_columns: Method[Self]`
- Line 121: `sort: Method[Self]`
- Line 116: `rename: Method[Self]`
Note: to be clear, I'm saying the overhead seems to be an unknown that I'd like to quantify, since it seems reasonable to me that there could be overhead.
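One way to quantify that unknown could be a micro-benchmark along these lines. This is a hedged sketch; the two toy wrappers are stand-ins rather than the real narwhals classes:

```python
import timeit


class Native:
    def head(self):
        return None


class ViaGetattr:
    """Lookup misses on the class and falls back to __getattr__ on every call."""

    def __init__(self, native):
        self._native = native

    def __getattr__(self, name):
        return getattr(self._native, name)


class ViaClassMethod:
    """Method is found directly on the class, no fallback."""

    def __init__(self, native):
        self._native = native

    def head(self):
        return self._native.head()


a, b = ViaGetattr(Native()), ViaClassMethod(Native())
print("__getattr__ :", timeit.timeit(lambda: a.head(), number=1_000_000))
print("class method:", timeit.timeit(lambda: b.head(), number=1_000_000))
```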
Decorator vs Descriptor
> The main tradeoffs between these two:
> - The decorator pattern is fairly declarative (passing attribute names as a `list[str]`) and the logic may be easier to follow
> - The descriptor pattern is very declarative; but repetitive and a bit "magic"
>
> The nice thing about either approach is that it enhances discoverability (finding available attributes via `dir` [without overriding `__dir__`]).
Fun fact: in `altair` we use a decorator to add multiple descriptors! Although I wouldn't recommend it here.

`altair` property setter methods: this all pre-dates when I started working on the project, but I did manage to simplify it somewhat recently in (vega/altair#3659).
The reason I was so quick to jump to using a descriptor for this case though was that for typing, we already need to add a lot of repetitive boilerplate like this:
Typing boilerplate
narwhals/narwhals/_polars/dataframe.py, lines 67 to 71 in b1c7e8e:
```python
Method: TypeAlias = "Callable[..., R]"
"""Generic alias representing all methods implemented via `__getattr__`.

Where `R` is the return type.
"""
```
narwhals/narwhals/_polars/dataframe.py, lines 104 to 129 in b1c7e8e:
```python
class PolarsDataFrame:
    clone: Method[Self]
    collect: Method[CompliantDataFrame[Any, Any, Any]]
    drop_nulls: Method[Self]
    estimated_size: Method[int | float]
    explode: Method[Self]
    filter: Method[Self]
    gather_every: Method[Self]
    item: Method[Any]
    iter_rows: Method[Iterator[tuple[Any, ...]] | Iterator[Mapping[str, Any]]]
    is_unique: Method[PolarsSeries]
    join_asof: Method[Self]
    rename: Method[Self]
    row: Method[tuple[Any, ...]]
    rows: Method[Sequence[tuple[Any, ...]] | Sequence[Mapping[str, Any]]]
    sample: Method[Self]
    select: Method[Self]
    sort: Method[Self]
    to_arrow: Method[pa.Table]
    to_pandas: Method[pd.DataFrame]
    unique: Method[Self]
    with_columns: Method[Self]
    # NOTE: `write_csv` requires an `@overload` for `str | None`
    # Can't do that here
    write_csv: Method[Any]
    write_parquet: Method[None]
```
We'd still need that if we used a decorator, plus the new set defining the names (`INHERITED_METHODS`):
narwhals/narwhals/_polars/dataframe.py, lines 73 to 101 in b1c7e8e:
```python
# DataFrame methods where PolarsDataFrame just defers to Polars.DataFrame directly.
INHERITED_METHODS = frozenset(
    [
        "clone",
        "drop_nulls",
        "estimated_size",
        "explode",
        "filter",
        "gather_every",
        "head",
        "is_unique",
        "item",
        "iter_rows",
        "join_asof",
        "rename",
        "row",
        "rows",
        "sample",
        "select",
        "sort",
        "tail",
        "to_arrow",
        "to_pandas",
        "unique",
        "with_columns",
        "write_csv",
        "write_parquet",
    ]
)
```
But if we were to use a (generic) descriptor, I think we'd be able to write something like this which does the job of both:
```python
class PolarsDataFrame:
    clone = Proxy["Self"]()
    collect = Proxy["CompliantDataFrame[Any, Any, Any]"]()
    drop_nulls = Proxy["Self"]()
    estimated_size = Proxy["int | float"]()
    explode = Proxy["Self"]()
    filter = Proxy["Self"]()
```
Note: using string forward references here, since they were previously not in a runtime scope.
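For reference, a minimal version of such a generic `Proxy` descriptor might look like the sketch below. The `_native_frame` attribute and the absence of result re-wrapping are assumptions for illustration, not the actual narwhals implementation:

```python
from __future__ import annotations

from typing import Any, Callable, Generic, TypeVar

R = TypeVar("R")


class Proxy(Generic[R]):
    """Descriptor that forwards a method lookup to the wrapped native object (sketch)."""

    def __set_name__(self, owner: type, name: str) -> None:
        self._name = name

    def __get__(self, instance: Any, owner: type | None = None) -> Callable[..., R] | Proxy[R]:
        if instance is None:
            # Class-level access (e.g. `PolarsDataFrame.clone`) returns the descriptor.
            return self
        # `_native_frame` is an assumed attribute name for this sketch; the real
        # classes would also need to re-wrap the returned value.
        return getattr(instance._native_frame, self._name)
```

This also keeps the `Proxy["Self"]()` spelling valid at runtime, which ties in with the note above about string forward references.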
closes #2450