-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(rust!): Refactor AnyValue supertype logic #15280
Conversation
8b337c0
to
e721659
Compare
{ | ||
let mut supertype = DataType::Null; | ||
let mut dtypes = PlHashSet::<DataType>::new(); | ||
for av in values { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for context. The indexset perf might be better, as you keep a smaller loop. Now you have a branch on every iteration of N
and we expect N
to be much bigger than K
, where N
is no. of elements and K
is no. of unique datatypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize this type of loop would be slower - I figured the collect
to the IndexSet would have to do something similar (check whether the dtype hash is already in the set before adding it). Any recommendations on where I can read up on this kind of stuff?
The benefit of this implementation is that it can early exit if there is no supertype (before going through all N
values), but I guess that's relatively rare so if it makes the loop slower it's not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indexset is a different way of storing the values (inline in a vec, instead of the hash slots), the hash slots will store indexes to that vec. Can elaborate Thursday. ^^
but I guess that's relatively rare so if it makes the loop slower
I am not saying that it does :P. It is tough to tell. Though often errors in tight loops are not good. But I think the runtime is in the hashtable more than the branches here.
@@ -419,3 +420,12 @@ pub(crate) fn py_object_to_any_value(ob: &PyAny, strict: bool) -> PyResult<AnyVa | |||
}) | |||
}) | |||
} | |||
|
|||
fn any_values_to_supertype_and_n_types(values: &[AnyValue]) -> PolarsResult<(DataType, usize)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this generic? Then fn any_values_to_supertype
can dispatch to this one and just get the DataType
out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I solved it slightly differently (extract the logic and re-use).
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #15280 +/- ##
==========================================
- Coverage 81.32% 81.32% -0.01%
==========================================
Files 1359 1360 +1
Lines 176072 176070 -2
Branches 2526 2526
==========================================
- Hits 143193 143188 -5
- Misses 32396 32399 +3
Partials 483 483 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean second pass. 👍
Changes
any_values_to_dtype
toany_values_to_supertype_and_n_dtypes
and move it to thepolars_core::utils
module. (this is the breaking part)dtypes_to_supertype
for inferring the supertype of multiple dtypesany_values_to_supertype
for inferring the supertype of a collection of AnyValues