-
Notifications
You must be signed in to change notification settings - Fork 685
perf(mutate): skip dereferencing cols preserved in .mutate #11458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- preserve order of fields in same way as existing implementation
|
sorry for clarity both accounts listed as committing here (hottwa, jc-5s) are me |
ibis/expr/types/relations.py
Outdated
| values = {**node.fields, **values} | ||
| return self.select(**values) | ||
| # kept fields from node.fields can skip dereferencing in .select() to improve performance | ||
| return self.select(_exprs_that_do_not_need_dereferencing=node.fields, **values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing a special reserved keyword argument we should rather duplicate the implementation of .select() which is actually fairly short. That way we have clear control over about what to dereference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
def mutate(self, *exprs, **mutations):
from ibis.expr.rewrites import rewrite_project_input
node = self.op()
values = self.bind(*exprs, **mutations)
values = unwrap_aliases(values)
# we need to detect reductions which are either turned into window functions
# or scalar subqueries depending on whether they are originating from self
values = {
k: rewrite_project_input(v, relation=self.op()) for k, v in values.items()
}
return ops.Project(self, {**node.fields, **values}).to_expr()|
Thanks for reviewing so quickly & the comments which I've implemented. They have a bonus side effect which is that an additional call to .bind() within the removed call to .select(), which appears to be superfluous, is now omitted. So runtime is now about 30% of original (see below for profile). Around 50% of remaining time is spent in DerefMap.from_targets. @kszucs you suggested caching this on Tables here, which I would be in favour of if acceptable to ibis team? This would be beneficial to all table operations as well as .mutate A typical use case we have is to run many relatively simple queries on a base table, so having the derefmap cached would be quite beneficial from our POV. |
|
if Derefmaps are cached, then their memory requirements for a length D chain of operations with N fields is O(N*D^2), and each new derefmap on the chain takes incrementally more time to build than the previous one (that's already the case though), so that's probably not a great way forward Another approach would be to only cache 1 layer depth of derefs on each table, and then do something like: this would have following characteristics:
would that be worth considering? |
| │ Adelie │ 2007 │ -7.22193 │ | ||
| └─────────┴───────┴────────────────┘ | ||
| """ | ||
| from ibis.expr.rewrites import rewrite_project_input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a comment here to keep the implementation of .mutate() and .select() in sync except for the **node.fields part? This might help preserving feature parity between the two methods.
|
@jc-5s yes, we should find a better way to make dereferencing less resource hungry. Your described solution sounds good on principle but there are a lot of details we need to pay attention to. Feel free to put up a follow-up PR then I will take a look. |
ibis/expr/types/relations.py
Outdated
| value = value.name(original.get_name()) | ||
| result.append(value) | ||
| return tuple(result) | ||
| return tuple(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra trailing whitespace.
kszucs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my side.
@cpcloud please merge when you have the chance.
|
thanks again, I've made the linting fixes and added comments to select and mutate about keeping them in sync will come back to the DerefMap idea at another point as time allows :) to fix the "conventional commits" issue, is it best to e.g. squash the PR and follow style e.g. here? Just LMK if anything else needed. Thanks! |
|
Thanks @hottwaj and @kszucs! |
Description of changes
Table.select()now includes a "private" kwarg_exprs_that_do_not_need_dereferencingwhich is an optional dict of values that do not need "dereferencing" viaTable.bind- this should be used for values that are already known to be "dereferenced" to reduce cost of that step_exprs_that_do_not_need_dereferencingis now passed byTable.mutateand contains columns onself, so that if a wide table is mutated, existing columns are not dereferenced, and only new ones are instead.With this change, time needed to add a column to a wide table is reduced by about 50% for a 100 column table.
100 mutates on 100 column table before:
100 mutates on 100 column table after
Issues closed