-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make combine_similar less greedy for merge #334
Conversation
# Conflicts: # dask_expr/_merge.py
@@ -206,3 +207,26 @@ def test_merge_combine_similar(npartitions_left, npartitions_right): | |||
expected["new"] = expected.b + expected.c | |||
expected = expected.groupby(["a", "e", "x"]).new.sum() | |||
assert_eq(query, expected) | |||
|
|||
|
|||
def test_merge_combine_similar_intermediate_projections(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reproducer for the behavior being targeted by this PR? This test seems to pass on main
for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the bug was blocked by the additional layer in #333 but shows up now
# Conflicts: # dask_expr/_merge.py
Previous tree with the reproducer:
We have the 2 Join layers right next to each other without dropping unnecessary columns The new tree:
We have an intermediate Projection that drops unnecessary columns right away |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is pretty hard to follow (mostly because it was already complicated), but I don't have any better ideas, and I don't think it is worth the effort to spend much time attempting a simplification until we have addressed #336
dask_expr/_merge.py
Outdated
@@ -23,6 +23,7 @@ | |||
from dask_expr._util import _convert_to_list | |||
|
|||
_HASH_COLUMN_NAME = "__hash_partition" | |||
_PARTITION_COLUMNS = "_partitions" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Rename to _PARTITION_COLUMN
for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point, thx
This makes combine_similar less greedy, we are currently removing projections on both sides and see if we are equal, we should do one after another to avoid pushing projections up unnecessarily. Shuffling is expensive, so this should help quite a bit.
This sits on top of #333, the previous structure made this change even harder that it was anyway