Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Remove one redundant CopyExec for SMJ #962

Merged
merged 14 commits into from
Sep 27, 2024
Merged

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Sep 24, 2024

Which issue does this PR close?

N/A

Rationale for this change

We were injecting too many CopyExecs for SMJ. Plans looked like this:

SortMergeJoin
  CopyExec [UnpackOrDeepClone]
    SortExec
      CopyExec [UnpackOrDeepClone]
        ...
  CopyExec [UnpackOrDeepClone]
    SortExec
      CopyExec [UnpackOrDeepClone]
        ...

What changes are included in this PR?

With the changes in this PR, the plan now looks like:

SortMergeJoin
  SortExec
    CopyExec [UnpackOrDeepClone]
      ...
  SortExec
    CopyExec [UnpackOrDeepClone]
      ...

How are these changes tested?

Existing tests

@andygrove andygrove marked this pull request as ready for review September 24, 2024 18:28
is_sort_merge: bool,
plan: Arc<dyn ExecutionPlan>,
) -> Arc<dyn ExecutionPlan> {
if is_sort_merge {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more general description for SMJ's property to name this variable, and then we could use this function for the duplicated code elsewhere in this file?

let copy_exec = if can_reuse_input_batch(&child) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will refactor this to reduce duplication

@andygrove andygrove marked this pull request as draft September 24, 2024 20:48
@andygrove
Copy link
Member Author

andygrove commented Sep 24, 2024

I ran TPC-DS benchmarks comparing the latest from the main branch with this PR, and somehow, this PR introduces a considerable regression in performance. I am investigating.

edit: I ran the wrong script 🤦

@andygrove andygrove marked this pull request as ready for review September 24, 2024 21:13
@andygrove
Copy link
Member Author

andygrove commented Sep 24, 2024

A single run of TPC-DS comparing main to this PR does not show a significant difference (~1% difference). I did not expect this to make much difference, but it does remove a redundant operator and simplified the plan,

Comment on lines -1266 to -1280
// DataFusion Join operators keep the input batch internally. We need
// to copy the input batch to avoid the data corruption from reusing the input
// batch.
let left = if can_reuse_input_batch(&left) {
Arc::new(CopyExec::new(left, CopyMode::UnpackOrDeepCopy))
} else {
Arc::new(CopyExec::new(left, CopyMode::UnpackOrClone))
};

let right = if can_reuse_input_batch(&right) {
Arc::new(CopyExec::new(right, CopyMode::UnpackOrDeepCopy))
} else {
Arc::new(CopyExec::new(right, CopyMode::UnpackOrClone))
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is specific to HashJoinExec and not to SortMergeJoinExec, so I moved it out of this code, which is common to both joins.

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a great catch, and a super clean refactor.

} else {
Arc::new(CopyExec::new(child, CopyMode::UnpackOrClone))
};
let child = Self::wrap_in_copy_exec(child);
Copy link
Contributor

@mbutrovich mbutrovich Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a similar comment here as you added with HashJoin on why Sort needs a CopyExec?

@andygrove andygrove merged commit a690e9d into apache:main Sep 27, 2024
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants