Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance regression with stddev being enabled by default #840

Closed
wants to merge 22 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Aug 18, 2024

Which issue does this PR close?

Related to #824

Rationale for this change

Fix a performance regression and simplify configs for enabling operators and expressions

tpcds_allqueries

We now fall back to Spark for stddev_sample aggregates.

image

What changes are included in this PR?

  • stddev is now disabled by default. There was a recent regression related to configs that had enabled this by default, and this operation is much slower in DataFusion than in Spark.
  • Remove spark.comet.exec.all.enabled and enable all operators by default. Each operator can be disabled individually be changing it's enabled config to false. These are all documented.

How are these changes tested?

Existing tests

@andygrove andygrove changed the title WIP: misc perf improvements WIP: Fix performance regression with stddev being enabled by default Aug 18, 2024
Comment on lines 1212 to 1221
fn prep_join_input(plan: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
if can_reuse_input_batch(&plan) {
Arc::new(CopyExec::new(plan, CopyMode::UnpackOrDeepCopy))
} else {
Arc::new(CopyExec::new(plan, CopyMode::UnpackOrClone))
}
}

let left = prep_join_input(left);
let right = prep_join_input(right);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code cleanup: no functional change

@codecov-commenter
Copy link

codecov-commenter commented Aug 18, 2024

Codecov Report

Attention: Patch coverage is 53.01205% with 39 lines in your changes missing coverage. Please review.

Project coverage is 34.04%. Comparing base (c9af3f4) to head (a868355).
Report is 1 commits behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 53.06% 8 Missing and 15 partials ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala 0.00% 1 Missing and 15 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #840      +/-   ##
============================================
- Coverage     34.05%   34.04%   -0.01%     
+ Complexity      879      861      -18     
============================================
  Files           112      112              
  Lines         42959    42912      -47     
  Branches       9488     9473      -15     
============================================
- Hits          14629    14609      -20     
+ Misses        25330    25312      -18     
+ Partials       3000     2991       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andygrove
Copy link
Member Author

@parthchandra Could I get your opinion on the config changes in this PR? I'm not sure this is the best approach but it was previously not possible to have a default value of false for spark.comet.exec.OPNAME.enabled because it would be enabled anyway if all execs were enabled.

@andygrove andygrove mentioned this pull request Aug 19, 2024
5 tasks
@parthchandra
Copy link
Contributor

I'm not sure this is the best approach but it was previously not possible to have a default value of false for spark.comet.exec.OPNAME.enabled because it would be enabled anyway if all execs were enabled.

Would it be better to have have an exclusion list (does not have to be a config, could just be an internal list) which is explicitly checked if EXEC_ALL is enabled?
So op is enabled if ( (EXEC_ALL && not in EXEC_ALL_EXCLUSIONS) || op is enabled) ?
We can include an op in exclusions if we feel it is not ready and it can be enabled explicitly for development/testing.

@andygrove andygrove changed the title WIP: Fix performance regression with stddev being enabled by default Fix performance regression with stddev being enabled by default Aug 21, 2024
@andygrove andygrove marked this pull request as ready for review August 21, 2024 20:33
@andygrove
Copy link
Member Author

There is an upstream PR in DataFusion to improve stddev performance
apache/datafusion#12095

@andygrove
Copy link
Member Author

I am closing this issue because we can disable this expression via a config now that #855 is merged

@andygrove andygrove closed this Aug 25, 2024
@andygrove andygrove deleted the perf-aug-18 branch August 25, 2024 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants