-
Notifications
You must be signed in to change notification settings - Fork 25.3k
ESQL: Allow pruning columns added by InlineJoin #131204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This updates PruneColumns rule to allow optimizing away the columns added by InlineJoin, dropping unnecessary agg'ing.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @bpintea, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's looking good, but the refactoring can be made a bit more digestible with more comments and smarter method names.
I didn't check the correctness of the optimizer rule unit tests, only looked at the queries they are using.
I have noticed that, after chasing another bug I discovered, that it's fixing another behavior as well.
In #124715, it was described as FROM kibana_sample_data_logs | EVAL timestamp=DATE_TRUNC(5 minute, @timestamp) | INLINESTATS results = count(*) by timestamp | keep results, timestamp
. For the dataset used in csv-spec files:
FROM employees | WHERE still_hired == false | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | SORT emp_no
FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | WHERE still_hired == false
FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal) BY gender | keep emp_no, still_hired, total
row salary = 12300, emp_no = 5, gender = "F" | eval sal = salary/1000 | inlinestats sum(sal) by gender | keep emp_no
^ here the presence of that evaluation before inlinestats is at least one of the triggers. We do have some queries with eval
before inlinestats
and with keep
after it which succeed. Those above do not. I recommend adding some flavors of these to the csv-spec file.
// track inlinestats' own aggregation output (right-hand side of the join) so that any other plan on the left-hand side of the | ||
// inline join won't have its columns pruned due to the lack of "visibility" into the right hand side output/Attributes | ||
var inlineJoinRightOutput = new ArrayList<Attribute>(); | ||
return pruneColumns(plan, plan.outputSet().asBuilder(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This a harder to follow code, but with some comments it can be made easier to digest.
For one, I would two methods; one LogicalPlan pruneColumns(LogicalPlan plan, AttributeSet.Builder used)
and use it here. This first method would call pruneColumns(plan, plan.outputSet().asBuilder(), false)
.
The second method would be pruneInlineJoinColumns(LogicalPlan plan, AttributeSet.Builder used)
and call pruneColumns(plan, plan.outputSet().asBuilder(), true)
.
p = aggregate.with(aggregate.groupings(), remaining); | ||
} | ||
} else { | ||
if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment tbh // not expecting high groups cardinality
|
||
var remaining = pruneUnusedAndAddReferences(aggregate.aggregates(), used); | ||
|
||
if (remaining != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if you checked here if (remaining == null) return p;
? to have a smaller if
branch?
if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality | ||
// It's an INLINEJOIN and all remaining attributes are groupings, which are already part of the IJ output (from the | ||
// left-hand side). | ||
if (aggregate.child() instanceof StubRelation stub) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this condition will need an update when we start supporting filters on aggs... if that guess is valid, I think a TODO is in order here. If you will add that TODO, for some (all, most I hope) I added // TODO Inlinestats:....
to find them quicker later.
return p; | ||
} | ||
|
||
private static LogicalPlan pruneColumns(Project project, AttributeSet.Builder used) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the meat of the PR, right?
p = remaining.isEmpty() || remaining.stream().allMatch(FieldAttribute.class::isInstance) | ||
? emptyLocalRelation(project) | ||
: new Project(project.source(), project.child(), remaining); | ||
} else if (project.output().stream().allMatch(FieldAttribute.class::isInstance)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we interested in this aspect? (the fact that the projection is made of fields only)
This updates
PruneColumns
rule to allow optimizing away the columns added byInlineJoin
, dropping unnecessary agg'ing.