ESQL: Consider inlinestats when having field_caps check for field names #127564

Merged
merged 16 commits into from
May 22, 2025

Conversation

astefan
Contributor

@astefan astefan commented Apr 30, 2025

The aggregate inside an inlinestats was "interfering" with the way field names are collected for field_caps requests. As a result, simple queries like from test | inlinestats max(whatever) by group did not return all fields from test; the resulting columns were limited to whatever and group. The purpose of inlinestats is to add columns to an existing set of columns, which implies that this command has to be "transparent" to any wider collection of field names.
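
To illustrate the intended behavior, here is a minimal sketch of a field-name collector in which inlinestats is "transparent". It is a toy model, not the code in EsqlSession: the Kind enum, the Command record and the narrowsColumns helper are hypothetical names introduced only for this example, and "*" stands in for "request all fields".

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class FieldNamesSketch {
    // Hypothetical command kinds; the real planner works on logical plan nodes.
    enum Kind { FROM, EVAL, STATS, KEEP, INLINESTATS }

    // A command plus the field names it refers to (toy model, an assumption).
    record Command(Kind kind, Set<String> referenced) {}

    // Only commands that replace the column set narrow the request.
    // INLINESTATS only appends columns, so it must not count here.
    static boolean narrowsColumns(Kind kind) {
        return kind == Kind.STATS || kind == Kind.KEEP;
    }

    // Returns the field name patterns to request; "*" means all fields.
    static Set<String> fieldNames(List<Command> query) {
        if (query.stream().noneMatch(c -> narrowsColumns(c.kind()))) {
            return Set.of("*");
        }
        Set<String> names = new TreeSet<>();
        for (Command c : query) {
            names.addAll(c.referenced());
        }
        return names;
    }

    public static void main(String[] args) {
        Command from = new Command(Kind.FROM, Set.of());
        Set<String> refs = Set.of("whatever", "group");
        // from test | inlinestats max(whatever) by group
        System.out.println(fieldNames(List.of(from, new Command(Kind.INLINESTATS, refs))));
        // from test | stats max(whatever) by group
        System.out.println(fieldNames(List.of(from, new Command(Kind.STATS, refs))));
    }
}
```

In this model the inlinestats query still asks for every field, while the equivalent stats query asks only for the fields it references. The bug described above corresponds to treating INLINESTATS as if it narrowed the column set.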

Fixes #127236

@astefan astefan requested a review from alex-spies April 30, 2025 12:53
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @astefan, I've created a changelog YAML for you.

@@ -360,14 +362,14 @@ FROM airports
| LIMIT 3
;

abbrev:keyword | city:keyword | region:text | "COUNT(*)":long
Contributor Author

Unrelated to this PR, but to previous work.

@@ -127,7 +127,7 @@ protected void shouldSkipTest(String testName) throws IOException {
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V2.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(JOIN_PLANNING_V1.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V5.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V7.capabilityName()));
Contributor Author

@astefan astefan Apr 30, 2025

V7 because I am trying to work on multiple separate issues. V6 should come from #127383

@astefan astefan added auto-backport Automatically create backport pull requests when merged >bug and removed >enhancement labels Apr 30, 2025
@elasticsearchmachine
Collaborator

Hi @astefan, I've updated the changelog YAML for you.

@elasticsearchmachine
Collaborator

Hi @astefan, I've updated the changelog YAML for you.

Contributor

@alex-spies alex-spies left a comment

Thanks @astefan! The fix works and the added tests are nice. I found 2 buggy queries, but they are likely unrelated to this PR's work.

I think this solution is okay, but I'd prefer to avoid adding more complexity to the fieldNames method by special-casing for INLINESTATS. This PR is required because we parse INLINESTATS as an InlineStats node containing an Aggregate child (which, in turn, contains the previous commands as its descendants). Therefore, I'd like to suggest another approach which changes how we represent a parsed INLINESTATS - see below.

Contributor

Heya, I tried some queries, trying to break things. I noticed 2 bugs which may or may not be related to this PR:

FROM hosts METADATA _index | eval x = ip1 | INLINESTATS ip1 = COUNT(*) BY host_group, card | SORT ip1 | LIMIT 1

gives an empty result, but removing the eval x = ip1 makes it work.

FROM hosts METADATA _index | INLINESTATS card = COUNT(*) BY card | SORT card | LIMIT 1

  description  |     host      |  host_group   |      ip0      |      ip1      |    _index     |     card      
---------------+---------------+---------------+---------------+---------------+---------------+---------------
alpha db server|alpha          |DB servers     |127.0.0.1      |127.0.0.1      |hosts          |eth0   

The card column has the wrong type; it should be a long. It seems we get the original index field here instead.

Contributor Author

I've added this test to the suite. Data types are ok from my tests, there are other things wrong with that query. I've added details about the failure to the csv test suite.

Comment on lines +573 to +577
List<LogicalPlan> inlinestats = parsed.collect(InlineStats.class::isInstance);
Set<Aggregate> inlinestatsAggs = new HashSet<>();
for (var i : inlinestats) {
    inlinestatsAggs.add(((InlineStats) i).aggregate());
}
Contributor

The required solution here looks correct but confusing; this is because we parse INLINESTATS as an InlineStats node containing an Aggregate node as child, so we don't know for any given Aggregate if it's a STATS or an INLINESTATS, and the two have very different semantics.

I think we should rather just parse INLINESTATS as a single plan node - this would prevent this complexity.

Maybe consider refactoring the InlineStats node to avoid adding complexity here, as the fieldNames method is already hard to work with. A low-effort fix would be to still have the InlineStats wrap an Aggregate, but not as its child; the actual child would be the preceding command.
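
A rough sketch of the difference between the two shapes, using toy node types rather than the real plan classes. All names here (Node, Source, NestedInlineStats, FlatInlineStats, anyMatch) are hypothetical and exist only for this illustration; the point is that a generic tree walk over children() finds an Aggregate in the current shape but not in the suggested one.

```java
import java.util.List;
import java.util.function.Predicate;

class InlineStatsShapeSketch {
    // Toy plan node: only exposes its children (hypothetical, not the real API).
    interface Node { List<Node> children(); }

    record Source(String index) implements Node {
        public List<Node> children() { return List.of(); }
    }

    record Aggregate(Node child, String expression) implements Node {
        public List<Node> children() { return List.of(child); }
    }

    // Current shape: the Aggregate is the child, the previous command sits below it.
    record NestedInlineStats(Aggregate aggregate) implements Node {
        public List<Node> children() { return List.of(aggregate); }
    }

    // Suggested shape: the previous command is the child; the Aggregate is
    // carried as data and is not reachable through children().
    record FlatInlineStats(Node child, Aggregate aggregate) implements Node {
        public List<Node> children() { return List.of(child); }
    }

    // Generic pre-order search over the tree, following children() only.
    static boolean anyMatch(Node root, Predicate<Node> predicate) {
        if (predicate.test(root)) {
            return true;
        }
        for (Node child : root.children()) {
            if (anyMatch(child, predicate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node source = new Source("test");
        Aggregate agg = new Aggregate(source, "max(whatever) by group");
        System.out.println(anyMatch(new NestedInlineStats(agg), Aggregate.class::isInstance));
        System.out.println(anyMatch(new FlatInlineStats(source, agg), Aggregate.class::isInstance));
    }
}
```

With the second shape, a check such as "does the plan contain an Aggregate?" would no longer need to filter out the aggregates that belong to an INLINESTATS.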

Contributor

@alex-spies alex-spies May 6, 2025

More generally, I wonder if there's an abstraction just around the corner that would do away with more special-casing inside this method.

In terms of the sets of attributes before and after INLINESTATS, it behaves similarly to EVAL, DISSECT, GROK, ENRICH and COMPLETION: some attributes are required because they are being referred to, some attributes are newly added and they shadow previous attributes. In the optimizer, we leverage this fact in the push down rules; for this, the plan nodes just need to implement the GeneratingPlan interface.

I think it'd be nice to move this method in a direction that would rely more on this general pattern.

That's out of scope for this PR, of course, but it'd also benefit from parsing INLINESTATS simply as 1 node rather than a combination of 2 nodes.
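
As a rough illustration of that general pattern (not the actual GeneratingPlan interface or the fieldNames implementation), the sketch below models each command as a step with a set of referenced names and a set of generated names, and walks the query backwards to work out which source fields are needed. GeneratingStep and requiredSourceFields are hypothetical names used only here.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class GeneratingStepSketch {
    // Hypothetical stand-in for a command that reads some columns and adds others.
    record GeneratingStep(String name, Set<String> references, Set<String> generated) {}

    // Walks the commands from last to first and returns the source fields needed.
    static Set<String> requiredSourceFields(List<GeneratingStep> steps, Set<String> neededAtEnd) {
        Set<String> needed = new TreeSet<>(neededAtEnd);
        for (int i = steps.size() - 1; i >= 0; i--) {
            GeneratingStep step = steps.get(i);
            // Names produced by this step shadow anything upstream with the same name.
            needed.removeAll(step.generated());
            // Names read by this step must be available upstream.
            needed.addAll(step.references());
        }
        return needed;
    }

    public static void main(String[] args) {
        List<GeneratingStep> steps = List.of(
            new GeneratingStep("EVAL x = salary + 1", Set.of("salary"), Set.of("x")),
            new GeneratingStep("INLINESTATS m = MAX(x) BY l", Set.of("x", "l"), Set.of("m"))
        );
        // Pretend the query ends with KEEP m, l.
        System.out.println(requiredSourceFields(steps, Set.of("m", "l")));
    }
}
```

In this model EVAL and INLINESTATS are handled by the same two set operations, which is the kind of uniform treatment the comment above hints at.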

Contributor Author

Those are some good points (the use of GeneratingPlan and refactoring InlineStats), but I need more time to dig through these to prove these are valid changes to make. IMHO, the argument for simplifying what fieldNames is doing (looking at the aggregate inside an inlinestats) is not a strong one to warrant the refactoring. This change needs to be conceptually sound to make sense, ignoring the EsqlSession stuff.

Meaning, the conceptually sound argument needs to drive the refactoring and not the fact that fieldNames becomes more complex.

| inlinestats max(salary) by l
| stats min = min(salary) by l
| eval x = min + 1
| stats ca = count(*), cx = count(x) by l
Contributor

I think the same behavior is expected when this stats is replaced by a keep x, l (no wildcard), right?

Maybe let's add such tests, and also some where the STATS or KEEP (no wildcard) comes before the INLINESTATS, for good measure.

Contributor Author

Added more tests

@astefan astefan requested a review from costin May 20, 2025 14:40
Contributor

@bpintea bpintea left a comment

I agree with Alex's observation in general, but I think the fix as-is is fine and contained. We can consider redesigning INLINESTATS in a follow-up (maybe treating it as the join it actually is).

Comment on lines 582 to 583
plan -> plan instanceof Project
    || (plan instanceof Aggregate agg && (inlinestatsAggs.isEmpty() || inlinestatsAggs.contains(agg) == false))
Contributor

Suggested change
- plan -> plan instanceof Project
-     || (plan instanceof Aggregate agg && (inlinestatsAggs.isEmpty() || inlinestatsAggs.contains(agg) == false))
+ plan -> plan instanceof Project
+     || plan instanceof Aggregate agg && inlinestatsAggs.contains(agg) == false
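
The suggestion relies on two facts: an empty set contains nothing, so the isEmpty() check is redundant, and && binds tighter than || in Java, so the outer parentheses can go. A small self-contained check of the first point, using hypothetical helper names (original, simplified) that are not part of the PR:

```java
import java.util.List;
import java.util.Set;

class PredicateSimplificationSketch {
    // Condition as written before the suggestion.
    static <T> boolean original(Set<T> set, T item) {
        return set.isEmpty() || set.contains(item) == false;
    }

    // Condition after the suggestion: contains() is already false for an empty set.
    static <T> boolean simplified(Set<T> set, T item) {
        return set.contains(item) == false;
    }

    public static void main(String[] args) {
        List<Set<String>> sets = List.of(Set.of(), Set.of("a"), Set.of("a", "b"));
        for (Set<String> set : sets) {
            for (String item : List.of("a", "c")) {
                // Both forms should agree for every combination.
                System.out.println(set + " / " + item + ": "
                    + (original(set, item) == simplified(set, item)));
            }
        }
    }
}
```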

@astefan
Contributor Author

astefan commented May 22, 2025

@elasticmachine run elasticsearch-ci/part-3

@astefan
Contributor Author

astefan commented May 22, 2025

@elasticmachine run elasticsearch-ci/part-4

@astefan
Contributor Author

astefan commented May 22, 2025

@elasticmachine run elasticsearch-ci/bwc-snapshots

@astefan
Contributor Author

astefan commented May 22, 2025

@elasticmachine run elasticsearch-ci/part-4

@astefan astefan merged commit 28b10c3 into elastic:main May 22, 2025
18 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.0 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127564

astefan added a commit to astefan/elasticsearch that referenced this pull request May 23, 2025
…es (elastic#127564)

* Make inlinestats "transparent" to EsqlSession.fieldNames

(cherry picked from commit 28b10c3)
elasticsearchmachine pushed a commit that referenced this pull request May 23, 2025
…es (#127564) (#128345)

* Make inlinestats "transparent" to EsqlSession.fieldNames

(cherry picked from commit 28b10c3)
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.0.3 v9.1.0
Development

Successfully merging this pull request may close these issues.

ESQL: columns lost with simple INLINESTATS query. EsqlSession.fieldNames test coverage needed
4 participants