ESQL: Consider inlinestats when having field_caps check for field names #127564

Open
wants to merge 12 commits into
base: main
from

Conversation

astefan
Contributor

@astefan astefan commented Apr 30, 2025

The aggregate inside an inlinestats interferes with the way field names are collected for field_caps requests. As a result, simple queries like from test | inlinestats max(whatever) by group did not return all fields from test, but limited the resulting columns to whatever and group. The purpose of inlinestats is to add columns to an already existing set of columns, which implies that this command has to be "transparent" to any wider collection of field names.

Fixes #127236
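
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the actual EsqlSession.fieldNames code: the Plan, From, Stats and InlineStats records and the fieldNames method below are hypothetical stand-ins, and the narrowing rule is a simplifying assumption. It only illustrates why a STATS may narrow the requested field names while an INLINESTATS has to pass the wider set through.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FieldNamesSketch {
    // Hypothetical, simplified plan nodes; these are not the real ES|QL classes.
    interface Plan {}
    record From(String index) implements Plan {}
    record Stats(Plan child, List<String> referenced) implements Plan {}
    record InlineStats(Plan child, List<String> referenced) implements Plan {}

    // Returns the field name patterns to request; "*" stands for "all fields".
    static Set<String> fieldNames(Plan plan) {
        if (plan instanceof Stats s) {
            // STATS drops every column it does not reference, so the wildcard
            // coming from upstream can be discarded.
            Set<String> names = new LinkedHashSet<>(s.referenced());
            for (String n : fieldNames(s.child())) {
                if (!n.equals("*")) names.add(n);
            }
            return names;
        }
        if (plan instanceof InlineStats i) {
            // INLINESTATS only adds columns, so it must stay "transparent":
            // whatever was needed upstream is still needed.
            Set<String> names = new LinkedHashSet<>(fieldNames(i.child()));
            if (!names.contains("*")) names.addAll(i.referenced());
            return names;
        }
        // A bare FROM needs every field.
        return new LinkedHashSet<>(Set.of("*"));
    }

    public static void main(String[] args) {
        Plan from = new From("test");
        List<String> refs = List.of("whatever", "group");
        System.out.println(fieldNames(new InlineStats(from, refs))); // [*]
        System.out.println(fieldNames(new Stats(from, refs)));       // [whatever, group]
    }
}
```

In this toy model, treating the aggregate of an inlinestats like a plain stats would produce the second result for the first query, which matches the lost-columns symptom described above.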

@astefan astefan requested a review from alex-spies April 30, 2025 12:53
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @astefan, I've created a changelog YAML for you.

@@ -360,14 +362,14 @@ FROM airports
| LIMIT 3
;

abbrev:keyword | city:keyword | region:text | "COUNT(*)":long
Contributor Author

Unrelated to this PR, but related to previous work.

@@ -127,7 +127,7 @@ protected void shouldSkipTest(String testName) throws IOException {
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V2.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(JOIN_PLANNING_V1.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V5.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V7.capabilityName()));
Contributor Author

@astefan astefan Apr 30, 2025

V7 because I am trying to work on multiple separate issues. V6 should come from #127383

@astefan astefan added auto-backport Automatically create backport pull requests when merged >bug and removed >enhancement labels Apr 30, 2025
@elasticsearchmachine
Collaborator

Hi @astefan, I've updated the changelog YAML for you.

@elasticsearchmachine
Collaborator

Hi @astefan, I've updated the changelog YAML for you.

Contributor

@alex-spies alex-spies left a comment

Thanks @astefan! The fix works and the added tests are nice. I found 2 buggy queries, but they are likely unrelated to this PR's work.

I think this solution is okay, but I'd prefer to avoid adding more complexity to the fieldNames method by special-casing INLINESTATS. This PR is only needed because we parse INLINESTATS as an InlineStats node containing an Aggregate child (which, in turn, contains the previous commands as its descendants). Therefore, I'd like to suggest another approach that changes how we represent a parsed INLINESTATS; see below.

Contributor

Heya, I tried some queries, trying to break things. I noticed 2 bugs which may or may not be related to this PR:

FROM hosts METADATA _index | eval x = ip1 | INLINESTATS ip1 = COUNT(*) BY host_group, card | SORT ip1 | LIMIT 1

gives an empty result, but removing the eval x = ip1 makes it work.

FROM hosts METADATA _index | INLINESTATS card = COUNT(*) BY card | SORT card | LIMIT 1

  description  |     host      |  host_group   |      ip0      |      ip1      |    _index     |     card      
---------------+---------------+---------------+---------------+---------------+---------------+---------------
alpha db server|alpha          |DB servers     |127.0.0.1      |127.0.0.1      |hosts          |eth0   

The card column has the wrong type: it should be a long. It seems we get the original index field here instead.

Comment on lines +573 to +577
List<LogicalPlan> inlinestats = parsed.collect(InlineStats.class::isInstance);
Set<Aggregate> inlinestatsAggs = new HashSet<>();
for (var i : inlinestats) {
    inlinestatsAggs.add(((InlineStats) i).aggregate());
}
Contributor

The solution here looks correct but is confusing. That is because we parse INLINESTATS as an InlineStats node containing an Aggregate node as its child, so for any given Aggregate we don't know whether it belongs to a STATS or an INLINESTATS, and the two have very different semantics.

I think we should rather just parse INLINESTATS as a single plan node - this would prevent this complexity.

Maybe consider refactoring the InlineStats node to avoid adding complexity here, as the fieldNames method is already hard to work with. A low-effort fix would be to still have the InlineStats wrap an Aggregate, but not as its child; the actual child would be the preceding command.
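
The difference between the two shapes can be shown with a toy model. All types below (Plan, From, Aggregate, InlineStatsAsParent, InlineStatsWrapping) and the collectAggregates helper are hypothetical and much simpler than the real plan classes; this is only a sketch of the idea, assuming a traversal that follows children() and nothing else.

```java
import java.util.ArrayList;
import java.util.List;

public class InlineStatsShapeSketch {
    // Hypothetical, simplified plan nodes; these are not the real ES|QL classes.
    interface Plan {
        List<Plan> children();
    }

    record From(String index) implements Plan {
        public List<Plan> children() { return List.of(); }
    }

    record Aggregate(Plan child, List<String> aggregates, List<String> groupings) implements Plan {
        public List<Plan> children() { return List.of(child); }
    }

    // Shape described in the review: the Aggregate is the child of InlineStats.
    record InlineStatsAsParent(Aggregate aggregate) implements Plan {
        public List<Plan> children() { return List.of(aggregate); }
    }

    // Suggested shape: the Aggregate is kept as a plain field, and the child
    // is the command that precedes INLINESTATS.
    record InlineStatsWrapping(Plan child, Aggregate aggregate) implements Plan {
        public List<Plan> children() { return List.of(child); }
    }

    // Collects every Aggregate reachable by walking children().
    static List<Aggregate> collectAggregates(Plan plan) {
        List<Aggregate> found = new ArrayList<>();
        if (plan instanceof Aggregate a) found.add(a);
        for (Plan c : plan.children()) found.addAll(collectAggregates(c));
        return found;
    }

    public static void main(String[] args) {
        From from = new From("test");
        Aggregate agg = new Aggregate(from, List.of("max(whatever)"), List.of("group"));
        // With the first shape, a tree walk sees an Aggregate it cannot tell apart from a STATS.
        System.out.println(collectAggregates(new InlineStatsAsParent(agg)).size());
        // With the second shape, the tree walk does not visit the wrapped Aggregate at all.
        System.out.println(collectAggregates(new InlineStatsWrapping(from, agg)).size());
    }
}
```

With the second shape, code that looks for Aggregate nodes would not need a separate set of "aggregates that belong to an INLINESTATS" to exclude.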

Contributor

@alex-spies alex-spies May 6, 2025

More generally, I wonder if there's an abstraction just around the corner that would do away with more special-casing inside this method.

In terms of the sets of attributes before and after INLINESTATS, it behaves similarly to EVAL, DISSECT, GROK, ENRICH and COMPLETION: some attributes are required because they are being referred to, some attributes are newly added and they shadow previous attributes. In the optimizer, we leverage this fact in the push down rules; for this, the plan nodes just need to implement the GeneratingPlan interface.

I think it'd be nice to move this method in a direction that would rely more on this general pattern.

That's out of scope for this PR, of course, but it'd also benefit from parsing INLINESTATS simply as 1 node rather than a combination of 2 nodes.
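
As a rough illustration of that general pattern, the sketch below walks a linear pipeline backwards and tracks which names still have to come from the index. The Command record and the requiredFields method are hypothetical and do not reproduce the real GeneratingPlan interface; the only assumption is that each command can report the names it refers to and the names it newly adds.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class GeneratingSketch {
    // Hypothetical stand-in for a command that refers to some columns and adds others.
    record Command(String name, Set<String> references, Set<String> generated) {}

    // Walks the pipeline from the last command to the first and returns the
    // names that must be provided by the index.
    static Set<String> requiredFields(List<Command> pipeline, Set<String> finalColumns) {
        Set<String> needed = new TreeSet<>(finalColumns);
        for (int i = pipeline.size() - 1; i >= 0; i--) {
            Command c = pipeline.get(i);
            // Names added here shadow anything upstream with the same name...
            needed.removeAll(c.generated());
            // ...but everything the command refers to must exist upstream.
            needed.addAll(c.references());
        }
        return needed;
    }

    public static void main(String[] args) {
        List<Command> pipeline = List.of(
            new Command("eval x = salary + 1", Set.of("salary"), Set.of("x")),
            new Command("inlinestats m = max(x) by l", Set.of("x", "l"), Set.of("m"))
        );
        // Final columns as if the query ended with: keep m, l
        System.out.println(requiredFields(pipeline, Set.of("m", "l"))); // [l, salary]
    }
}
```

Removing the generated names before adding the referenced ones matters for a command such as INLINESTATS card = COUNT(*) BY card, which both refers to card and shadows it: the index field card stays required.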

| inlinestats max(salary) by l
| stats min = min(salary) by l
| eval x = min + 1
| stats ca = count(*), cx = count(x) by l
Contributor

I think the same behavior is expected when this stats is replaced by a keep x, l (no wildcard), right?

Maybe let's add such tests, and also some where the STATS or KEEP (no wildcard) comes before the INLINESTATS, for good measure.

Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.0.2 v9.1.0
Development

Successfully merging this pull request may close these issues.

ESQL: columns lost with simple INLINESTATS query. EsqlSession.fieldNames test coverage needed
3 participants