Skip to content

Fix field resolution for FORK #128193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 21, 2025
Merged

Fix field resolution for FORK #128193

merged 5 commits into from
May 21, 2025

Conversation

ioanatia
Copy link
Contributor

Should close #127208

Part of #121950

AFAICS we don't need to ask for all fields when FORK is used and the current field resolution should work since FORK is a N-ary plan.

@ioanatia ioanatia added >non-issue :Analytics/ES|QL AKA ESQL Team:SearchOrg Meta label for the Search Org (Enterprise Search) v9.1.0 labels May 20, 2025
@ioanatia ioanatia marked this pull request as ready for review May 20, 2025 18:24
@ioanatia ioanatia requested a review from ChrisHegarty May 20, 2025 18:24
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed Team:SearchOrg Meta label for the Search Org (Enterprise Search) labels May 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From code and tests in csv-spec files I noticed that FORK is "generating" a _fork (sort of hidden) field. Can you, please, add tests with queries that make use of this _fork field?

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query below (from rerank.csv-spec) is using as a field name _fork which is strange (many other queries using fork and rrf do this). This is the full list of field names it's asking for: book_no, author, title, _fork, title.*, _fork.*, author.*, book_no.*.

FROM books METADATA _id, _index, _score
            | FORK ( WHERE title:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
            ( WHERE author:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
            | RRF
            | RERANK "Tolkien" ON title WITH test_reranker
            | LIMIT 2
            | KEEP book_no, title, author

EsqlSession.fieldNames automatically ignores MetadataAttributes (like _id, _score, _index and others). I'm sure you've discussed about this: what was the reason for not considering _fork as a MetadataAttribute?

| WHERE a > 2000
| EVAL b = a + 100
| FORK (WHERE c > 1 AND a < 10000 | EVAL d = a + 500)
(STATS x = count(*), y=min(z))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm - now that I am looking at this more - I think we should return all fields, because one branch is not bounded by any command like KEEP or STATS that resets the output to a known list of fields. will fix 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more query where I am confused, in part because of probably not knowing what is the expectation for fork.

FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 )

This shows these columns:

min(salary) | gender | _fork | salary

Should these be a "union" kind of set of columns and fieldNames is only limiting it to what it "sees" in the fork's first branch? If that's true, this means that fieldNames should consider a union kind of field names from all the branches of fork. As a shortcut, the first branch that it finds that's the "widest" it should stop checking the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FORK output should be a union of the outputs of the fork branches.

If we detect that the same field exists in multiple fork branches outputs, but the field has different type, we fail with a validation error.

FORK will pad with null columns the output of FORK branches that are missing fields.
For example:

ROW a = [1, 2, 3], b = "foo"
| MV_EXPAND a
| FORK (WHERE true | LIMIT 2)
        (STATS x = count(*))
        (WHERE a % 2 == 0 | EVAL y = 2)
| SORT _fork, a
| KEEP _fork, a, b, x, y

Will produce:

     _fork     |       a       |       b       |       x       |       y       
---------------+---------------+---------------+---------------+---------------
fork1          |1              |foo            |null           |null           
fork1          |2              |foo            |null           |null           
fork2          |null           |null           |3              |null           
fork3          |2              |foo            |null           |2          

I think fixed the field resolution for this case.
In this test example here we should ask for all fields, since not all branches are bounded by an Aggregate or Project.

@ioanatia
Copy link
Contributor Author

ioanatia commented May 21, 2025

EsqlSession.fieldNames automatically ignores MetadataAttributes (like _id, _score, _index and others). I'm sure you've discussed about this: what was the reason for not considering _fork as a MetadataAttribute?

We could consider _fork a metadata attribute - but unlike _id, _index, you could actually have a _fork mapped field.
Promise to take this as a separate question that we can address, but for now we end up asking field_caps for the _fork field when we see it referenced.

Copy link
Contributor

@tteofili tteofili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Please, also, add that query I mentioned in one of my comments: FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 ) to IndexResolverFieldNamesTests. Thank you.

@ioanatia ioanatia merged commit c8581b0 into elastic:main May 21, 2025
17 of 18 checks passed
@ioanatia ioanatia deleted the fork_field_names branch May 21, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ES|QL: tests for FORK's evaluation of field names used in field_caps resolve calls
5 participants