-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Fix field resolution for FORK #128193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix field resolution for FORK #128193
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From code and tests in csv-spec files I noticed that FORK
is "generating" a _fork
(sort of hidden) field. Can you, please, add tests with queries that make use of this _fork
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query below (from rerank.csv-spec
) is using as a field name _fork
which is strange (many other queries using fork
and rrf
do this). This is the full list of field names it's asking for: book_no, author, title, _fork, title.*, _fork.*, author.*, book_no.*
.
FROM books METADATA _id, _index, _score
| FORK ( WHERE title:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
( WHERE author:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
| RRF
| RERANK "Tolkien" ON title WITH test_reranker
| LIMIT 2
| KEEP book_no, title, author
EsqlSession.fieldNames
automatically ignores MetadataAttribute
s (like _id
, _score
, _index
and others). I'm sure you've discussed about this: what was the reason for not considering _fork
as a MetadataAttribute
?
| WHERE a > 2000 | ||
| EVAL b = a + 100 | ||
| FORK (WHERE c > 1 AND a < 10000 | EVAL d = a + 500) | ||
(STATS x = count(*), y=min(z)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm - now that I am looking at this more - I think we should return all fields, because one branch is not bounded by any command like KEEP
or STATS
that resets the output to a known list of fields. will fix 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one more query where I am confused, in part because of probably not knowing what is the expectation for fork
.
FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 )
This shows these columns:
min(salary) | gender | _fork | salary
Should these be a "union" kind of set of columns and fieldNames
is only limiting it to what it "sees" in the fork
's first branch? If that's true, this means that fieldNames
should consider a union kind of field names from all the branches of fork
. As a shortcut, the first branch that it finds that's the "widest" it should stop checking the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FORK output should be a union of the outputs of the fork branches.
If we detect that the same field exists in multiple fork branches outputs, but the field has different type, we fail with a validation error.
FORK will pad with null columns the output of FORK branches that are missing fields.
For example:
ROW a = [1, 2, 3], b = "foo"
| MV_EXPAND a
| FORK (WHERE true | LIMIT 2)
(STATS x = count(*))
(WHERE a % 2 == 0 | EVAL y = 2)
| SORT _fork, a
| KEEP _fork, a, b, x, y
Will produce:
_fork | a | b | x | y
---------------+---------------+---------------+---------------+---------------
fork1 |1 |foo |null |null
fork1 |2 |foo |null |null
fork2 |null |null |3 |null
fork3 |2 |foo |null |2
I think fixed the field resolution for this case.
In this test example here we should ask for all fields, since not all branches are bounded by an Aggregate
or Project
.
We could consider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Please, also, add that query I mentioned in one of my comments: FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 )
to IndexResolverFieldNamesTests
. Thank you.
Should close #127208
Part of #121950
AFAICS we don't need to ask for all fields when FORK is used and the current field resolution should work since FORK is a N-ary plan.