Description
Original issue: #49028
Feature branch: field-retrieval
Docs: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-fields.html
Motivation
Often a user wants to retrieve a particular set of fields during a search. Currently, we don't support this usage pattern in a good way. In short, given a list of fields, there is no easy way to load all of their values:
- We can’t load all of them from doc values. Some fields like text fields may not have doc values at all, or we may exceed the limit for a reasonable number of doc value fields to load.
- It’s not easy to load all of them through source. For example, if the field is a field alias, it’s difficult to determine where to find its value in the source.
Better field retrieval support is becoming even more important now that we're introducing more field types that don’t fit the typical pattern, like constant_keyword and the proposed runtime fields (#48063).
Feature Summary
We plan to add a new fields section to the search request, which users would specify instead of using source filtering to load fields from source:
POST logs-*/_search
{
"query": { "match_all": {} },
"fields": [
"file.*",
{
"field": "event.timestamp",
"format": "epoch_millis"
},
...
]
}
Both full field names and wildcard patterns are accepted. Only leaf fields are returned; the API will not allow fetching object values. The fields are returned as a flat list in the fields section of each hit, the same as we do for docvalue_fields and script_fields.
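To make the "leaf fields only, flat list" behavior concrete, here is a minimal Python sketch. It is not the Elasticsearch implementation: the helper names (flatten_leaves, fetch_fields) and the sample document are hypothetical, and Python's fnmatch is only a stand-in for the API's wildcard matching.

```python
from fnmatch import fnmatchcase


def flatten_leaves(source, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested source dict.

    Assumption: objects are plain dicts; arrays of objects are not handled.
    """
    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten_leaves(value, prefix=f"{path}.")
        else:
            yield path, value


def fetch_fields(source, patterns):
    """Return a flat {field_name: [values]} dict for leaves matching any pattern."""
    result = {}
    for path, value in flatten_leaves(source):
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            # Values are always wrapped in a list, even when there is only one.
            values = value if isinstance(value, list) else [value]
            result.setdefault(path, []).extend(values)
    return result


# Hypothetical document source, for illustration only.
hit_source = {
    "file": {"name": "app.log", "size": 2048},
    "event": {"timestamp": "2020-05-01T12:00:00Z"},
    "message": "started",
}

fields = fetch_fields(hit_source, ["file.*", "event.timestamp"])
print(fields)
```

In this sketch, asking for the object field file on its own returns nothing, because only leaf paths such as file.name are candidates for matching.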
Overall, the API gives a friendly way to load fields from source:
- If a non-standard field like a field alias, multi-field, or constant_keyword is specified in fields, then we’ll consult the mappings to find and return the right value.
- The fields are returned in a flat list, as opposed to structured JSON.
- For date and numeric field types, we would support the same format parameter as we do for docvalue_fields to allow for adjusting the format of the results.
- Each value would be returned in a 'canonical' format -- for example, if a field is mapped as an integer, it will be returned as an integer even if it was specified as a string in the _source.
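The mapping-driven lookup described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual fetch phase: MAPPINGS is a simplified, hypothetical flattened view of the index mappings, and lookup_source and fetch_field are made-up helper names. It shows an alias being resolved to its target path and a string in _source being returned as an integer.

```python
# Hypothetical, flattened view of the mappings: field name -> mapping definition.
MAPPINGS = {
    "http.status": {"type": "integer"},
    "status": {"type": "alias", "path": "http.status"},
}


def lookup_source(source, path):
    """Walk a dotted path through the source and return its values as a list."""
    node = source
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return []
        node = node[part]
    return node if isinstance(node, list) else [node]


def fetch_field(source, mappings, name):
    """Resolve aliases, read the value from source, and coerce it per the mapping."""
    mapping = mappings[name]
    target = name
    if mapping["type"] == "alias":
        # An alias has no value of its own in the source; follow it to its target.
        target = mapping["path"]
        mapping = mappings[target]
    values = lookup_source(source, target)
    if mapping["type"] == "integer":
        # 'Canonical' format: an integer field comes back as an integer.
        values = [int(value) for value in values]
    return values


doc = {"http": {"status": "404"}}  # the value was indexed as a string
status = fetch_field(doc, MAPPINGS, "status")
print(status)
```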
Some clarifications:
- In this first pass, the API will not attempt to load from stored fields or doc values.
- For simplicity of parsing, values will always be returned in an array, even if there is only one value present.
Implementation Plan
- Introduce a fields section in the search request that fetches values from source. (Add a simple 'fetch fields' phase. #55639)
- Correctly resolve field aliases, multi-fields, and copy_to. (In field retrieval API, handle non-standard source paths. #55889)
- Consult field mappings to parse and correctly format each value. Also handle constant_keyword. (Allow field mappers to retrieve fields from source. #56928)
- Handle ignore_malformed. (Allow field mappers to retrieve fields from source. #56928)
- Support ignore_above. (Fix casting of scaled_float in sorts (backport of #57207) #57385)
- Support setting a format through the format parameter. (Deprecte Rounding#round (backport #57845) #57893)
- Measure performance and look into improvements. (For the fields fetch phase, avoid reloading stored fields. #58196)
- Support null_value. (Respect null_value parameter in the fields retrieval API. #58623)
- Improve documentation around field loading. (Add a reference on returning fields during a search. #57500, Add docs for the fields retrieval API. #58787)
- Handle geo values. (Support spatial fields in field retrieval API. #59821)
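As a rough illustration of how the ignore_malformed, ignore_above, and null_value items in this plan might shape the returned values, here is a small Python sketch. The function name apply_mapping_params is hypothetical, and the semantics are an assumption that retrieval mirrors what these parameters do at index time (skip malformed values, skip over-long keywords, substitute explicit nulls); the linked PRs define the real behavior.

```python
def apply_mapping_params(raw_values, mapping):
    """Filter and convert raw source values according to a field's mapping.

    Assumed semantics, mirroring index-time behavior:
      - null_value replaces explicit nulls; without it, nulls are dropped
      - ignore_above drops keyword values longer than the limit
      - ignore_malformed drops values that cannot be parsed
    """
    out = []
    for value in raw_values:
        if value is None:
            if "null_value" in mapping:
                out.append(mapping["null_value"])
            continue
        if mapping["type"] == "keyword":
            value = str(value)
            limit = mapping.get("ignore_above")
            if limit is not None and len(value) > limit:
                continue
            out.append(value)
        elif mapping["type"] == "integer":
            try:
                out.append(int(value))
            except (TypeError, ValueError):
                if mapping.get("ignore_malformed", False):
                    continue
                raise
    return out


keyword_mapping = {"type": "keyword", "ignore_above": 5, "null_value": "N/A"}
integer_mapping = {"type": "integer", "ignore_malformed": True}

tags = apply_mapping_params(["ok", "much-too-long", None], keyword_mapping)
counts = apply_mapping_params([1, "2", "abc"], integer_mapping)
print(tags, counts)
```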
Future improvements:
- Move FieldMapper#lookupValues to MappedFieldType. (?)
- Handle meta fields like _size.
- Make use of more efficient source parsing: Partially parse source documents to speed up source access #52591.
- Support the API in inner_hits.
Open Questions
- If a wildcard pattern matches both a parent field and one of its multi-fields, should we just return the parent to avoid returning the same value twice? A similar question holds for field aliases and their target fields.
- Should the API return fields in _source that have been disabled in the mappings (enabled: false)?
- For keyword fields, should we apply the normalizer or return the original value?