Commit 886db84

Expose Lucene's FeatureField. (#30618)
Lucene has a new `FeatureField` which gives the ability to record numeric features as term frequencies. Its main benefit is that it allows boosting queries with the values of these features while efficiently skipping non-competitive documents using block-max WAND and indexed impacts.
1 parent 739bb4f commit 886db84

14 files changed: +1616 -2 lines changed

docs/reference/mapping/types.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -40,6 +40,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
 
 <<parent-join>>:: Defines parent/child relation for documents within the same index
 
+<<feature>>:: Record numeric features to boost hits at query time.
+
 [float]
 === Multi-fields
 
@@ -86,6 +88,6 @@ include::types/percolator.asciidoc[]
 
 include::types/parent-join.asciidoc[]
 
-
+include::types/feature.asciidoc[]
 
 

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
[[feature]]
=== Feature datatype

A `feature` field can index numbers so that they can later be used to boost
documents in queries with a <<query-dsl-feature-query,`feature`>> query.

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "pagerank": {
          "type": "feature" <1>
        },
        "url_length": {
          "type": "feature",
          "positive_score_impact": false <2>
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "pagerank": 8,
  "url_length": 22
}

GET my_index/_search
{
  "query": {
    "feature": {
      "field": "pagerank"
    }
  }
}
--------------------------------------------------
// CONSOLE
<1> Feature fields must use the `feature` field type
<2> Features that correlate negatively with the score need to declare it

NOTE: `feature` fields only support single-valued fields and strictly positive
values. Multi-valued fields and negative values will be rejected.

NOTE: `feature` fields do not support querying, sorting or aggregating. They may
only be used within <<query-dsl-feature-query,`feature`>> queries.

NOTE: `feature` fields only preserve 9 significant bits for the precision, which
translates to a relative error of about 0.4%.
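
The 0.4% figure can be checked with a small Python sketch. It assumes, purely for illustration, that a value is rounded to a 32-bit float and that only the sign, the exponent and the 8 highest mantissa bits are kept (9 significant bits once the implicit leading bit is counted). The actual encoding is an implementation detail that this commit's documentation does not spell out, and `keep_top_16_bits` is a hypothetical helper, not an Elasticsearch or Lucene API.

```python
import struct


def keep_top_16_bits(value: float) -> float:
    """Round `value` to a 32-bit float, then zero its 15 lowest mantissa bits.

    Assumption for illustration only: this mimics "9 significant bits"
    (1 implicit leading bit + 8 stored mantissa bits).
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    truncated = (bits >> 15) << 15
    (result,) = struct.unpack(">f", struct.pack(">I", truncated))
    return result


def relative_error(value: float) -> float:
    """Relative difference between a value and its truncated form."""
    return abs(value - keep_top_16_bits(value)) / value


for v in (8.0, 22.0, 1000.3, 513.99):
    print(v, keep_top_16_bits(v), relative_error(v))
```

Under this assumption the relative error stays below `2 ** -8` (about 0.39%), which matches the "about 0.4%" stated in the note. Small integers such as `8` and `22` from the example above are preserved exactly.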

Features that correlate negatively with the score should set
`positive_score_impact` to `false` (defaults to `true`). This will be used by
the <<query-dsl-feature-query,`feature`>> query to modify the scoring formula
in such a way that the score decreases with the value of the feature instead of
increasing. For instance, in web search, the URL length is a commonly used
feature which correlates negatively with scores.
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
[[query-dsl-feature-query]]
=== Feature Query

The `feature` query is a specialized query that only works on
<<feature,`feature`>> fields. Its goal is to boost the score of documents based
on the values of numeric features. It is typically put in a `should` clause of
a <<query-dsl-bool-query,`bool`>> query so that its score is added to the score
of the query.

Compared to using <<query-dsl-function-score-query,`function_score`>> or other
ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
spectacular.
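
As a rough sketch of the `should` placement described above, the following Python snippet builds a search request body in which a `feature` clause adds to the score of a regular `match` clause. The `content` field and the search text are hypothetical and exist only for illustration. The snippet builds and prints JSON and does not talk to a cluster.

```python
import json


def boosted_search_body(text: str) -> dict:
    """Build a bool query whose score is the match score plus the feature score.

    `content` is a hypothetical text field; `pagerank` is the feature
    field used in the examples of this page.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "should": [{"feature": {"field": "pagerank"}}],
            }
        }
    }


print(json.dumps(boosted_search_body("elasticsearch"), indent=2))
```

Because the `feature` clause sits in `should`, it does not change which documents match. It only contributes to their scores.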

Here is an example:

[source,js]
--------------------------------------------------
PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "pagerank": {
          "type": "feature"
        },
        "url_length": {
          "type": "feature",
          "positive_score_impact": false
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "pagerank": 10,
  "url_length": 50
}

PUT test/_doc/2
{
  "pagerank": 100,
  "url_length": 20
}

POST test/_refresh

GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank"
    }
  }
}

GET test/_search
{
  "query": {
    "feature": {
      "field": "url_length"
    }
  }
}
--------------------------------------------------
// CONSOLE

[float]
=== Supported functions

The `feature` query supports 3 functions in order to boost scores using the
values of features. If you do not know where to start, we recommend that you
start with the `saturation` function, which is the default when no function is
provided.

[float]
==== Saturation

This function gives a score that is equal to `S / (S + pivot)` where `S` is the
value of the feature and `pivot` is a configurable pivot value so that the
result will be less than +0.5+ if `S` is less than pivot and greater than +0.5+
if `S` is greater than pivot. Scores are always in +(0, 1)+.

If the feature has a negative score impact then the function will be computed as
`pivot / (S + pivot)`, which decreases when `S` increases.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
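
The two saturation formulas can be worked through in a few lines of Python. This is only a sketch of the arithmetic stated above, not the query implementation. The pivot of `30` used for `url_length` is a hypothetical value, and the requirement that `pivot` be strictly positive is an assumption made here so that scores stay in +(0, 1)+.

```python
def saturation(value: float, pivot: float, positive_score_impact: bool = True) -> float:
    """Score a strictly positive feature value with the saturation formula."""
    if value <= 0 or pivot <= 0:
        # Feature values must be strictly positive; a positive pivot is assumed.
        raise ValueError("value and pivot must be strictly positive")
    if positive_score_impact:
        return value / (value + pivot)  # increases with the value
    return pivot / (value + pivot)      # decreases with the value


# pagerank values of the two example documents, pivot 8 as in the request above
print(saturation(10, 8), saturation(100, 8))
# url_length values of the same documents, hypothetical pivot 30
print(saturation(50, 30, False), saturation(20, 30, False))
```

With a pivot of `8`, the document with `pagerank` 100 scores higher than the one with `pagerank` 10. With a negative score impact the order flips, so the shorter URL gets the higher score.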

If +pivot+ is not supplied then Elasticsearch will compute a default value that
will be approximately equal to the geometric mean of all feature values that
exist in the index. We recommend this if you haven't had the opportunity to
train a good pivot value.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "saturation": {}
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
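
To get a feel for what such a default looks like, the sketch below computes an exact geometric mean in Python for the two `pagerank` values indexed earlier. Elasticsearch only approximates this value and how it does so is not described here, so treat `geometric_mean` as an illustration of the definition rather than of the actual computation.

```python
import math


def geometric_mean(values):
    """Exact geometric mean of strictly positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


pageranks = [10, 100]  # values from the example documents
pivot = geometric_mean(pageranks)
print(pivot)

# Saturation scores obtained with that pivot
for value in pageranks:
    print(value, value / (value + pivot))
```

For these two documents the geometric mean is the square root of 1000, roughly 31.6, so the document with `pagerank` 10 scores below 0.5 and the one with `pagerank` 100 scores above it.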

[float]
==== Logarithm

This function gives a score that is equal to `log(scaling_factor + S)` where
`S` is the value of the feature and `scaling_factor` is a configurable scaling
factor. Scores are unbounded.

This function only supports features that have a positive score impact.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "log": {
        "scaling_factor": 4
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
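
A minimal Python sketch of the formula follows. It assumes that `log` denotes the natural logarithm, which the text above does not state explicitly, and `log_score` is a hypothetical helper name.

```python
import math


def log_score(value: float, scaling_factor: float) -> float:
    """Score a feature value as log(scaling_factor + value).

    Assumption: natural logarithm.
    """
    return math.log(scaling_factor + value)


# pagerank values of the two example documents, scaling_factor 4 as above
for value in (10, 100):
    print(value, log_score(value, 4))
```

Unlike `saturation`, the score keeps growing as the feature value grows, which is why it has no upper bound.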

[float]
==== Sigmoid

This function is an extension of `saturation` which adds a configurable
exponent. Scores are computed as `S^exp^ / (S^exp^ + pivot^exp^)`. As with the
`saturation` function, `pivot` is the value of `S` that gives a score of +0.5+
and scores are in +(0, 1)+.

`exponent` must be positive, but is typically in +[0.5, 1]+. A good value should
be computed via training. If you don't have the opportunity to do so, we recommend
that you stick to the `saturation` function instead.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "sigmoid": {
        "pivot": 7,
        "exponent": 0.6
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
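
The sigmoid formula can be sketched the same way. The Python function below only reproduces the arithmetic given above, with the `pivot` and `exponent` values taken from the request. `sigmoid` is a hypothetical helper name, not part of any client library.

```python
def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """Score a feature value as value^exp / (value^exp + pivot^exp)."""
    scaled_value = value ** exponent
    return scaled_value / (scaled_value + pivot ** exponent)


# pagerank values of the two example documents, pivot 7 and exponent 0.6 as above
for value in (10, 100):
    print(value, sigmoid(value, 7, 0.6))

# With an exponent of 1 the formula reduces to the saturation function
print(sigmoid(10, 7, 1.0), 10 / (10 + 7))
```

A value equal to the pivot always scores +0.5+, whatever the exponent. The exponent controls how quickly scores move away from +0.5+ on either side of the pivot.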

docs/reference/query-dsl/special-queries.asciidoc

Lines changed: 7 additions & 0 deletions
@@ -19,6 +19,11 @@ This query allows a script to act as a filter. Also see the
 This query finds queries that are stored as documents that match with
 the specified document.
 
+<<query-dsl-feature-query,`feature` query>>::
+
+A query that computes scores based on the values of numeric features and is
+able to efficiently skip non-competitive hits.
+
 <<query-dsl-wrapper-query,`wrapper` query>>::
 
 A query that accepts other queries as json or yaml string.
@@ -29,4 +34,6 @@ include::script-query.asciidoc[]
 
 include::percolate-query.asciidoc[]
 
+include::feature-query.asciidoc[]
+
 include::wrapper-query.asciidoc[]
