Update dense_vector docs with kNN indexing options #80306
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,14 +6,14 @@ | |
<titleabbrev>Dense vector</titleabbrev> | ||
++++ | ||
|
||
A `dense_vector` field stores dense vectors of float values. | ||
The maximum number of dimensions that can be in a vector should | ||
not exceed 2048. A `dense_vector` field is a single-valued field. | ||
The `dense_vector` field type stores dense vectors of float values. | ||
|
||
`dense_vector` fields do not support querying, sorting or aggregating. They can | ||
only be accessed in scripts through the dedicated <<vector-functions,vector functions>>. | ||
You can use `dense_vector` fields in | ||
<<query-dsl-script-score-query,`script_score`>> queries to score documents. | ||
They can also be indexed to support efficient k-nearest neighbor search. Dense | ||
vector fields do not support aggregations, sorting, or other query types. | ||
|
||
You index a dense vector as an array of floats. | ||
You add a `dense_vector` field as an array of floats: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
|
@@ -23,7 +23,7 @@ PUT my-index-000001 | |
"properties": { | ||
"my_vector": { | ||
"type": "dense_vector", | ||
"dims": 3 <1> | ||
"dims": 3 | ||
}, | ||
"my_text" : { | ||
"type" : "keyword" | ||
|
@@ -46,4 +46,125 @@ PUT my-index-000001/_doc/2 | |
|
||
-------------------------------------------------- | ||
|
||
<1> dims – the number of dimensions in the vector, required parameter. | ||
NOTE: Unlike most other data types, dense vectors are always single-valued. | ||
It is not possible to store multiple values in one `dense_vector` field. | ||
|
||
[[index-vectors-knn-search]] | ||
==== Index vectors for kNN search | ||
|
||
experimental::[] | ||
|
||
A _k-nearest neighbor_ (kNN) search finds the _k_ nearest | ||
vectors to a query vector, as measured by a similarity metric. | ||
|
||
Dense vector fields can be used to rank documents in | ||
<<query-dsl-script-score-query,`script_score` queries>>. This lets you perform | ||
a brute-force kNN search by scanning all documents and ranking them by | ||
similarity. | ||
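As a rough illustration of what a brute-force kNN search does, the sketch below scans every stored vector, scores it against the query, and keeps the top _k_. It is plain Python rather than a `script_score` query; the function names and the sample vectors are hypothetical and chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, docs, k):
    """Score every document against the query and return the k most similar.

    docs maps a document id to its vector. The cost grows linearly with the
    number of documents, which is why this approach is called brute force.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical sample data, not taken from the documentation above.
docs = {
    "1": [0.5, 10.0, 6.0],
    "2": [-0.5, 10.0, 10.0],
    "3": [1.0, 0.0, 0.0],
}
top = brute_force_knn([0.5, 10.0, 6.0], docs, k=2)
```

Because every document is visited, the result is exact; the indexed approach described next trades some of that accuracy for speed.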
|
||
In many cases, a brute-force kNN search is not efficient enough. For this | ||
reason, the `dense_vector` type supports indexing vectors into a specialized | ||
data structure to support fast kNN search. You can enable indexing through the | ||
`index` parameter: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT my-index-000002 | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"my_vector": { | ||
"type": "dense_vector", | ||
"dims": 3, | ||
"index": true, | ||
"similarity": "dot_product" <1> | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
<1> When `index` is enabled, you must define the vector similarity to use in kNN search. | ||
|
||
{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to | ||
support efficient kNN search. Like most kNN algorithms, HNSW is an approximate | ||
method that sacrifices result accuracy for improved speed. | ||
|
||
NOTE: Indexing vectors for approximate kNN search is an expensive process. It can take | ||
substantial time to ingest documents that contain vector fields with `index` | ||
enabled. | ||
|
||
[role="child_attributes"] | ||
[[dense-vector-params]] | ||
==== Parameters for dense vector fields | ||
I struggled with the formatting a bit here -- any corrections/suggestions are appreciated. The other field types use the
I think we should update the other field types to remove the I left a few comments to:
Thanks! I went through and accepted all of these (batched into a new commit).
|
||
|
||
The following mapping parameters are accepted: | ||
|
||
`dims`:: | ||
(Required, integer) | ||
Number of vector dimensions. Can't exceed `2048`. | ||
|
||
`index`:: | ||
(Optional, Boolean) | ||
If `true`, you can search this field using the kNN search API. Defaults to | ||
`false`. | ||
|
||
`similarity`:: | ||
(Required^*^, string) | ||
The vector similarity metric to use in kNN search. Documents are ranked by | ||
their vector field's similarity to the query vector. The `_score` of each | ||
document will be derived from the similarity, in a way that ensures scores are | ||
positive and that a larger score corresponds to a higher ranking. | ||
+ | ||
^*^ If `index` is `true`, this parameter is required. | ||
+ | ||
.Valid values for `similarity` | ||
[%collapsible%open] | ||
==== | ||
`l2_norm`::: | ||
Computes similarity based on the L^2^ distance (also known as Euclidean | ||
distance) between the vectors. The document `_score` is computed as | ||
`1 / (1 + l2_norm(query, vector)^2)`. | ||
|
||
`dot_product`::: | ||
Computes the dot product of two vectors. This option provides an optimized way | ||
to perform cosine similarity. In order to use it, all vectors must be of unit | ||
length, including both document and query vectors. The document `_score` is | ||
computed as `(1 + dot_product(query, vector)) / 2`. | ||
|
||
`cosine`::: | ||
Computes the cosine similarity. Note that the most efficient way to perform | ||
cosine similarity is to normalize all vectors to unit length, and instead use | ||
`dot_product`. You should only use `cosine` if you need to preserve the | ||
original vectors and cannot normalize them in advance. The document `_score` | ||
is computed as `(1 + cosine(query, vector)) / 2`. | ||
==== | ||
|
||
NOTE: Although they are conceptually related, the `similarity` parameter is | ||
different from <<text,`text`>> field <<similarity,`similarity`>> and accepts | ||
a distinct set of options. | ||
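The three `_score` formulas quoted for `l2_norm`, `dot_product`, and `cosine` can be checked by hand with a small sketch like the one below. This is plain Python written for illustration, not the {es} implementation; all function names are hypothetical. It also shows why `dot_product` on unit-length vectors gives the same score as `cosine` on the original vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def score_l2_norm(query, vector):
    # 1 / (1 + l2_norm(query, vector)^2)
    return 1.0 / (1.0 + l2_distance(query, vector) ** 2)


def score_dot_product(query, vector):
    # (1 + dot_product(query, vector)) / 2
    # Assumes both inputs are already unit length, as the text requires.
    return (1.0 + dot(query, vector)) / 2.0


def score_cosine(query, vector):
    # (1 + cosine(query, vector)) / 2
    cosine = dot(query, vector) / math.sqrt(dot(query, query) * dot(vector, vector))
    return (1.0 + cosine) / 2.0


# Normalizing first and using dot_product matches cosine on the raw vectors.
a, b = [3.0, 4.0], [6.0, 8.0]
same = math.isclose(score_dot_product(normalize(a), normalize(b)), score_cosine(a, b))
```

Each function maps the underlying similarity into a positive range where a larger value means a closer match, which is the property the parameter description asks for.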
|
||
`index_options`:: | ||
(Optional, object) | ||
An optional section that configures the kNN indexing algorithm. The HNSW | ||
algorithm has two internal parameters that influence how the data structure is | ||
built. These can be adjusted to improve the accuracy of results, at the | ||
expense of slower indexing speed. When `index_options` is provided, all of its | ||
properties must be defined. | ||
+ | ||
.Properties of `index_options` | ||
[%collapsible%open] | ||
==== | ||
`type`::: | ||
(Required, string) | ||
The type of kNN algorithm to use. Currently only `hnsw` is supported. | ||
|
||
`m`::: | ||
(Required, integer) | ||
The number of neighbors each node will be connected to in the HNSW graph. | ||
Defaults to `16`. | ||
|
||
`ef_construction`::: | ||
(Required, integer) | ||
The number of candidates to track while assembling the list of nearest | ||
neighbors for each new node. Defaults to `100`. | ||
==== |
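To see how the parameters above fit together, here is a minimal sketch that assembles a mapping body with `index`, `similarity`, and a fully specified `index_options` object. The helper `dense_vector_mapping` is hypothetical and not part of any {es} client; it only builds a Python dictionary and applies the two constraints stated in the text (the `2048` dimension limit and the three valid `similarity` values). The `m` and `ef_construction` values are the defaults quoted above.

```python
import json

VALID_SIMILARITIES = {"l2_norm", "dot_product", "cosine"}
MAX_DIMS = 2048  # limit stated in the `dims` description


def dense_vector_mapping(dims, similarity, m=16, ef_construction=100):
    """Build the mapping for one indexed dense_vector field (illustrative helper)."""
    if dims > MAX_DIMS:
        raise ValueError(f"dims can't exceed {MAX_DIMS}, got {dims}")
    if similarity not in VALID_SIMILARITIES:
        raise ValueError(f"unknown similarity: {similarity!r}")
    return {
        "type": "dense_vector",
        "dims": dims,
        "index": True,
        "similarity": similarity,
        # When index_options is provided, all of its properties are set.
        "index_options": {
            "type": "hnsw",
            "m": m,
            "ef_construction": ef_construction,
        },
    }


body = {"mappings": {"properties": {"my_vector": dense_vector_mapping(3, "dot_product")}}}
print(json.dumps(body, indent=2))
```

The printed JSON could serve as the body of a `PUT` index request like the ones shown earlier; raising either HNSW value is expected to improve accuracy at the cost of slower indexing, as the parameter description notes.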
Sorry for accidentally deleting this. I was deleting some of my own comments from a pending review and accidentally clicked the wrong one. 🤦
The link will help, but this feels a little cryptic to me. Should we mention the kNN search API directly?
I agree it's unclear right now. I'll improve this in the follow-up PR where I document the kNN search API.