Update dense_vector docs with kNN indexing options #80306

Merged
merged 8 commits into from
Nov 4, 2021
10 changes: 3 additions & 7 deletions docs/reference/mapping/params/index-options.asciidoc
@@ -2,13 +2,9 @@
=== `index_options`

The `index_options` parameter controls what information is added to the
inverted index for search and highlighting purposes.

[WARNING]
====
The `index_options` parameter is intended for use with <<text,`text`>> fields
only. Avoid using `index_options` with other field data types.
====
inverted index for search and highlighting purposes. Only term-based field
types like <<text,`text`>> and <<keyword,`keyword`>> support this
configuration.

The parameter accepts one of the following values. Each value retrieves
information from the previous listed values. For example, `freqs` contains
10 changes: 5 additions & 5 deletions docs/reference/mapping/params/similarity.asciidoc
@@ -1,12 +1,12 @@
[[similarity]]
=== `similarity`

Elasticsearch allows you to configure a scoring algorithm or _similarity_ per
field. The `similarity` setting provides a simple way of choosing a similarity
algorithm other than the default `BM25`, such as `boolean`.
{es} allows you to configure a text scoring algorithm or _similarity_
per field. The `similarity` setting provides a simple way of choosing a
text similarity algorithm other than the default `BM25`, such as `boolean`.

Similarities are mostly useful for <<text,`text`>> fields, but can also apply
to other field types.
Only text-based field types like <<text,`text`>> and <<keyword,`keyword`>>
support this configuration.

Custom similarities can be configured by tuning the parameters of the built-in
similarities. For more details about these expert options, see the
137 changes: 129 additions & 8 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -6,14 +6,14 @@
<titleabbrev>Dense vector</titleabbrev>
++++

A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
not exceed 2048. A `dense_vector` field is a single-valued field.
The `dense_vector` field type stores dense vectors of float values.

`dense_vector` fields do not support querying, sorting or aggregating. They can
only be accessed in scripts through the dedicated <<vector-functions,vector functions>>.
You can use `dense_vector` fields in
<<query-dsl-script-score-query,`script_score`>> queries to score documents.
They can also be indexed to support efficient k-nearest neighbor search. Dense
Contributor

k-nearest neighbor search
Note from @jtibshirani: Link this to the knn search API docs when they're up (in a follow-up PR).

Sorry for accidentally deleting this. I was deleting some of my own comments from a pending review and accidentally clicked the wrong one. 🤦

Contributor

efficient k-nearest neighbor search

The link will help, but this feels a little cryptic to me. Should we mention the kNN search API directly?

Contributor Author

I agree it's unclear right now. I'll improve this in the follow-up PR where I document the kNN search API.

vector fields do not support aggregations, sorting, or other query types.

You index a dense vector as an array of floats.
You add a `dense_vector` field as an array of floats:

[source,console]
--------------------------------------------------
@@ -23,7 +23,7 @@ PUT my-index-000001
"properties": {
"my_vector": {
"type": "dense_vector",
"dims": 3 <1>
"dims": 3
},
"my_text" : {
"type" : "keyword"
@@ -46,4 +46,125 @@ PUT my-index-000001/_doc/2

--------------------------------------------------

<1> dims – the number of dimensions in the vector, required parameter.
NOTE: Unlike most other data types, dense vectors are always single-valued.
It is not possible to store multiple values in one `dense_vector` field.

[[index-vectors-knn-search]]
==== Index vectors for kNN search

experimental::[]

A _k-nearest neighbor_ (kNN) search finds the _k_ nearest
vectors to a query vector, as measured by a similarity metric.

Dense vector fields can be used to rank documents in
<<query-dsl-script-score-query,`script_score` queries>>. This lets you perform
a brute-force kNN search by scanning all documents and ranking them by
similarity.
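
As a rough illustration of what a brute-force kNN search does, the following Python sketch scores every stored vector against the query and keeps the _k_ best. It is not how {es} implements `script_score`; the helper name `brute_force_knn` and the sample vectors are hypothetical and chosen only for this example.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def brute_force_knn(
    docs: Dict[str, Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best.

    Hypothetical helper: the cost grows linearly with the number of
    documents, because no document is skipped.
    """
    scored = ((doc_id, dot(vec, query)) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up sample vectors with dims = 3.
docs = {
    "1": [0.5, 10.0, 6.0],
    "2": [-0.5, 10.0, 10.0],
    "3": [1.0, 0.0, 0.0],
}
top = brute_force_knn(docs, query=[0.0, 1.0, 1.0], k=2)
```

Because every document is visited on each search, the work scales with the size of the index, which is the cost that indexing the vectors is meant to avoid.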

In many cases, a brute-force kNN search is not efficient enough. For this
reason, the `dense_vector` type supports indexing vectors into a specialized
data structure to support fast kNN search. You can enable indexing through the
`index` parameter:

[source,console]
--------------------------------------------------
PUT my-index-000002
{
"mappings": {
"properties": {
"my_vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "dot_product" <1>
}
}
}
}
--------------------------------------------------
<1> When `index` is enabled, you must define the vector similarity to use in kNN search

{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to
support efficient kNN search. Like most kNN algorithms, HNSW is an approximate
method that sacrifices result accuracy for improved speed.

NOTE: Indexing vectors for approximate kNN search is an expensive process. It can take
substantial time to ingest documents that contain vector fields with `index`
enabled.

[role="child_attributes"]
[[dense-vector-params]]
==== Parameters for dense vector fields
Contributor Author

I struggled with the formatting a bit here -- any corrections/ suggestions are appreciated. The other field types use the [horizontal] styling, but I couldn't get this to work with the sublists and code snippets.

Contributor

I think we should update the other field types to remove the [horizontal] attribute and use formatting similar to the parameter definitions in our API docs.

I left a few comments to:

  • Add collapsible sections for nested value/properties.
  • Add required and data type for each parameter.

Contributor Author

Thanks! I went through and accepted all of these (batched into a new commit).


The following mapping parameters are accepted:

`dims`::
(Required, integer)
Number of vector dimensions. Can't exceed `2048`.

`index`::
(Optional, Boolean)
If `true`, you can search this field using the kNN search API. Defaults to
`false`.

`similarity`::
(Required^*^, string)
The vector similarity metric to use in kNN search. Documents are ranked by
their vector field's similarity to the query vector. The `_score` of each
document will be derived from the similarity, in a way that ensures scores are
positive and that a larger score corresponds to a higher ranking.
+
^*^ If `index` is `true`, this parameter is required.
+
.Valid values for `similarity`
[%collapsible%open]
====
`l2_norm`:::
Computes similarity based on the L^2^ distance (also known as Euclidean
distance) between the vectors. The document `_score` is computed as
`1 / (1 + l2_norm(query, vector)^2)`.

`dot_product`:::
Computes the dot product of two vectors. This option provides an optimized way
to perform cosine similarity. In order to use it, all vectors must be of unit
length, including both document and query vectors. The document `_score` is
computed as `(1 + dot_product(query, vector)) / 2`.

`cosine`:::
Computes the cosine similarity. Note that the most efficient way to perform
cosine similarity is to normalize all vectors to unit length, and instead use
`dot_product`. You should only use `cosine` if you need to preserve the
original vectors and cannot normalize them in advance. The document `_score`
is computed as `(1 + cosine(query, vector)) / 2`.
====

NOTE: Although they are conceptually related, the `similarity` parameter is
different from <<text,`text`>> field <<similarity,`similarity`>> and accepts
a distinct set of options.
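
The `_score` formulas listed for each `similarity` value can be checked by hand. The Python sketch below reproduces them outside of {es}; the function names are hypothetical, and the two sample vectors are made up and already normalized to unit length so that `dot_product` and `cosine` agree.

```python
import math
from typing import Sequence


def dot_product(q: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(q, v))


def l2_norm(q: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between q and v."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, v)))


def cosine(q: Sequence[float], v: Sequence[float]) -> float:
    return dot_product(q, v) / (
        math.sqrt(dot_product(q, q)) * math.sqrt(dot_product(v, v))
    )


# Score transformations as stated in the parameter descriptions.
def score_l2_norm(q, v) -> float:
    return 1.0 / (1.0 + l2_norm(q, v) ** 2)


def score_dot_product(q, v) -> float:
    # Only meaningful when both vectors have unit length.
    return (1.0 + dot_product(q, v)) / 2.0


def score_cosine(q, v) -> float:
    return (1.0 + cosine(q, v)) / 2.0


query = [1.0, 0.0, 0.0]   # unit length
vector = [0.6, 0.8, 0.0]  # unit length: 0.36 + 0.64 = 1

scores = {
    "l2_norm": score_l2_norm(query, vector),
    "dot_product": score_dot_product(query, vector),
    "cosine": score_cosine(query, vector),
}
```

For these inputs the dot product and the cosine are both 0.6, so both scores come out at 0.8, while the squared L2 distance is 0.8, giving a score of 1 / 1.8. All three transformations keep the score positive and make a closer vector score higher.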

`index_options`::
(Optional, object)
An optional section that configures the kNN indexing algorithm. The HNSW
algorithm has two internal parameters that influence how the data structure is
built. These can be adjusted to improve the accuracy of results, at the
expense of slower indexing speed. When `index_options` is provided, all of its
properties must be defined.
+
.Properties of `index_options`
[%collapsible%open]
====
`type`:::
(Required, string)
The type of kNN algorithm to use. Currently only `hnsw` is supported.

`m`:::
(Required, integer)
The number of neighbors each node will be connected to in the HNSW graph.
Defaults to `16`.

`ef_construction`:::
(Required, integer)
The number of candidates to track while assembling the list of nearest
neighbors for each new node. Defaults to `100`.
====
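
To make the parameter rules above concrete, here is a hedged Python sketch that builds a mapping body with `index_options` and applies the constraints described in this section on the client side. The helper `validate_dense_vector_mapping` is hypothetical and is not part of any {es} client; the index and field names follow the earlier examples, and the `m` and `ef_construction` values are simply the documented defaults.

```python
import json
from typing import Any, Dict

REQUIRED_INDEX_OPTIONS = ("type", "m", "ef_construction")


def validate_dense_vector_mapping(field: Dict[str, Any]) -> None:
    """Hypothetical client-side check of the rules described above."""
    if field.get("type") != "dense_vector":
        raise ValueError("not a dense_vector field")

    dims = field.get("dims")
    if not isinstance(dims, int) or isinstance(dims, bool):
        raise ValueError("`dims` is required and must be an integer")
    if dims > 2048:
        raise ValueError("`dims` can't exceed 2048")

    # `similarity` becomes required once `index` is true.
    if field.get("index", False) and "similarity" not in field:
        raise ValueError("`similarity` is required when `index` is true")

    options = field.get("index_options")
    if options is not None:
        # When `index_options` is provided, all of its properties must be set.
        missing = [name for name in REQUIRED_INDEX_OPTIONS if name not in options]
        if missing:
            raise ValueError(f"index_options is missing: {missing}")
        if options["type"] != "hnsw":
            raise ValueError("only `hnsw` is supported")


my_vector = {
    "type": "dense_vector",
    "dims": 3,
    "index": True,
    "similarity": "l2_norm",
    "index_options": {"type": "hnsw", "m": 16, "ef_construction": 100},
}
validate_dense_vector_mapping(my_vector)

# JSON body that could be sent with a `PUT <index>` request.
body = json.dumps({"mappings": {"properties": {"my_vector": my_vector}}}, indent=2)
```

The server remains the authority on what it accepts; a check like this only catches obvious omissions, such as a partial `index_options` object, before a request is sent.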