[DOCS] Add high-level guide for kNN search #80857
Conversation
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-search (Team:Search)
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
@giladgal Let me know if we want to highlight any other use cases here.
@jrodewig Thanks for your work! This looks good overall, I've left a couple of small comments.
Thanks for performing the changes. LGTM.
@jrodewig Thank you for iterating, this PR LGTM!
Thanks @jrodewig, this is a nice addition! I like the terminology and overall consistency (I noticed we always stick with "similarity" instead of "distance", for example).
* Relevance ranking based on natural language processing (NLP) algorithms
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
Personally I would omit "outlier detection and classification", since it's really not a use case we've designed for or thought deeply about yet.
exchange for slower searches.

Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. The vector function has to scan each matched
Saying "The vector function has to scan..." feels a little confusing -- I think of the "query" as having to do the scan, and for each document it applies a vector function. Maybe we could say "The script_score
query scans each matched document and computes the vector function, which can result in slow search speeds."
I updated this paragraph to:

Exact, brute-force kNN guarantees accurate results but doesn't scale well with large, unfiltered datasets. With this approach, a `script_score` query must scan each matched document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using Query DSL to limit the number of matched documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.
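For illustration, a filtered exact kNN search along those lines might look like the following sketch. The index, field, and filter names (`my-index`, `my_vector`, `category`) are placeholders, and the query assumes a `dense_vector` field whose dimensions match the query vector:

[source,console]
----
POST my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": { "category": "clothing" }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.5, 10, 6] }
      }
    }
  }
}
----

The `term` filter restricts which documents ever reach the vector function, which is where the latency improvement described above comes from.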
[[knn-prereqs]]
=== Prerequisites

* To run a kNN search, you must be able to convert your data into meaningful
This may be a little confusing, because we talk about `dense_vector` fields but then mention passing them to a query (which doesn't really make sense). Maybe this could be phrased something like this...

"To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of {es} and add them to documents through the <<dense-vector,dense_vector>> field. Queries must also be represented as vectors with the same dimension. The vectors should be designed so that the closer a document is to the query vector according to the similarity metric, the better its match."
Thanks for the suggestion. I updated this to:

To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of Elasticsearch and add them to documents as `dense_vector` field values. Queries are represented as vectors with the same dimension. Design your vectors so that the closer a document's vector is to a query vector, based on a similarity metric, the better its match.
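As a rough sketch of that prerequisite (index name, field name, and dimension count are invented for the example), the vectors are stored through a `dense_vector` mapping; the `index` and `similarity` settings are only needed if the field will also be used for approximate kNN search:

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}

PUT my-index/_doc/1
{
  "my_vector": [0.5, 10, 6]
}
----

Any query vector later passed to a search against this field must use the same number of dimensions (here, 3).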
[discrete]
[[approximate-knn]]
=== Approximate kNN
The phrase and acronym the community uses for this is "approximate nearest neighbor (ANN) search". However we called the endpoint `_knn_search`, so I understand the motivation here and think this is a good compromise.
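For reference, a request to that endpoint has roughly this shape (the index and field names are hypothetical, and the vector dimensions must match the `dense_vector` mapping):

[source,console]
----
GET my-index/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.5, 10, 6],
    "k": 10,
    "num_candidates": 100
  }
}
----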
[[approximate-knn-limitations]]
==== Limitations for approximate kNN search

* You can't currently use Query DSL to filter the documents on which an
Could we just say "You can't currently use Query DSL to filter the documents on which an approximate kNN runs."? I'm not sure what is meant by "... or the results of an approximate kNN search".
This was my attempt to address pre-filtering and post-filtering. But since neither is supported, we can probably just simply state that. I updated this bullet to:
You can't currently use Query DSL to filter documents for an approximate kNN search. If you need to filter the documents, consider using exact kNN instead.
[discrete]
[[approximate-knn-limitations]]
==== Limitations for approximate kNN search
There are some other limitations -- it currently doesn't work with filtered index aliases or nested documents.
Thanks for pointing this out. I'll add them.
* You can't currently use Query DSL to filter the documents on which an
approximate kNN runs or the results of an approximate kNN search. If you need to
restrictively filter the documents on which a kNN search runs, consider using
I'm not sure we should say "restrictively filter" instead of just "filter". From my perspective, it currently does not make sense to perform ANN with any filter, even if it is non-restrictive. It is very hard for a user to implement correctly -- with the straightforward approach (postfiltering), they could easily end up with fewer than k results even when k are available.
[[tune-approximate-knn-for-speed-accuracy]]
==== Tune approximate kNN for speed or accuracy

For faster searches, the kNN search API collects `num_candidates` results from
To me the "For faster searches..." phrasing feels confusing -- what are we comparing against? The description also isn't totally accurate, since we filter down to `k` before merging results across shards.

Here's an idea for rewording: "On each shard, the kNN search API first finds an approximate set of nearest neighbor candidates of size `num_candidates`. It then computes the true vector similarity to each candidate, and selects the closest `k`. Finally, the best `k` results from each shard are merged together to find a global top `k`."
Thanks for the suggestion. I adapted it to:

To gather results, the kNN search API finds a `num_candidates` number of approximate nearest neighbor candidates on each shard. The search computes the similarity of each shard's candidate vectors to the query vector, selecting the `k` most similar results from each shard. The search then merges the results from each shard to return the global top `k` nearest neighbors.
This looks good to me!
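To make the speed/accuracy trade-off concrete, here is a sketch against the same hypothetical index as above: keeping `k` fixed while raising `num_candidates` makes each shard consider more candidates, which generally improves accuracy at the cost of slower searches (the values below are arbitrary).

[source,console]
----
GET my-index/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.5, 10, 6],
    "k": 10,
    "num_candidates": 500
  }
}
----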
* You can't run an approximate kNN search on a `dense_vector` field within a
<<nested,`nested`>> mapping.

* You can't currently use Query DSL to filter documents for an approximate kNN
Same small comment, should this be "the Query DSL" ?
Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. With this approach, a `script_score` query must scan
each matched document to compute the vector function, which can result in slow
search speeds. However, you can improve latency by using Query DSL to limit the
Super small comment, should this be "the Query DSL" and have a link?
Thanks @jtibshirani!
Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch.
Preview
https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html