
[DOCS] Add high-level guide for kNN search #80857


Merged
jrodewig merged 9 commits into elastic:master on Nov 30, 2021

Conversation

@jrodewig (Contributor) commented Nov 18, 2021

Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch.

Preview

https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html

@jrodewig mentioned this pull request on Nov 18, 2021
@jrodewig added the :Search/Search, >docs, and v8.0.0-beta1 labels on Nov 19, 2021
@jrodewig marked this pull request as ready for review on Nov 19, 2021
@elasticmachine added the Team:Docs and Team:Search labels on Nov 19, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-docs (Team:Docs)

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-search (Team:Search)

Comment on lines 18 to 20
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
@jrodewig (Contributor, Author) commented:

@giladgal Let me know if we want to highlight any other use cases here.

@mayya-sharipova (Contributor) left a comment:

@jrodewig Thanks for your work! This looks good overall, I've left a couple of small comments.

@jrodewig (Contributor, Author) commented:

Hi @giladgal. I pushed 67591d5 to address some of the feedback we discussed. Let me know if any other changes are needed. Thanks!

@giladgal (Contributor) left a comment:

Thanks for making the changes. LGTM.

@mayya-sharipova (Contributor) left a comment:

@jrodewig Thank you for iterating, this PR LGTM!

@jtibshirani (Contributor) left a comment:

Thanks @jrodewig, this is a nice addition! I like the terminology and overall consistency (I noticed we always stick with "similarity" instead of "distance", for example).

* Relevance ranking based on natural language processing (NLP) algorithms
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
Contributor:

Personally I would omit "outlier detection and classification", since it's really not a use case we've designed for or thought deeply about yet.

exchange for slower searches.

Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. The vector function has to scan each matched
Contributor:

Saying "The vector function has to scan..." feels a little confusing -- I think of the "query" as having to do the scan, and for each document it applies a vector function. Maybe we could say "The script_score query scans each matched document and computes the vector function, which can result in slow search speeds."

@jrodewig (Contributor, Author) replied:

I updated this paragraph to:

Exact, brute-force kNN guarantees accurate results but doesn’t scale well with large, unfiltered datasets. With this approach, a script_score query must scan each matched document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using Query DSL to limit the number of matched documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.
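For illustration, here is a minimal sketch of the approach that paragraph describes, assuming a hypothetical index my-index with a dense_vector field my_vector and a hypothetical category filter (none of these names come from the PR):

# Hypothetical index, field, and filter; the filter limits the documents the script scans.
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": { "term": { "category": "clothing" } }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.5, 10.0, 6.0] }
      }
    }
  }
}

The filter runs before the script, so cosineSimilarity is computed only for the filtered subset of documents; the + 1.0 offset keeps scores non-negative, as script_score requires.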

[[knn-prereqs]]
=== Prerequisites

* To run a kNN search, you must be able to convert your data into meaningful
Contributor:

This may be a little confusing, because we talk about dense_vector fields but then mention passing them to a query (which doesn't really make sense). Maybe this could be phrased something like this...

"To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of {es} and add them to documents through the <<dense-vector,dense_vector>> field. Queries must also be represented as vectors with the same dimension. The vectors should be designed so that the closer a document is to the query vector according to the similarity metric, the better its match."

@jrodewig (Contributor, Author) replied:

Thanks for the suggestion. I updated this to:

To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of Elasticsearch and add them to documents as dense_vector field values. Queries are represented as vectors with the same dimension.

Design your vectors so that the closer a document’s vector is to a query vector, based on a similarity metric, the better its match.
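To make the prerequisite concrete, here is a minimal sketch of such a mapping, assuming a hypothetical index my-index and a 3-dimensional vector field my_vector; the index and similarity parameters are only needed if the field should support approximate kNN search:

# Hypothetical names; dims must match the dimension of your document and query vectors.
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}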


[discrete]
[[approximate-knn]]
=== Approximate kNN
Contributor:

The phrase and acronym the community uses for this is "approximate nearest neighbor (ANN) search". However we called the endpoint _knn_search, so I understand the motivation here and think this is a good compromise.

[[approximate-knn-limitations]]
==== Limitations for approximate kNN search

* You can't currently use Query DSL to filter the documents on which an
Contributor:

Could we just say "You can't currently use Query DSL to filter the documents on which an
approximate kNN runs." ? I'm not sure what is meant by "... or the results of an approximate kNN search".

@jrodewig (Contributor, Author) replied:

This was my attempt to address pre-filtering and post-filtering. But since neither is supported, we can probably just simply state that. I updated this bullet to:

You can't currently use Query DSL to filter documents for an approximate kNN search. If you need to filter the documents, consider using exact kNN instead.


[discrete]
[[approximate-knn-limitations]]
==== Limitations for approximate kNN search
Contributor:

There are some other limitations -- it currently doesn't work with filtered index aliases or nested documents.

@jrodewig (Contributor, Author) replied:

Thanks for pointing this out. I'll add them.


* You can't currently use Query DSL to filter the documents on which an
approximate kNN runs or the results of an approximate kNN search. If you need to
restrictively filter the documents on which a kNN search runs, consider using
Contributor:

I'm not sure we should say "restrictively filter" instead of just "filter". From my perspective, it currently does not make sense to perform ANN with any filter, even if it is non-restrictive. It is very hard for a user to implement correctly -- with the straightforward approach (postfiltering), they could easily end up with fewer than k results even when k are available.

[[tune-approximate-knn-for-speed-accuracy]]
==== Tune approximate kNN for speed or accuracy

For faster searches, the kNN search API collects `num_candidates` results from
Contributor:

To me the "For faster searches..." phrasing feels confusing -- what are we comparing against? The description also isn't totally accurate, since we filter down to k before merging results across shards.

Here's an idea for rewording: "On each shard, the kNN search API first finds an approximate set of nearest neighbor candidates of size num_candidates. It then computes the true vector similarity to each candidate, and selects the closest k. Finally, the best k results from each shard are merged together to find a global top k."

@jrodewig (Contributor, Author) replied:

Thanks for the suggestion. I adapted it to:

To gather results, the kNN search API finds a num_candidates number of approximate nearest neighbor candidates on each shard. The search computes the similarity of each shard’s candidate vectors to the query vector, selecting the k most similar results from each shard. The search then merges the results from each shard to return the global top k nearest neighbors.
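As a sketch of how k and num_candidates fit together in a request, reusing the hypothetical my-index and my_vector names from above:

# Hypothetical names; num_candidates must be at least k.
GET my-index/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2],
    "k": 10,
    "num_candidates": 100
  },
  "_source": ["name"]
}

Each shard gathers 100 candidates, computes their true similarity to the query vector, and keeps its 10 best; the per-shard results are then merged into the global top 10. Raising num_candidates improves accuracy at the cost of speed.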

@jtibshirani (Contributor) left a comment:

This looks good to me!

* You can't run an approximate kNN search on a `dense_vector` field within a
<<nested,`nested`>> mapping.

* You can't currently use Query DSL to filter documents for an approximate kNN
Contributor:

Same small comment, should this be "the Query DSL"?

Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. With this approach, a `script_score` query must scan
each matched document to compute the vector function, which can result in slow
search speeds. However, you can improve latency by using Query DSL to limit the
Contributor:

Super small comment, should this be "the Query DSL" and have a link?

@jrodewig (Contributor, Author) commented:

Thanks @jtibshirani!

@jrodewig merged commit 229d2d7 into elastic:master on Nov 30, 2021
@jrodewig deleted the docs__knn-guide branch on Nov 30, 2021
elasticsearchmachine pushed a commit that referenced this pull request Nov 30, 2021
Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch.

Relates to #78473.
Labels: >docs, :Search/Search, Team:Docs, Team:Search, v8.0.0-beta1, v8.1.0
6 participants