[DOCS] Add high-level guide for kNN search #80857
Conversation
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-search (Team:Search)
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
@giladgal Let me know if we want to highlight any other use cases here.
@jrodewig Thanks for your work! This looks good overall, I've left a couple of small comments.
Thanks for performing the changes. LGTM.
@jrodewig Thank you for iterating, this PR LGTM!
Thanks @jrodewig, this is a nice addition! I like the terminology and overall consistency (I noticed we always stick with "similarity" instead of "distance", for example).
* Relevance ranking based on natural language processing (NLP) algorithms
* Product recommendations and recommendation engines
* Similarity search for images or videos
* Outlier detection and classification
Personally I would omit "outlier detection and classification", since it's really not a use case we've designed for or thought deeply about yet.
exchange for slower searches.

Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. The vector function has to scan each matched
Saying "The vector function has to scan..." feels a little confusing -- I think of the "query" as having to do the scan, and for each document it applies a vector function. Maybe we could say "The script_score
query scans each matched document and computes the vector function, which can result in slow search speeds."
I updated this paragraph to:

Exact, brute-force kNN guarantees accurate results but doesn't scale well with large, unfiltered datasets. With this approach, a `script_score` query must scan each matched document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using Query DSL to limit the number of matched documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.
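For illustration, a filtered exact kNN search along those lines might look like the following sketch. The index, field, and filter names (`my-index`, `my_vector`, `category`) are placeholders, and the query assumes a `dense_vector` field whose dimensions match the query vector:

[source,console]
----
POST my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": { "category": "clothing" }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.5, 10, 6] }
      }
    }
  }
}
----

The `term` filter restricts which documents ever reach the vector function, which is where the latency improvement described above comes from.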
[[knn-prereqs]]
=== Prerequisites

* To run a kNN search, you must be able to convert your data into meaningful
This may be a little confusing, because we talk about `dense_vector` fields but then mention passing them to a query (which doesn't really make sense). Maybe this could be phrased something like this...

"To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of {es} and add them to documents through the <<dense-vector,dense_vector>> field. Queries must also be represented as vectors with the same dimension. The vectors should be designed so that the closer a document is to the query vector according to the similarity metric, the better its match."
Thanks for the suggestion. I updated this to:

To run a kNN search, you must be able to convert your data into meaningful vector values. You create these vectors outside of Elasticsearch and add them to documents as `dense_vector` field values. Queries are represented as vectors with the same dimension. Design your vectors so that the closer a document's vector is to a query vector, based on a similarity metric, the better its match.
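As a rough sketch of that prerequisite (index name, field name, and dimension count are invented for the example), the vectors are stored through a `dense_vector` mapping; the `index` and `similarity` settings are only needed if the field will also be used for approximate kNN search:

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}

PUT my-index/_doc/1
{
  "my_vector": [0.5, 10, 6]
}
----

Any query vector later passed to a search against this field must use the same number of dimensions (here, 3).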
[discrete]
[[approximate-knn]]
=== Approximate kNN
The phrase and acronym the community uses for this is "approximate nearest neighbor (ANN) search". However we called the endpoint `_knn_search`, so I understand the motivation here and think this is a good compromise.
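For reference, a request to that endpoint has roughly this shape (the index and field names are hypothetical, and the vector dimensions must match the `dense_vector` mapping):

[source,console]
----
GET my-index/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.5, 10, 6],
    "k": 10,
    "num_candidates": 100
  }
}
----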
[[approximate-knn-limitations]]
==== Limitations for approximate kNN search

* You can't currently use Query DSL to filter the documents on which an
Could we just say "You can't currently use Query DSL to filter the documents on which an approximate kNN runs."? I'm not sure what is meant by "... or the results of an approximate kNN search".
This was my attempt to address pre-filtering and post-filtering. But since neither is supported, we can probably just simply state that. I updated this bullet to:
You can't currently use Query DSL to filter documents for an approximate kNN search. If you need to filter the documents, consider using exact kNN instead.
[discrete]
[[approximate-knn-limitations]]
==== Limitations for approximate kNN search
There are some other limitations -- it currently doesn't work with filtered index aliases or nested documents.
Thanks for pointing this out. I'll add them.
* You can't currently use Query DSL to filter the documents on which an
approximate kNN runs or the results of an approximate kNN search. If you need to
restrictively filter the documents on which a kNN search runs, consider using
I'm not sure we should say "restrictively filter" instead of just "filter". From my perspective, it currently does not make sense to perform ANN with any filter, even if it is non-restrictive. It is very hard for a user to implement correctly -- with the straightforward approach (postfiltering), they could easily end up with fewer than k results even when k are available.
[[tune-approximate-knn-for-speed-accuracy]]
==== Tune approximate kNN for speed or accuracy

For faster searches, the kNN search API collects `num_candidates` results from
To me the "For faster searches..." phrasing feels confusing -- what are we comparing against? The description also isn't totally accurate, since we filter down to `k` before merging results across shards.

Here's an idea for rewording: "On each shard, the kNN search API first finds an approximate set of nearest neighbor candidates of size `num_candidates`. It then computes the true vector similarity to each candidate, and selects the closest `k`. Finally, the best `k` results from each shard are merged together to find a global top `k`."
Thanks for the suggestion. I adapted it to:

To gather results, the kNN search API finds a `num_candidates` number of approximate nearest neighbor candidates on each shard. The search computes the similarity of each shard's candidate vectors to the query vector, selecting the `k` most similar results from each shard. The search then merges the results from each shard to return the global top `k` nearest neighbors.
This looks good to me!
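To make the speed/accuracy trade-off concrete, here is a sketch against the same hypothetical index as above: keeping `k` fixed while raising `num_candidates` makes each shard consider more candidates, which generally improves accuracy at the cost of slower searches (the values below are arbitrary).

[source,console]
----
GET my-index/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.5, 10, 6],
    "k": 10,
    "num_candidates": 500
  }
}
----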
* You can't run an approximate kNN search on a `dense_vector` field within a
<<nested,`nested`>> mapping.

* You can't currently use Query DSL to filter documents for an approximate kNN
Same small comment, should this be "the Query DSL" ?
Exact, brute-force kNN guarantees accurate results but doesn't scale well with
large, unfiltered datasets. With this approach, a `script_score` query must scan
each matched document to compute the vector function, which can result in slow
search speeds. However, you can improve latency by using Query DSL to limit the
Super small comment, should this be "the Query DSL" and have a link?
Thanks @jtibshirani!
Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch.
Preview
https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html