Add quantization techniques and links to byte vector #4893

kolchfa-aws · 2023-08-25T02:14:02Z

Add quantization techniques and links to byte vector

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

naveentatikonda · 2023-08-25T16:06:25Z

_field-types/supported-field-types/knn-vector.md

-{% include copy-curl.html %}
+{% include copy-curl.html %}
+
+### Quantization techniques


Can we also include the pseudocode used for the CosineSimilarity SpaceType, along with the reason for using a different technique for a different type of dataset (angular)? That would emphasize the importance of choosing a quantization technique based on the type of data.

@naveentatikonda Done. Please review when you get a chance.

@kolchfa-aws Added a few comments. Please take a look.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

naveentatikonda · 2023-08-29T00:47:01Z

_field-types/supported-field-types/knn-vector.md

+
+For Euclidean datasets, we recommend using a scalar quantization technique with L2 space type because Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same, which means $$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$.
+
+The following example pseudocode illustrates scalar quantization for the L2 space type:


We shouldn't explain it this way. Can you rephrase it to something like this:

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets (uses L2 space type). Euclidean distance is shift invariant: if you shift both $$x$$ and $$y$$ by the same $$z$$, then the distance remains the same, which means $$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$.

@naveentatikonda To clarify the wording in parentheses, what uses L2 space type? The scalar quantization technique?

No, the Euclidean datasets use the L2 space type. The quantization technique or algorithm depends on the space type.
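The shift-invariance argument in this thread can be checked numerically. The following is a minimal Python sketch, not the pseudocode added in this PR: the helper names `l2_distance` and `quantize_l2` are hypothetical, and the min-max mapping onto the signed byte range [-128, 127] is an assumed scheme chosen only because shifting every value by the dataset minimum does not change L2 distances.

```python
import math


def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def quantize_l2(vectors, lo=-128, hi=127):
    """Assumed min-max scalar quantization into the signed byte range.

    Every value is shifted by the dataset minimum and scaled by one common
    factor, so relative L2 distances are preserved up to rounding error.
    """
    flat = [v for vec in vectors for v in vec]
    vmin, vmax = min(flat), max(flat)
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant data
    scale = (hi - lo) / span
    quantized = []
    for vec in vectors:
        row = []
        for v in vec:
            b = int(round((v - vmin) * scale)) + lo
            row.append(max(lo, min(hi, b)))  # clamp as a safety net
        quantized.append(row)
    return quantized


x, y, z = [1.0, 2.0], [4.0, 6.0], [10.0, -3.0]
shifted_x = [a - c for a, c in zip(x, z)]
shifted_y = [b - c for b, c in zip(y, z)]
original = l2_distance(x, y)
shifted = l2_distance(shifted_x, shifted_y)
```

Here `original` and `shifted` agree, which is the property that makes a shift-based mapping reasonable for Euclidean data.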

naveentatikonda · 2023-08-29T00:50:17Z

_field-types/supported-field-types/knn-vector.md

+
+For angular datasets, we recommend using a scalar quantization technique with cosine similarity because cosine distance is not shift invariant ($$cos(x, y) \neq cos(x-z, y-z)$$). 
+
+The following example pseudocode illustrates scalar quantization for the cosine similarity space type:


Similarly, we need to rephrase this:

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Angular Datasets (uses cosine similarity space type). Cosine similarity is not shift invariant ($$cos(x, y) \neq cos(x-z, y-z)$$).
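The contrast with the L2 case can be seen in a small Python sketch. This is an illustration under stated assumptions, not the pseudocode from this PR: `cosine_similarity` and `quantize_cosine` are hypothetical helpers, and the sign-preserving scaling (positive values onto [0, 127], negative values onto [-128, 0], with no shift) is an assumed scheme motivated only by the fact that cosine similarity tolerates positive scaling but not shifting.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def quantize_cosine(vectors):
    """Assumed sign-preserving scalar quantization (no shift applied)."""
    flat = [v for vec in vectors for v in vec]
    max_pos = max((v for v in flat if v > 0), default=0.0)
    max_neg = max((-v for v in flat if v < 0), default=0.0)

    def to_byte(v):
        if v > 0:
            return min(127, int(round(v / max_pos * 127)))
        if v < 0:
            return max(-128, -int(round(-v / max_neg * 128)))
        return 0

    return [[to_byte(v) for v in vec] for vec in vectors]


x, y, z = [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]
before = cosine_similarity(x, y)
after_shift = cosine_similarity(
    [a - c for a, c in zip(x, z)], [b - c for b, c in zip(y, z)]
)
after_scale = cosine_similarity([3 * a for a in x], [3 * b for b in y])
```

Shifting turns two orthogonal vectors into vectors with a similarity of 0.8, while scaling leaves the similarity unchanged, which is why a shift-based mapping is a poor fit for angular data.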

naveentatikonda · 2023-08-29T00:53:31Z

_field-types/supported-field-types/knn-vector.md

+
+return Byte(bval)
+
+// For Negative Numbers


Starting from `For Negative Numbers`, can we break it into a separate code block to make it easier to understand?

naveentatikonda · 2023-08-29T00:53:39Z

_field-types/supported-field-types/knn-vector.md

+
+return Byte(bval)
+
+// For Negative Numbers


nit:

Suggested change

// For Negative Numbers

# For Negative Numbers

naveentatikonda · 2023-08-29T00:54:45Z

_field-types/supported-field-types/knn-vector.md

+### Quantization techniques
+
+If your vectors are of type `float`, you need to first convert them to `byte` before ingesting the documents. This conversion is accomplished by _quantizing the dataset_---reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you're using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the [k-NN benchmarking test](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool) data for the [L2](#scalar-quantization-for-the-l2-space-type) and [cosine similarity](#scalar-quantization-for-the-cosine-similarity-space-type) space types.
+


Can we add a note or statement saying that the quantization techniques below are for reference?
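For reference, the storage effect of converting `float` values to `byte` values can be illustrated with Python's standard `array` module. This is only a sketch: the sample values are made up, and the byte values are hypothetical pre-quantized numbers rather than the output of any technique described in this PR.

```python
from array import array

# Hypothetical 4-dimensional vector stored as 32-bit floats.
float_vector = array("f", [0.12, -0.5, 0.98, -1.0])

# Hypothetical quantized counterpart stored as signed bytes.
byte_vector = array("b", [15, -64, 124, -128])

float_size = float_vector.itemsize * len(float_vector)
byte_size = byte_vector.itemsize * len(byte_vector)
```

Each `float` element occupies four bytes and each `byte` element occupies one, so the raw vector data shrinks by a factor of four before any index overhead is considered.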

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

naveentatikonda · 2023-08-29T16:59:09Z

PR Looks Good. @kolchfa-aws Thanks for making the changes


Naarcha-AWS

A few small comments, otherwise LGTM

Naarcha-AWS · 2023-08-29T17:20:26Z

_field-types/supported-field-types/knn-vector.md

+
+#### Scalar quantization for the L2 space type
+
+The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets with the L2 space type. Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same ($$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$).


A shift invariant? Should we define shift invariant, or do we do so earlier?

It's defined in the following sentence.

Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

natebower

@kolchfa-aws Please see my comments and changes and let me know if you have any questions or would like me to read any revisions. Thanks!


natebower · 2023-08-29T17:27:57Z

_field-types/supported-field-types/knn-vector.md

 ```

-However, if you intend to just use painless scripting or a k-NN score script, you only need to pass the dimension.
+However, if you intend to just use Painless scripting or a k-NN score script, you only need to pass the dimension.


I would either replace "just" with "only" or delete "just".


natebower · 2023-08-29T17:30:31Z

_field-types/supported-field-types/knn-vector.md

 Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
 {: .note}

+In [k-NN benchmarking tests](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool), the use of `byte` rather than `float` vectors resulted in a significant reduction in storage and memory usage while also improving indexing throughput and reducing query latency. Additionally, precision on recall was not greatly affected (note that recall can depend on various factors, such as the [quantization technique](#quantization-techniques) used and data distribution). 


This reads as though there should be some words before "data distribution". Something like "such as the quantization technique used and the type of data distribution."


natebower · 2023-08-29T17:35:43Z

_search-plugins/knn/knn-index.md


+## Lucene byte vector
+
+Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to save storage space. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).


Something like "reduce the amount of storage space needed" instead of "save storage space"? To "save storage space" means to set it aside for another purpose.

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

_field-types/supported-field-types/knn-vector.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Add quantization techniquest and links to byte vector Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Added cosine similarity space type quantization Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Rewording Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Tech review feedback Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Tech review feedback Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Remove redundant line Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Update _field-types/supported-field-types/knn-vector.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _field-types/supported-field-types/knn-vector.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Editorial feedback Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> --------- Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower <nbower@amazon.com> (cherry picked from commit de5f5ae) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>


Add quantization techniquest and links to byte vector

52d0988

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws requested review from AMoo-Miki, Naarcha-AWS, ananzh, cwillum, hdhalter, natebower, seanneumann and vagimeli as code owners August 25, 2023 02:14

kolchfa-aws self-assigned this Aug 25, 2023

naveentatikonda suggested changes Aug 25, 2023


kolchfa-aws added 2 commits August 25, 2023 13:19

Added cosine similarity space type quantization

ff27d5f

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Rewording

8275b54

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws changed the title ~~Add quantization techniquest and links to byte vector~~ Add quantization techniques and links to byte vector Aug 25, 2023

naveentatikonda suggested changes Aug 29, 2023


kolchfa-aws and others added 4 commits August 29, 2023 12:25

Tech review feedback

c75e4aa

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Tech review feedback

603926c

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Remove redundant line

6ebc5b3

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Merge branch 'main' into knn-field-update

49b2b9c

naveentatikonda approved these changes Aug 29, 2023


Naarcha-AWS reviewed Aug 29, 2023


Naarcha-AWS approved these changes Aug 29, 2023


Update _field-types/supported-field-types/knn-vector.md

10e10cc

Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

natebower reviewed Aug 29, 2023


Apply suggestions from code review

e54536e

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws commented Aug 29, 2023


kolchfa-aws and others added 2 commits August 29, 2023 13:40

Update _field-types/supported-field-types/knn-vector.md

1d3f3a4

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Editorial feedback

201ef6d

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws merged commit de5f5ae into main Aug 29, 2023

kolchfa-aws added the backport 2.9 label Aug 29, 2023

opensearch-trigger-bot bot mentioned this pull request Aug 29, 2023

[Backport 2.9] Add quantization techniques and links to byte vector #4940

Merged

Naarcha-AWS deleted the knn-field-update branch March 28, 2024 23:20


