@kolchfa-aws
Collaborator

Add quantization techniques and links to byte vector

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
{% include copy-curl.html %}
{% include copy-curl.html %}

### Quantization techniques
Member

Can we also include the pseudocode used for the cosine similarity space type and the reason for using a different technique for a different type of dataset (angular)? That would emphasize the importance of choosing a quantization technique based on the type of data.

Collaborator Author

@naveentatikonda Done. Please review when you get a chance.

Member

@kolchfa-aws Added a few comments. Please take a look.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@kolchfa-aws kolchfa-aws changed the title Add quantization techniquest and links to byte vector Add quantization techniques and links to byte vector Aug 25, 2023

For Euclidean datasets, we recommend using a scalar quantization technique with L2 space type because Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same, which means $$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$.

The following example pseudocode illustrates scalar quantization for the L2 space type:
Member

We shouldn't explain it this way. Can you rephrase it to something like this:

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets (uses L2 space type). Euclidean distance is shift invariant: if you shift both $$x$$ and $$y$$ by the same $$z$$, then the distance remains the same, which means $$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$.

Collaborator Author

@naveentatikonda To clarify the wording in parentheses, what uses L2 space type? The scalar quantization technique?

Member

@naveentatikonda naveentatikonda Aug 29, 2023

No, the Euclidean datasets use the L2 space type. The quantization technique or algorithm depends on the space type.


For angular datasets, we recommend using a scalar quantization technique with cosine similarity because cosine distance is not shift invariant ($$cos(x, y) \neq cos(x-z, y-z)$$).

The following example pseudocode illustrates scalar quantization for the cosine similarity space type:
Member

Similarly, we need to rephrase this:

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Angular Datasets (uses cosine similarity space type). Cosine similarity is not shift invariant ($$cos(x, y) \neq cos(x-z, y-z)$$).
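For readers following this thread, here is a minimal Python sketch of why a shift-based scheme does not suit angular data and what a scale-only alternative could look like. It is an illustration under stated assumptions, not the pseudocode added in this PR: the helper names `cosine` and `quantize_cosine` are hypothetical, positive values are assumed to be scaled by the dataset maximum into [0, 127], and negative values by the absolute dataset minimum into [-128, 0].

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def quantize_cosine(vector, data_min, data_max):
    """Scale floats into [-128, 127] without shifting the origin.

    Assumes data_min < 0 < data_max, both taken from the training dataset.
    """
    out = []
    for x in vector:
        if x >= 0:
            b = round(x / data_max * 127)      # positives map to [0, 127]
        else:
            b = round(x / -data_min * 128)     # negatives map to [-128, 0]
        out.append(max(-128, min(127, b)))     # clip out-of-range query values
    return out

# Shifting changes the angle: orthogonal vectors stop being orthogonal.
x, y, z = [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]
x_shifted = [p - q for p, q in zip(x, z)]      # [2.0, 1.0]
y_shifted = [p - q for p, q in zip(y, z)]      # [1.0, 2.0]
print(cosine(x, y), cosine(x_shifted, y_shifted))

# Scale-only quantization keeps the angle approximately intact.
a, b = [0.3, -0.8, 0.5], [-0.2, 0.9, 0.1]
qa = quantize_cosine(a, data_min=-0.8, data_max=0.9)
qb = quantize_cosine(b, data_min=-0.8, data_max=0.9)
print(qa, qb, cosine(a, b), cosine(qa, qb))
```

Because positive and negative values use slightly different scale factors in this sketch, the quantized cosine similarity is close to, but not identical to, the original value.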


return Byte(bval)

// For Negative Numbers
Member

Starting from `// For Negative Numbers`, can we break this into a separate code block to make it easier to understand?


return Byte(bval)

// For Negative Numbers
Member

nit:

Suggested change
- // For Negative Numbers
+ # For Negative Numbers

### Quantization techniques

If your vectors are of type `float`, you need to first convert them to `byte` before ingesting the documents. This conversion is accomplished by _quantizing the dataset_---reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you're using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the [k-NN benchmarking test](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool) data for the [L2](#scalar-quantization-for-the-l2-space-type) and [cosine similarity](#scalar-quantization-for-the-cosine-similarity-space-type) space types.

Member

Can we add a note or statement saying that the quantization techniques below are for reference?

kolchfa-aws and others added 4 commits August 29, 2023 12:25
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@naveentatikonda
Member

PR Looks Good. @kolchfa-aws Thanks for making the changes

Contributor

@Naarcha-AWS Naarcha-AWS left a comment

A few small comments, otherwise LGTM


#### Scalar quantization for the L2 space type

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets with the L2 space type. Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same ($$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$).
Contributor

A shift invariant? Should we define shift invariant, or do we do so earlier?

Collaborator Author

It's defined in the following sentence.
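
To make the shift-invariance point in this thread concrete, the following is a minimal Python sketch of a min-max scalar quantizer for the L2 space type. It is an illustration only, not necessarily the pseudocode added in this PR: the helper names `fit_range` and `quantize_l2` are hypothetical, and the scheme (shift by the dataset minimum, normalize to [0, 1], bucket into 256 values, recenter to the signed byte range) is an assumption.

```python
import math

def fit_range(dataset):
    """Return the global minimum and maximum over all coordinates."""
    lo = min(min(v) for v in dataset)
    hi = max(max(v) for v in dataset)
    return lo, hi

def quantize_l2(vector, lo, hi, buckets=256):
    """Map each float coordinate to an integer in [-128, 127]. Assumes hi > lo."""
    out = []
    for x in vector:
        x = min(max(x, lo), hi)                 # clip queries outside the fitted range
        unit = (x - lo) / (hi - lo)             # shift to non-negative, normalize to [0, 1]
        out.append(math.floor(unit * (buckets - 1)) - buckets // 2)
    return out

dataset = [[-3.0, 0.5, 2.0], [1.0, -1.5, 4.0], [0.0, 0.0, 0.0]]
lo, hi = fit_range(dataset)
quantized = [quantize_l2(v, lo, hi) for v in dataset]

# Shifting every vector by the same constant leaves the quantized result unchanged,
# which is why subtracting the dataset minimum is safe for Euclidean distance.
shifted = [[x + 10.0 for x in v] for v in dataset]
lo_s, hi_s = fit_range(shifted)
quantized_shifted = [quantize_l2(v, lo_s, hi_s) for v in shifted]
print(quantized == quantized_shifted)
```

The scaling step multiplies all distances by roughly the same factor, so the nearest-neighbor ordering is approximately preserved even though the absolute distances change.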

Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Collaborator

@natebower natebower left a comment

@kolchfa-aws Please see my comments and changes and let me know if you have any questions or would like me to read any revisions. Thanks!

```

- However, if you intend to just use painless scripting or a k-NN score script, you only need to pass the dimension.
+ However, if you intend to just use Painless scripting or a k-NN score script, you only need to pass the dimension.
Collaborator

I would either replace "just" with "only" or delete "just".

Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
{: .note}

In [k-NN benchmarking tests](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool), the use of `byte` rather than `float` vectors resulted in a significant reduction in storage and memory usage while also improving indexing throughput and reducing query latency. Additionally, precision on recall was not greatly affected (note that recall can depend on various factors, such as the [quantization technique](#quantization-techniques) used and data distribution).
Collaborator

This reads as though there should be some words before "data distribution". Something like "such as the quantization technique used and the type of data distribution."

Collaborator Author

Reworded.
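
As a rough illustration of where the storage reduction mentioned in the paragraph above comes from, the sketch below compares raw vector payload sizes. It assumes 4 bytes per `float` dimension and 1 byte per `byte` dimension and ignores any index structures built on top of the vectors; the dataset size is hypothetical and the helper `raw_vector_bytes` is not part of any OpenSearch API.

```python
FLOAT_BYTES = 4  # one 32-bit float per dimension
BYTE_BYTES = 1   # one signed byte per dimension

def raw_vector_bytes(num_vectors, dimension, bytes_per_value):
    """Size of the raw vector values only, excluding any index overhead."""
    return num_vectors * dimension * bytes_per_value

# Hypothetical dataset: one million 128-dimensional vectors.
num_vectors, dimension = 1_000_000, 128
float_size = raw_vector_bytes(num_vectors, dimension, FLOAT_BYTES)
byte_size = raw_vector_bytes(num_vectors, dimension, BYTE_BYTES)
print(float_size, byte_size, float_size // byte_size)
```

Actual savings in a cluster depend on everything else stored alongside the vectors, so this figure is an upper bound on the reduction for the vector values themselves.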


## Lucene byte vector

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to save storage space. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
Collaborator

Something like "reduce the amount of storage space needed" instead of "save storage space"? To "save storage space" means to set it aside for another purpose.

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
kolchfa-aws and others added 2 commits August 29, 2023 13:40
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@kolchfa-aws kolchfa-aws merged commit de5f5ae into main Aug 29, 2023
@kolchfa-aws kolchfa-aws added the backport 2.9 PR: Backport label for 2.9 label Aug 29, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 29, 2023
* Add quantization techniquest and links to byte vector

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Added cosine similarity space type quantization

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Rewording

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Tech review feedback

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Tech review feedback

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Remove redundant line

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update _field-types/supported-field-types/knn-vector.md

Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Update _field-types/supported-field-types/knn-vector.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Editorial feedback

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit de5f5ae)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kolchfa-aws pushed a commit that referenced this pull request Aug 29, 2023
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
…ect#4893)

vagimeli pushed a commit that referenced this pull request Dec 21, 2023
@Naarcha-AWS Naarcha-AWS deleted the knn-field-update branch March 28, 2024 23:20
Labels

backport 2.9 PR: Backport label for 2.9

5 participants