Add quantization techniques and links to byte vector #4893
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
| {% include copy-curl.html %} | ||
| {% include copy-curl.html %} | ||
|
|
||
| ### Quantization techniques |
Can we also include the pseudocode used for the CosineSimilarity SpaceType, along with the reason for using a different technique for a different type of dataset (angular)? That would emphasize the importance of choosing a quantization technique based on the type of data.
@naveentatikonda Done. Please review when you get a chance.
@kolchfa-aws Added a few comments. Please take a look.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
|
|
||
| For Euclidean datasets, we recommend using a scalar quantization technique with L2 space type because Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same, which means $$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$. | ||
|
|
||
| The following example pseudocode illustrates scalar quantization for the L2 space type: |
We shouldn't explain it this way. Can you rephrase it to something like this:
The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean Datasets (uses L2 space type). Euclidean distance is shift invariant, if you shift both
@naveentatikonda To clarify the wording in parentheses, what uses L2 space type? The scalar quantization technique?
No, the Euclidean datasets use the L2 space type. The quantization technique or algorithm depends on the space type.
|
|
||
| For angular datasets, we recommend using a scalar quantization technique with cosine similarity because cosine distance is not shift invariant ($$cos(x, y) \neq cos(x-z, y-z)$$). | ||
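The shift-invariance argument can be checked numerically. The following is a minimal sketch, not part of the PR; the helper names (`euclidean`, `cosine`, `shift`) and the sample vectors are made up for illustration. It shifts two vectors by the same $$z$$ and compares both measures before and after the shift.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def shift(v, z):
    """Subtract the same offset vector z from v."""
    return [a - b for a, b in zip(v, z)]

# Hypothetical sample data: x and y are parallel, z is an arbitrary offset.
x, y, z = [3.0, 4.0], [6.0, 8.0], [1.0, 0.0]

l2_before = euclidean(x, y)
l2_after = euclidean(shift(x, z), shift(y, z))
cos_before = cosine(x, y)
cos_after = cosine(shift(x, z), shift(y, z))

print(l2_before, l2_after)    # the two L2 distances agree
print(cos_before, cos_after)  # the two cosine similarities differ
```

Because the L2 distance survives the shift, a quantizer for Euclidean data is free to move the data range before bucketing. A quantizer for angular data should leave the origin where it is.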
|
|
||
| The following example pseudocode illustrates scalar quantization for the cosine similarity space type: |
Similarly, we need to rephrase this:
The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Angular Datasets (uses cosine similarity space type). Cosine similarity is not shift invariant (
|
|
||
| return Byte(bval) | ||
|
|
||
| // For Negative Numbers |
Starting from `For Negative Numbers`, can we break it into a separate code block to make it easier to understand?
|
|
||
| return Byte(bval) | ||
|
|
||
| // For Negative Numbers |
nit:
| // For Negative Numbers | |
| # For Negative Numbers |
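The cosine similarity pseudocode under review appears in this thread only as fragments (`return Byte(bval)` and the `For Negative Numbers` comment). The sketch below shows one way a sign-aware scalar quantizer of that kind could look in Python. It is not the PR's pseudocode. The function names, the 127 and 128 bucket counts, the rounding rule, and the requirement that the training data has both a negative minimum and a positive maximum are all assumptions made for illustration.

```python
import math

def round_fraction(x):
    """Round a non-negative float: go up only when the fraction exceeds 0.5."""
    int_part = math.floor(x)
    return int_part + 1 if x - int_part > 0.5 else int_part

def quantize_cosine(val, data_min, data_max):
    """Map a float component to a signed byte without moving the origin.

    Assumes data_min < 0 < data_max (taken from the training dataset).
    Positive values scale into [0, 127]; negative values into [-128, 0].
    """
    if not data_min < 0 < data_max:
        raise ValueError("expected data_min < 0 < data_max")
    if val >= 0:
        # Positive branch: normalize by the dataset maximum, clip, then bucket.
        return round_fraction(min(val / data_max, 1.0) * 127)
    # Negative branch: normalize by the dataset minimum, clip, then bucket.
    return -round_fraction(min(val / data_min, 1.0) * 128)

# Hypothetical training range and a vector to quantize.
data_min, data_max = -4.0, 2.0
quantized = [quantize_cosine(v, data_min, data_max) for v in [2.0, 1.0, 0.0, -1.0, -4.0]]
print(quantized)
```

Handling the two signs in separate branches keeps zero mapped to zero and preserves the sign of every component, which is what matters when the measure is not shift invariant.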
| ### Quantization techniques | ||
|
|
||
| If your vectors are of type `float`, you need to first convert them to `byte` before ingesting the documents. This conversion is accomplished by _quantizing the dataset_---reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you're using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the [k-NN benchmarking test](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool) data for the [L2](#scalar-quantization-for-the-l2-space-type) and [cosine similarity](#scalar-quantization-for-the-cosine-similarity-space-type) space types. | ||
|
|
Can we add a note or statement saying that the quantization techniques below are for reference?
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
PR looks good. @kolchfa-aws Thanks for making the changes.
A few small comments, otherwise LGTM
|
|
||
| #### Scalar quantization for the L2 space type | ||
|
|
||
| The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets with the L2 space type. Euclidean distance is shift invariant. If you shift both $$x$$ and $$y$$ by the same $$z$$ then the distance remains the same ($$\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert$$). |
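As a rough illustration of why shift invariance matters here, the following Python sketch shifts the data range to start at zero, normalizes it, and buckets it into 256 signed-byte values. It is a sketch under stated assumptions rather than the benchmark code: the function names (`fit_range`, `quantize_l2`), the sample vectors, and the choice to clip out-of-range query values are all hypothetical.

```python
import math

def fit_range(dataset):
    """Return the global (min, max) over every component of the training vectors."""
    values = [v for vec in dataset for v in vec]
    return min(values), max(values)

def quantize_l2(vector, lo, hi, buckets=256):
    """Map float components to signed bytes in [-128, 127]."""
    if hi <= lo:
        raise ValueError("expected hi > lo")
    out = []
    for v in vector:
        clipped = min(max(v, lo), hi)          # query values may fall outside the training range
        normalized = (clipped - lo) / (hi - lo)  # shift to non-negative, scale into [0, 1]
        out.append(math.floor(normalized * (buckets - 1)) - buckets // 2)
    return out

# Hypothetical training vectors and one query vector.
dataset = [[-300.0, 12.5, 299.0], [40.0, -7.25, 300.0]]
lo, hi = fit_range(dataset)
indexed = [quantize_l2(vec, lo, hi) for vec in dataset]
query = quantize_l2([-350.0, 0.0, 350.0], lo, hi)
print(indexed, query)
```

The same `lo` and `hi` are applied at indexing time and at query time, so both sides are shifted by the same amount and relative L2 distances are approximately preserved.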
A shift invariant? Should we define shift invariant, or do we do so earlier?
It's defined in the following sentence.
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws Please see my comments and changes and let me know if you have any questions or would like me to read any revisions. Thanks!
| ``` | ||
|
|
||
| However, if you intend to just use painless scripting or a k-NN score script, you only need to pass the dimension. | ||
| However, if you intend to just use Painless scripting or a k-NN score script, you only need to pass the dimension. |
I would either replace "just" with "only" or delete "just".
| Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines. | ||
| {: .note} | ||
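To make the `lucene`-only restriction concrete, the sketch below builds an index-creation request body as a Python dictionary and adds a small range check for vectors before ingestion. It assumes the `data_type` mapping parameter from the byte vector documentation. The field name `my_vector`, the dimension, and the helper `is_valid_byte_vector` are hypothetical and are not taken from the PR.

```python
import json

# Hypothetical request body for creating an index with a byte vector field.
byte_vector_index = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {  # hypothetical field name
                "type": "knn_vector",
                "dimension": 3,
                "data_type": "byte",
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "lucene",  # byte vectors are supported only for lucene
                },
            }
        }
    },
}

def is_valid_byte_vector(vector, dimension):
    """Check the vector length and that every component is an integer in [-128, 127]."""
    return len(vector) == dimension and all(
        isinstance(v, int) and -128 <= v <= 127 for v in vector
    )

print(json.dumps(byte_vector_index, indent=2))
print(is_valid_byte_vector([-128, 0, 127], 3))
```

A check like this is one place to catch vectors that were never quantized, because float components fail the integer test before the document reaches the cluster.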
|
|
||
| In [k-NN benchmarking tests](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool), the use of `byte` rather than `float` vectors resulted in a significant reduction in storage and memory usage while also improving indexing throughput and reducing query latency. Additionally, precision on recall was not greatly affected (note that recall can depend on various factors, such as the [quantization technique](#quantization-techniques) used and data distribution). |
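The storage part of this claim follows from component size alone. The short sketch below compares the raw bytes needed for the vector components, assuming 32-bit floats versus 1-byte signed integers. The vector count and dimension are made-up numbers, `raw_vector_bytes` is a hypothetical helper, and the figures exclude graph and index overhead, so they do not reproduce the benchmark results.

```python
import struct

def raw_vector_bytes(num_vectors, dimension, type_code):
    """Bytes needed for the vector components alone (no graph or index overhead)."""
    return num_vectors * dimension * struct.calcsize(type_code)

# Hypothetical corpus: one million 128-dimensional vectors.
float_bytes = raw_vector_bytes(1_000_000, 128, "<f")  # 32-bit float components
byte_bytes = raw_vector_bytes(1_000_000, 128, "<b")   # signed-byte components

print(float_bytes, byte_bytes, float_bytes / byte_bytes)
```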
This reads as though there should be some words before "data distribution". Something like "such as the quantization technique used and the type of data distribution."
Reworded.
_search-plugins/knn/knn-index.md
Outdated
|
|
||
| ## Lucene byte vector | ||
|
|
||
| Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to save storage space. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector). |
Something like "reduce the amount of storage space needed" instead of "save storage space"? To "save storage space" means to set it aside for another purpose.
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add quantization techniques and links to byte vector
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Added cosine similarity space type quantization
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Tech review feedback
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Tech review feedback
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Remove redundant line
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _field-types/supported-field-types/knn-vector.md
  Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _field-types/supported-field-types/knn-vector.md
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Editorial feedback
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit de5f5ae)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add quantization techniques and links to byte vector
* Added cosine similarity space type quantization
* Rewording
* Tech review feedback
* Tech review feedback
* Remove redundant line
* Update _field-types/supported-field-types/knn-vector.md
* Apply suggestions from code review
* Update _field-types/supported-field-types/knn-vector.md
* Editorial feedback

---------

(cherry picked from commit de5f5ae)
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Add quantization techniques and links to byte vector
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.