Skip to content

[DOC] Asymmetric Distance Calculation and Random Rotation in k-NN #10650

@finnroblin

Description

@finnroblin

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. In 3.2 k-NN enhanced Disk Based Vector search to increase recall in binary indices. Asymmetric Distance Calculation (ADC) and Random Rotation (RR) are two separate user-configurable parameters that increase search quality (recall) at a slight performance penalty. The best performance/recall tradeoff is when they are used together which is why I'm clubbing both in a single documentation PR.

ADC and RR are each used within faiss binary indices. Random rotation is supported for 1 bit, 2 bit, and 4 bit quantization, while ADC is supported only for 1 bit quantization. For lower levels of quantization (2, 4 bits) the benefit of ADC is lesser so we do not support it.

The parameters can be specified like:



PUT vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
            "name": "hnsw",
            "engine": "faiss",
            "space_type": "l2",
            "parameters": {
              "encoder": {
                "name": "binary",
                "parameters": {
                  "bits": 1,
// new parameters
		  "random_rotation": true,
		  "enable_adc": true
                }
              }
            }
        }
      }
    }
  }
}

ADC PR: opensearch-project/k-NN#2733

RR PR: opensearch-project/k-NN#2718

Version: 3.2

What other resources are available? Related issue containing algorithm background: opensearch-project/k-NN#2714

Full explanation of both methods:
Random Rotation

One consequence of binary quantization is that each vector entry is given equal weight during the quantization process. This leads to undesirable effects if the scale of variance is different in different dimensions.

For illustration imagine we have a 2-dimensional data distribution where the first dimension (x axis) has small variance and the second dimension (y axis) has high variance. The position of a vector in the second dimension may have more information than its position in the first dimension, but binary quantization treats each dimension as equally important. However, we can rotate the distribution and ”smooth“ variance (information) from the second dimension into the first dimension. See below for an illustration.

Image

Beyond the above toy example, we have observed that image datasets like sift and gist respond well to random rotation. Intuitively this is expected since images typically have some high-information pixels and other pixels with less information. Without random rotation the embedding model may treat these pixels equally and lose more information upon 32x compression.

Asymmetric Distance Computation (ADC)

The idea behind ADC is to maintain a full-precision query vector, while rescaling it to have meaningful distance computations against binary-quantized document vectors. It’s asymmetric in the sense that our query vector is float32 type, but the document vectors are 1-bit binary types.

When we traverse through the hnsw graph, all of our distance computations are asymmetric: they compare the float query vector to binary index vectors with a custom distance calculation derived by the science team.

Essentially, ADC allows us to maintain more information about the query vector and get a higher quality search without a significant memory penalty. Naive binary quantization introduces error in two different stages: 1), when the document vectors are quantized during ingestion, and 2), when the query vectors are quantized during search.

While stage 1) is necessary to achieve memory savings due to the number of documents, stage 2) is less necessary for reducing memory requirements.
Essentially, ADC allows us to maintain the full precision of our query vectors and boost the quality of search without paying a significant memory penalty.

Metadata

Metadata

Assignees

Labels

Backlog - DEVDeveloper assigned to issue is responsible for creating PR.v3.2.0

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions