Skip to content

Extend dense_vector to support indexing vectors #78491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 5, 2021

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Sep 29, 2021

This PR extends the dense_vector type to allow vectors to be added to an ANN index:

"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm"
    }
  }
}

A description of the parameters:

  • index. Setting this to true indicates the field should be added to the ANN index. The values will be parsed as a KnnVectorField instead of a doc values field. By default index: false to provide a smooth transition from 7.x, where vectors are not indexed.
  • similarity. When index: true, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values are l2_norm and dot_product, which matches the Lucene options. (We decided to require similarity to be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)

Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and exists queries.

Relates to #78473.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Sep 29, 2021

Some notes on follow-ups:

  • Right now similarity can be l2_norm or dot_product, matching the Lucene options. I'd like to investigate whether it'd be better to support cosine similarity. I think the idea in Lucene is that dot product should be used to implement a fast cosine similarity, by normalizing all vectors to unit length beforehand. My concern is that dot product is not a true metric, and users can receive confusing results if they forget to normalize the vectors.
  • Text fields also have an option called similarity. I think this is fine, and even provides a nice analogy. But we'll need to adjust the docs to clarify that similarity doesn't just apply to text types.
  • It proved tricky to allow the HNSW parameters to be configured through the mapping. I'll tackle that in a future PR.

I also ran a quick benchmark and noticed that script scoring with indexed vectors is significantly slower than vectors based on doc values. For example on a small dataset QPS went from 455.143 to 273.596 when scoring all vectors. I don't think this is a big deal, since this feature is focused on ANN and not exact-match kNN. But it did raise some ideas for speeding up the distance computations used in Lucene.

@jtibshirani jtibshirani mentioned this pull request Sep 30, 2021
17 tasks
@jtibshirani jtibshirani marked this pull request as ready for review September 30, 2021 01:02
@pgomulka pgomulka added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Sep 30, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Sep 30, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

}

private Field parseKnnVector(DocumentParserContext context) throws IOException {
float[] vector = new float[dims];
Copy link
Contributor

@mayya-sharipova mayya-sharipova Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice (may be for future) to create float[] vector = new float[dims] once in DenseVectorFieldMapper constructor and they reuse it instead of creating a new array each time during parsing. Especially considering that Lucene's VectorValuesWriter would anyway make a separate copy: vectors.add(ArrayUtil.copyOfSubArray(vectorValue, 0, vectorValue.length)). But I don't know how if all this is done within a single thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can keep an eye on this in indexing benchmarks. Like you said, the indexing logic often copies field values, so I don't expect this array creation to be a bottleneck.

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani Thanks Julie, great work! This looks almost perfect to me. I've just left a couple of questions to clarify things.

One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it. This looks to be confusing, but may be good documentation should be enough to resolve this confusion.

      "index": true,
      "similarity": "l2_norm"


@Override
public void setNextDocId(int docId) throws IOException {
int currentDoc = in.docID();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice implementation of advanceExact!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a little tricky to get right, I wonder if it'd be helpful to add it to VectorValues in Lucene.

return new DenseVectorScriptDocValues(values, indexVersion, dims);
if (indexed) {
VectorValues values = reader.getVectorValues(field);
if (values == null || values == VectorValues.EMPTY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should values be always not null and always be not empty if a vector field is indexed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if there's a vector field in the mapping, the segment could have no documents with that field. Depending on the exact context, Lucene could return either null or an empty instance in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explanation! I can see how we can get null. For this may be later, we can do something similar to DocValues.getBinary that never returns null.

But I think we should never get VectorValues.EMPTY ? No?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene might return VectorValues.EMPTY e.g. if a field has been indexed with vectors and later all docs that had a vector indexed got merged away. Since the field exists in FieldInfos and has vectors enabled, it is illegal for Lucene to return null in that case.

import java.io.IOException;
import java.util.Objects;

public class KnnVectorFieldExistsQuery extends Query {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking of moving this to Lucene eventually.

Copy link
Contributor

@jpountz jpountz Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we enforce schema consistency, we should hopefully be able to have a single FieldExistsQuery that would look at doc values / norms / vectors depending on the FieldInfo to know how to iterate over documents that have a value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll revisit this at a later time.

@jtibshirani
Copy link
Contributor Author

Thanks @mayya-sharipova for reviewing, I tried to address all your comments.

One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it.

I agree it could be confusing. Another option is to rename similarity to something like index_similarity, or move it under a new index_options section. But I like the current naming, it is simple and corresponds nicely to similarity for text fields. Maybe we can just clearly document this? Going forward it may be less common to use vector script functions, in favor of ANN plus built-in ways for accessing vector similarity scores... so it could become less of a concern.

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani This overall LGTM!

Just one thing, I think in VectorDVLeafFieldData we should never get VectorValues.EMPTY? NO?

@jtibshirani
Copy link
Contributor Author

Just one thing, I think in VectorDVLeafFieldData we should never get VectorValues.EMPTY? NO?

We can indeed receive VectorValues.EMPTY (it popped up in tests). My understanding is that if a segment contains no documents with a field, then LeafReader#getVectorValues is allowed to return either null or VectorValues.EMPTY. This applies to other data structures too, the accessor methods can return either null or an empty instance.

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Oct 4, 2021

@jtibshirani Thanks for the explanation, Julie, makes sense! All is good from my side!

@jtibshirani jtibshirani merged commit 9ce1a29 into elastic:master Oct 5, 2021
@jtibshirani jtibshirani deleted the dense-vector branch October 5, 2021 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants