-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Extend dense_vector
to support indexing vectors
#78491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Some notes on follow-ups:
I also ran a quick benchmark and noticed that script scoring with indexed vectors is significantly slower than vectors based on doc values. For example on a small dataset QPS went from |
Pinging @elastic/es-search (Team:Search) |
} | ||
|
||
private Field parseKnnVector(DocumentParserContext context) throws IOException { | ||
float[] vector = new float[dims]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice (may be for future) to create float[] vector = new float[dims] once in DenseVectorFieldMapper constructor and they reuse it instead of creating a new array each time during parsing. Especially considering that Lucene's VectorValuesWriter
would anyway make a separate copy: vectors.add(ArrayUtil.copyOfSubArray(vectorValue, 0, vectorValue.length))
. But I don't know how if all this is done within a single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can keep an eye on this in indexing benchmarks. Like you said, the indexing logic often copies field values, so I don't expect this array creation to be a bottleneck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks Julie, great work! This looks almost perfect to me. I've just left a couple of questions to clarify things.
One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it. This looks to be confusing, but may be good documentation should be enough to resolve this confusion.
"index": true,
"similarity": "l2_norm"
|
||
@Override | ||
public void setNextDocId(int docId) throws IOException { | ||
int currentDoc = in.docID(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice implementation of advanceExact!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a little tricky to get right, I wonder if it'd be helpful to add it to VectorValues
in Lucene.
return new DenseVectorScriptDocValues(values, indexVersion, dims); | ||
if (indexed) { | ||
VectorValues values = reader.getVectorValues(field); | ||
if (values == null || values == VectorValues.EMPTY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should values
be always not null and always be not empty if a vector field is indexed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if there's a vector field in the mapping, the segment could have no documents with that field. Depending on the exact context, Lucene could return either null
or an empty instance in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation! I can see how we can get null
. For this may be later, we can do something similar to DocValues.getBinary
that never returns null.
But I think we should never get VectorValues.EMPTY
? No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene might return VectorValues.EMPTY
e.g. if a field has been indexed with vectors and later all docs that had a vector indexed got merged away. Since the field exists in FieldInfos
and has vectors enabled, it is illegal for Lucene to return null
in that case.
.../vectors/src/main/java/org/elasticsearch/xpack/vectors/query/DenseVectorScriptDocValues.java
Show resolved
Hide resolved
x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java
Show resolved
Hide resolved
import java.io.IOException; | ||
import java.util.Objects; | ||
|
||
public class KnnVectorFieldExistsQuery extends Query { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also thinking of moving this to Lucene eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we enforce schema consistency, we should hopefully be able to have a single FieldExistsQuery
that would look at doc values / norms / vectors depending on the FieldInfo
to know how to iterate over documents that have a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I'll revisit this at a later time.
Thanks @mayya-sharipova for reviewing, I tried to address all your comments.
I agree it could be confusing. Another option is to rename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani This overall LGTM!
Just one thing, I think in VectorDVLeafFieldData
we should never get VectorValues.EMPTY
? NO?
We can indeed receive |
@jtibshirani Thanks for the explanation, Julie, makes sense! All is good from my side! |
This PR extends the
dense_vector
type to allow vectors to be added to an ANN index:A description of the parameters:
index
. Setting this totrue
indicates the field should be added to the ANN index. The values will be parsed as aKnnVectorField
instead of a doc values field. By defaultindex: false
to provide a smooth transition from 7.x, where vectors are not indexed.similarity
. Whenindex: true
, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values arel2_norm
anddot_product
, which matches the Lucene options. (We decided to requiresimilarity
to be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and
exists
queries.Relates to #78473.