Description
Introduce a new field of type vector, on which vector calculations can be done during the rescoring phase.
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_feature": {
"type": "vector"
}
}
}
}
}
Indexing
Allow only a single value per document
Allow indexing of both dense and sparse vectors?
Dense form:
PUT my_index/_doc/1
{
"my_feature": [11.5, 10.4, 23.0]
}
Sparse form (represented as a map of dimension names to the values for the corresponding dimensions):
PUT my_index/_doc/1
{
"my_feature": {"1": 11.5, "5": 10.5, "101": 23.0}
}
Query and Rescoring
Introduce a special type of vector query:
"vector" : {
"field" : "my_feature",
"query_vector": {"1": 3, "5": 10.5, "101": 12}
}
This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:
- If a document doesn't have a vector value for field, a score of 0 is returned.
- If a document does have a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
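The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not an implementation: the function name cosine_score is made up for this example, vectors are plain dicts in the sparse form, and returning 0 for an all-zero vector is an assumption the proposal does not cover.

```python
import math

def cosine_score(doc_vector, query_vector):
    """Score one document; both vectors are sparse dicts of dimension -> value."""
    # A document without a value for the field scores 0.
    if not doc_vector:
        return 0.0
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * query_vector[dim]
              for dim, value in doc_vector.items() if dim in query_vector)
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    # Assumption: an all-zero vector also scores 0 instead of dividing by zero.
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)

doc = {"1": 11.5, "5": 10.5, "101": 23.0}
query = {"1": 3, "5": 10.5, "101": 12}
score = cosine_score(doc, query)      # document with a vector value
missing = cosine_score(None, query)   # document without a vector value
```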
POST /_search
{
"query" : {"<user-query>"},
"rescore" : {
"window_size" : 50,
"query" : {
"rescore_query" : {
"vector" : {
"field" : "my_feature",
"query_vector": {"1": 3, "5": 10.5, "101": 12}
}
}
}
}
}
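To illustrate what rescoring with window_size means for the result list, here is a rough Python sketch. It assumes that the original score and the vector score are simply added with weight 1 each, and that documents outside the window keep their original score and position. The rescore function and the similarity callable are hypothetical names used only for this example.

```python
def rescore(hits, similarity, window_size=50):
    """Rescore the top window_size hits.

    hits: list of (doc_id, original_score), sorted best first.
    similarity: callable mapping doc_id -> vector query score (hypothetical).
    """
    window, rest = hits[:window_size], hits[window_size:]
    # Assumption: original and rescore scores are added with weight 1 each.
    rescored = [(doc_id, score + similarity(doc_id)) for doc_id, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Documents outside the window keep their original score and order.
    return rescored + rest

hits = [("a", 2.0), ("b", 1.9), ("c", 1.0)]
similarities = {"a": 0.1, "b": 0.9, "c": 1.0}
result = rescore(hits, similarities.get, window_size=2)
```

With a window of 2, only "a" and "b" are rescored and may swap places; "c" is never touched, however similar its vector is to the query vector.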
Internal encoding
- Encoding of vectors:
  Internally, both dense and sparse vectors are encoded as a sorted hash?
  Thus a dense array is transformed:
  [4, 12] -> {0: 4, 1: 12}
  Keys are sorted, so we can iterate over them instead of calculating hashes.
- What should the values in vectors be?
  - floats?
  - smaller than floats? (loses some precision, but reduces index size)
- Vectors are encoded as binaries.
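
A possible shape for this encoding, sketched in Python under several assumptions: dimension names are integer-like, values are stored as 32-bit floats, and the binary layout is a flat sequence of (int32 dimension, float32 value) pairs. None of this is decided in the proposal, and all function names are hypothetical. The sketch shows why sorted keys help: the dot product becomes a merge-style walk over two sorted lists, with no hash lookups.

```python
import struct

def to_sorted_pairs(vector):
    """Normalize a dense list or sparse dict into (dimension, value) pairs sorted by dimension."""
    if isinstance(vector, dict):
        # Assumption: sparse dimension names are integer-like strings.
        items = ((int(dim), float(value)) for dim, value in vector.items())
    else:
        # Dense form: the array position becomes the dimension key.
        items = ((dim, float(value)) for dim, value in enumerate(vector))
    return sorted(items)

def dot_product(a, b):
    """Walk two sorted pair lists in step; only matching dimensions contribute."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return total

def encode(pairs):
    """Hypothetical binary layout: repeated big-endian (int32 dimension, float32 value)."""
    return b"".join(struct.pack(">if", dim, value) for dim, value in pairs)

def decode(blob):
    """Inverse of encode: read the 8-byte records back into a list of pairs."""
    return list(struct.iter_unpack(">if", blob))

dense = to_sorted_pairs([4, 12])
sparse = to_sorted_pairs({"101": 23.0, "1": 11.5, "5": 10.5})
query = to_sorted_pairs({"1": 3, "5": 10.5, "101": 12})
blob = encode(sparse)
```

Each pair takes 8 bytes in this layout. Switching the value to a 16-bit float (struct format "e") would cut that to 6 bytes per pair at the cost of precision, which is the trade-off raised in the question about value types.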