Description
The dense_vector type helps users work with vector 'embeddings' of unstructured data like text and images. This issue proposes adding a new 'bit vector' type and a 'hamming distance' script function to support this use case.
Dense vector fields allow for storing float vectors. For images, it also seems common to use bit vectors:
- In the paper Visual Search at Pinterest, the image descriptors are created by combining local and 'deep' features from a CNN, then binarizing them to obtain bit vectors.
- The paper Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines explores how online retailers could support visual search by modeling images using binary codes.
There has also been recent work on converting traditional text embeddings to bit vectors, for example Learning Compressed Sentence Representations for On-Device Text Processing.
Compared to using a dense_vector to represent the binary vectors, a dedicated 'bit vector' type would require less space and could support faster distance computations. Looking forward, it may also be possible to support retrieval based on bit vector distance through a specialized strategy (distinct from what we've considered for float vectors in #42326).
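To make the space and speed argument concrete, here is a minimal Python sketch of bit packing and Hamming distance. It is only an illustration, not a proposed implementation: the function names `pack_bits` and `hamming_distance` are hypothetical, and the most-significant-bit-first layout is an assumption rather than a decided encoding.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, most significant bit first.

    Hypothetical helper: the bit order is an assumption made for this sketch.
    """
    if any(b not in (0, 1) for b in bits):
        raise ValueError("bits must contain only 0 and 1")
    packed = bytearray((len(bits) + 7) // 8)  # one byte per 8 dimensions
    for i, b in enumerate(bits):
        if b:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def hamming_distance(a, b):
    """Count the bit positions where two packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same packed length")
    # XOR leaves a 1 exactly where the inputs differ; counting those 1s
    # (a population count) gives the Hamming distance.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


u = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]

packed_u = pack_bits(u)
packed_v = pack_bits(v)

print(len(packed_u))                         # 2 bytes for 10 dimensions
print(hamming_distance(packed_u, packed_v))  # 3
```

In this example a 10-dimensional vector takes 2 bytes when packed, whereas storing one float per dimension would take 40 bytes assuming 4-byte floats. The distance itself reduces to an XOR followed by a bit count, with no floating point arithmetic involved.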