Allow representing a document with multiple embeddings (dense vectors)

Currently, `dense_vector` is a single-valued field. When several embeddings are needed to represent one document, the document has to be repeated or split into multiple documents. This is cumbersome and introduces either duplicated data or extra complexity for the application that indexes documents and embeddings.

A common scenario is using embeddings to retrieve or rerank documents that have first been split into passages [1]. Each embedding represents one passage (of roughly paragraph length), and document ranking can use, for example, the score of the best-matching passage. Other approaches, such as ColBERT [2], represent text as a bag of term embeddings, in which case a passage itself is represented by multiple embeddings.
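To make the "best matching passage" idea concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the scoring function and uses made-up two-dimensional embeddings; the names `cosine`, `max_passage_score`, and the `docs` structure are illustrative only and do not correspond to any existing Elasticsearch API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_passage_score(query, passages):
    """Score a document by its best-matching passage embedding."""
    return max(cosine(query, p) for p in passages)

# Hypothetical data: each document holds one embedding per passage.
query = [1.0, 0.0]
docs = {
    "doc_a": [[0.0, 1.0], [1.0, 0.0]],
    "doc_b": [[1.0, 1.0], [0.0, 1.0]],
}

# Rank documents by the score of their best passage, highest first.
ranked = sorted(docs, key=lambda d: max_passage_score(query, docs[d]), reverse=True)
print(ranked)
```

With a multi-valued vector field, all passage embeddings could live on the single parent document instead of being spread over one document per passage.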

Some initial ideas to improve this:
 * A multi-valued `dense_vector` field.
 * Perhaps, as with `rank_features`, another field type that supports `n` vectors/embeddings: `dense_vectors`.
 * A `matrix` field type, since the embeddings for a document share the same dimensionality. This would also make it possible to perform matrix operations between documents, or between a static/query matrix and a document matrix, for ranking tasks. An alternative would be a `tensor` field type supporting, for example, 1, 2, or 3 dimensions, which is likely more appropriate than a `matrix`.
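
As an illustration of a ranking operation between a query matrix and a document matrix, the following sketch computes a ColBERT-style late-interaction score [2]: for each query term embedding, take the maximum similarity over all document term embeddings, then sum those maxima. It assumes dot-product similarity and toy data; `similarity_matrix` and `max_sim` are hypothetical helper names, not part of any existing field type or API.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(query_matrix, doc_matrix):
    """Pairwise dot products: one row per query vector, one column per document vector."""
    return [[dot(q, d) for d in doc_matrix] for q in query_matrix]

def max_sim(query_matrix, doc_matrix):
    """Sum, over query vectors, of the best similarity to any document vector."""
    sims = similarity_matrix(query_matrix, doc_matrix)
    return sum(max(row) for row in sims)

# Hypothetical data: rows are term embeddings sharing the same dimensionality.
query_matrix = [[1.0, 0.0], [0.0, 1.0]]
doc_matrix = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = max_sim(query_matrix, doc_matrix)
print(score)  # 1.5
```

The query and document matrices may have different numbers of rows; only the embedding dimensionality has to match, which is the property a `matrix` or `tensor` field type would need to enforce.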
 
[1] [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467), Section 3.3 Multi-Stage Ranking Architectures — From Passage to Document Ranking
[2] [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)

Allow representing a document with multiple embeddings (dense vectors) #72068

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Allow representing a document with multiple embeddings (dense vectors) #72068

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions