Allow representing a document with multiple embeddings (dense vectors) #72068

Open
@joshdevins

Description

Currently, the dense_vector field is single-valued. When multiple embeddings are needed to represent an entire document, the document must therefore be repeated or split into multiple documents. This is cumbersome and introduces either data duplication or added complexity for the application that indexes documents and embeddings.
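To make the duplication concrete, here is a minimal sketch of the workaround an indexing application might use today. The helper `split_into_passage_docs` and the field names `parent_id`, `passage_index`, and `embedding` are hypothetical, chosen only for illustration; they are not part of any Elasticsearch API.

```python
from typing import Any, Dict, List


def split_into_passage_docs(
    doc_id: str,
    fields: Dict[str, Any],
    passage_vectors: List[List[float]],
) -> List[Dict[str, Any]]:
    """Expand one logical document into one index document per embedding."""
    docs = []
    for i, vector in enumerate(passage_vectors):
        passage_doc = dict(fields)          # shared fields are copied into every document
        passage_doc["parent_id"] = doc_id   # hypothetical field linking back to the source document
        passage_doc["passage_index"] = i    # hypothetical field recording passage order
        passage_doc["embedding"] = vector   # one vector per document (single-valued field)
        docs.append(passage_doc)
    return docs


docs = split_into_passage_docs(
    "doc-1",
    {"title": "Example"},
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
)
```

Every shared field (here `title`) is stored once per passage, and the application has to group results by `parent_id` at query time to get back to a document-level result.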

A common scenario is using embeddings to retrieve or rerank documents that have first been split into passages [1]. Each embedding represents a passage (of roughly paragraph length), and document ranking can use, for example, the score of the best-matching passage. Other approaches, such as ColBERT [2], represent text using a bag of term embeddings, in which case a passage itself is represented by multiple embeddings.
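As a rough illustration of ranking by the best-matching passage, the sketch below scores a document as the maximum similarity between the query embedding and any of its passage embeddings. Cosine similarity is an assumption here (other similarity functions would work the same way), and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def max_passage_score(
    query: Sequence[float],
    passages: Sequence[Sequence[float]],
) -> float:
    """Score a document by its best-matching passage embedding."""
    if not passages:
        raise ValueError("a document needs at least one passage embedding")
    return max(cosine(query, passage) for passage in passages)


query = [1.0, 0.0]
passages = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
score = max_passage_score(query, passages)  # the second passage matches the query exactly
```

With a multi-valued field, this aggregation could happen inside a single document instead of across several split documents.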

Some initial ideas to improve this:

  • A multi-valued dense_vector field.
  • Perhaps, as with rank_features, another field type that supports n vectors/embeddings: dense_vectors.
  • A matrix field type, since the embeddings for a document share the same dimensionality. This would also make it possible to perform matrix operations between documents, or between a static/query matrix and a document matrix, for ranking tasks. An alternative would be to support tensors of 1, 2, or 3 dimensions (for example), which is likely more appropriate than a matrix.
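To illustrate the kind of query-matrix versus document-matrix operation the matrix idea would enable, here is a minimal sketch of ColBERT-style late interaction: for each query vector, take the maximum dot product over all document vectors, then sum those maxima. This is plain Python for illustration only; the function names are hypothetical and nothing here reflects an existing or planned Elasticsearch API.

```python
from typing import List

Matrix = List[List[float]]


def similarity_matrix(query: Matrix, doc: Matrix) -> Matrix:
    """Compute Q x D^T: one row per query vector, one column per document vector."""
    return [
        [sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc]
        for q_vec in query
    ]


def late_interaction_score(query: Matrix, doc: Matrix) -> float:
    """Sum, over query vectors, of the best dot product against any document vector."""
    if not query or not doc:
        raise ValueError("both matrices need at least one row")
    sims = similarity_matrix(query, doc)
    return sum(max(row) for row in sims)


query_matrix = [[1.0, 0.0], [0.0, 1.0]]            # 2 query term embeddings
doc_matrix = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]  # 3 document term embeddings
score = late_interaction_score(query_matrix, doc_matrix)
# Row maxima are 1.0 and 2.0, so the score is 3.0
```

Because every row of both matrices has the same dimensionality, the whole computation reduces to one matrix product followed by a row-wise max and a sum, which is why a matrix (or tensor) field type fits this use case.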

[1] Pretrained Transformers for Text Ranking: BERT and Beyond, Section 3.3 Multi-Stage Ranking Architectures — From Passage to Document Ranking
[2] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
