EF integration for generating embeddings for database vector data #34387

@roji

(fleshing out a concept suggested by @luisquintanilla)

When ingesting data into a vector database, the input data (e.g. text) needs to be converted to embeddings. When the embeddings are stored in a general-purpose database that's being accessed via EF (e.g. Azure SQL, Cosmos, PG), we can simplify the ingestion story by driving embedding generation from within EF itself. In EF terms, the user would e.g. set some string property, and when they call SaveChanges, embeddings would automatically get generated for it (in .NET, using a pre-configured model).

Note that some dedicated vector databases also allow for integrating embedding generation, exposing APIs for e.g. inserting text and performing embedding generation behind the scenes. This is very similar to this proposal, and suggests that there's value in doing it.

Notes

  • Data ingestion isn't just embedding generation; there's a whole potential pipeline of e.g. parsing a PDF file, sanitizing the output, performing various transformations on it, etc. - all that before actually generating the embeddings and saving them. This proposal does not cover all that: a .NET data ingestion pipeline is a significant undertaking that should be orthogonal to this effort, and should in particular also work well with the dedicated vector databases (Milvus, Qdrant...) where EF doesn't make much sense. However, once such a data ingestion pipeline exists, we definitely should see about integrating it nicely with EF as well.
  • None of this should be specific to any particular database; EF is en route to supporting vector search capabilities in 3 databases for now (Azure SQL, Cosmos, PostgreSQL) and more are probably on the way. Any EF provider that supports writing vector data (e.g. as ReadOnlyMemory<float>) should be compatible with this.
  • As embedding generation would happen with external, dedicated components, we'd likely want to make this a separate EF plugin/extension.
  • Note that in some scenarios it may be desirable to generate embeddings from multiple input properties into a single embedding property (e.g. the product title, description and comments all go into the same embedding for cross-property semantic search); so there isn't necessarily a one-to-one relationship between the input and the output here.
  • Typically, loading the embedding property when loading the entity isn't needed or desired, but EF always loads all properties (Support partial loading (i.e. not eagerly loading all scalar properties) #1387).

Implementation notes

  • A likely easy way to implement this would be via a SaveChanges interceptor; it would go over added entries, find the input properties and run the embedding generation process.
    • Note that interceptors are generally used by users, rather than by extensions; and we don't have an extension-facing extensibility point. But just setting up an interceptor from an extension may be fine.
  • Value converters are another possible direction, but have various limitations which probably make them unsuitable for this, at least currently.
  • We should keep this in mind when thinking about bulk insertion, which is probably also going to be a common ask with embedding data. EF doesn't yet have a bulk import API (tracked by Bulk import for efficient importing of data from the client into the database #27333).
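
To make the interceptor idea above a bit more concrete, here's a rough, framework-neutral sketch of the flow in Python. This is not EF code and doesn't reflect any actual or planned EF API: `EmbeddingMapping`, `EmbeddingInterceptor`, `saving_changes` and `fake_embedder` are all hypothetical names made up for illustration, and the embedder is a deterministic stand-in for a real model. It only shows the shape of the logic: go over the added entities, gather the configured input properties (possibly several), and write a single embedding to the output property.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Sequence

# Hypothetical embedder signature: text in, vector out.
Embedder = Callable[[str], List[float]]


@dataclass(frozen=True)
class EmbeddingMapping:
    """Hypothetical configuration: which inputs feed which embedding property."""
    entity_type: type
    input_properties: Sequence[str]  # may list several properties (many-to-one)
    output_property: str


class EmbeddingInterceptor:
    """Stand-in for a save-time hook; not an EF type."""

    def __init__(self, embedder: Embedder, mappings: Sequence[EmbeddingMapping]):
        self._embedder = embedder
        self._mappings = list(mappings)

    def saving_changes(self, added_entities: Iterable[object]) -> None:
        # Go over the added entities and fill in their embedding properties.
        for entity in added_entities:
            for mapping in self._mappings:
                if not isinstance(entity, mapping.entity_type):
                    continue
                # Concatenate all configured inputs into one text to embed.
                text = "\n".join(
                    str(getattr(entity, name)) for name in mapping.input_properties
                )
                setattr(entity, mapping.output_property, self._embedder(text))


def fake_embedder(text: str) -> List[float]:
    # Assumption: placeholder for a real embedding model, deterministic for testing.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


@dataclass
class Product:
    title: str
    description: str
    embedding: Optional[List[float]] = None


interceptor = EmbeddingInterceptor(
    fake_embedder,
    [EmbeddingMapping(Product, ("title", "description"), "embedding")],
)
product = Product("Mug", "Ceramic, 300 ml")
interceptor.saving_changes([product])  # product.embedding is now populated
```

The sketch only handles added entities, as in the note above; regenerating embeddings when an input property is modified, and batching calls to the model, are left open.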

/cc @LadyNaggaga @stephentoub
