When ingesting data into a vector database, the input data (e.g. text) needs to be converted to embeddings. When the embeddings are stored in a general-purpose database that's being accessed via EF (e.g. Azure SQL, Cosmos, PG), we can simplify the ingestion story by driving embedding generation from within EF itself. In EF terms, the user would e.g. set some string property, and when they call SaveChanges, embeddings would automatically get generated for it (in .NET, using a pre-configured model).
Note that some dedicated vector databases also allow for integrating embedding generation, exposing APIs for e.g. inserting text and performing embedding generation behind the scenes. This is very similar to this proposal, and suggests that there's value in doing it.
Notes
Data ingestion isn't just embedding generation; there's a whole potential pipeline of e.g. parsing a PDF file, sanitizing the output, performing various transformations on it, etc., all before actually generating the embeddings and saving them. This proposal does not cover all that: a .NET data ingestion pipeline is a significant undertaking that should be orthogonal to this effort, and should in particular also work well with the dedicated vector databases (Milvus, Qdrant...) where EF doesn't make much sense. However, once such a data ingestion pipeline exists, we definitely should see about integrating it nicely with EF as well.
This should all not be specific to any particular databases; EF is en route to supporting vector search capabilities in 3 databases for now (Azure SQL, Cosmos, PostgreSQL) and more are probably on the way. Any EF provider supporting writing vector data (e.g. as ReadOnlyMemory<float>) should be compatible with this.
As embedding generation would happen with external, dedicated components, we'd likely want to make this a separate EF plugin/extension.
Note that in some scenarios it may be desirable to generate embeddings from multiple input properties into a single embedding property (e.g. the product title, description and comments all go into the same embedding for cross-property semantic search); so there isn't necessarily a one-to-one relationship between the input and the output here.
Implementation notes
A likely easy way to implement this would be via a SaveChanges interceptor; it would go over added entries, find the input properties and run the embedding generation process.
Note that interceptors are generally used by users, rather than by extensions; and we don't have an extension-facing extensibility point. But just setting up an interceptor from an extension may be fine.
Value converters are another possible direction, but have various limitations which probably make them unsuitable for this, at least currently.
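To make the interceptor idea above concrete, here's a rough sketch of what it could look like. This is not a proposed design: `ITextEmbedder`, `Product` and `EmbeddingInterceptor` are hypothetical names made up for illustration, only the async save path is handled, and it assumes the provider in use can map a `ReadOnlyMemory<float>` property. It also shows the multiple-inputs case, with title and description feeding a single embedding.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Diagnostics;

// Hypothetical abstraction over whatever embedding model has been configured.
public interface ITextEmbedder
{
    Task<ReadOnlyMemory<float>> EmbedAsync(string text, CancellationToken cancellationToken);
}

// Hypothetical entity: Title and Description are the inputs, Embedding is the output.
public class Product
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public string Description { get; set; } = "";
    public ReadOnlyMemory<float> Embedding { get; set; }
}

public class EmbeddingInterceptor : SaveChangesInterceptor
{
    private readonly ITextEmbedder _embedder;

    public EmbeddingInterceptor(ITextEmbedder embedder) => _embedder = embedder;

    public override async ValueTask<InterceptionResult<int>> SavingChangesAsync(
        DbContextEventData eventData,
        InterceptionResult<int> result,
        CancellationToken cancellationToken = default)
    {
        if (eventData.Context is { } context)
        {
            // Materialize the added entries up front, before any property is modified.
            var added = context.ChangeTracker.Entries<Product>()
                .Where(e => e.State == EntityState.Added)
                .ToList();

            foreach (var entry in added)
            {
                // Several input properties are combined into one embedding.
                var input = $"{entry.Entity.Title}\n{entry.Entity.Description}";
                entry.Entity.Embedding = await _embedder.EmbedAsync(input, cancellationToken);
            }
        }

        return result;
    }
}
```

The interceptor would then be registered in the usual way, e.g. `optionsBuilder.AddInterceptors(new EmbeddingInterceptor(embedder))`. A real extension would presumably discover the input and output properties from model configuration rather than hard-coding an entity type, and would also need to decide what to do for synchronous SaveChanges and for modified (not just added) entries.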
(fleshing out a concept suggested by @luisquintanilla)
/cc @LadyNaggaga @stephentoub