Private AI: Vector Database Ingestion


VEP: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-milestone-25-vector-database-ingestion

With the rise in popularity of LLMs and RAG, we see VDK as a core component in getting data where it needs to be.


Example problem scenario:

A company has a powerful private LLM chatbot.
However, they want it to be able to answer questions using the latest versions of their Confluence docs, Jira tickets, etc.
Retraining every night on the latest tickets/docs is not feasible.
Instead, they opt to use RAG to improve the chatbot's responses.

This leaves them with the question: how do we populate the data?

Steps they need to complete (a sketch of these steps follows the list):

  1. Read data from Confluence/Jira
  2. Chunk it into paragraphs (or something similar)
  3. Embed each chunk into vector space
  4. Save the vector and the paragraph in the vector database
  5. Remove old information. For example, if we scrape Jira every hour and write details to the vector database, we need to make sure we clean up all embeddings/chunks that were generated from old versions of a ticket.
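
To make these steps concrete, here is a minimal sketch in plain Python. It assumes a hypothetical embedding HTTP endpoint, a psycopg2 connection, and the table layout proposed below; step 1 (reading from Confluence/Jira) is stubbed out:

```python
# Minimal sketch of steps 2-5; the embedding endpoint and its
# request/response shape ({"texts": [...]} -> {"embeddings": [...]}) are assumptions.
import psycopg2
import requests

EMBEDDING_API = "http://embedding-host:8080/embed"  # hypothetical endpoint


def chunk(text: str) -> list[str]:
    # Step 2: naive paragraph chunking; a real job may use a smarter splitter.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(chunks: list[str]) -> list[list[float]]:
    # Step 3: ask the remote embedding model for one vector per chunk.
    resp = requests.post(EMBEDDING_API, json={"texts": chunks})
    resp.raise_for_status()
    return resp.json()["embeddings"]


def ingest_document(conn, document_id: int, text: str) -> None:
    chunks = chunk(text)
    vectors = embed(chunks)
    with conn.cursor() as cur:
        # Step 5: drop embeddings generated from older versions of this document.
        cur.execute("DELETE FROM embeddings WHERE document_id = %s", (document_id,))
        # Step 4: save each vector together with its source chunk.
        for vector, text_chunk in zip(vectors, chunks):
            pgvector_literal = "[" + ",".join(str(x) for x in vector) + "]"
            cur.execute(
                "INSERT INTO embeddings (embedding, text_chunk, document_id) "
                "VALUES (%s, %s, %s)",
                (pgvector_literal, text_chunk, document_id),
            )
    conn.commit()
```

Deleting by document_id before inserting is one simple way to satisfy step 5: each run replaces all chunks derived from the previous version of a document.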

Our goal

We want to template this.
We will build a data job in VDK which reads data from Confluence or Jira and writes it to a DSM Postgres instance with pgvector enabled. An embedding model will run on a different machine and will be exposed through an API.
We will make requests to that API to create embeddings for us.
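
As a rough sketch of what one step of such a data job could look like (the embedding API's request/response shape and the embedding_api_url property name are assumptions; fetching from Confluence/Jira is stubbed out):

```python
# 20_ingest_pages.py - a sketch of one VDK data job step (hypothetical file name).
import requests
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Stub for step 1: in the real job these would come from Confluence/Jira APIs.
    pages = [{"id": 15, "body": "in this document blah..."}]

    for page in pages:
        chunks = [p for p in page["body"].split("\n\n") if p.strip()]
        resp = requests.post(
            job_input.get_property("embedding_api_url"),  # configured per deployment
            json={"texts": chunks},  # assumed request shape
        )
        resp.raise_for_status()
        for vector, text_chunk in zip(resp.json()["embeddings"], chunks):
            # VDK's ingestion framework sends the payload to the configured
            # Postgres/pgvector target.
            job_input.send_object_for_ingestion(
                payload={
                    "embedding": "[" + ",".join(str(x) for x in vector) + "]",
                    "text_chunk": text_chunk,
                    "document_id": page["id"],
                },
                destination_table="embeddings",
            )
```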

Once this data job is running, we will create a template from it which we think customers will be able to adapt to meet their use cases.

Proposed database table solution

| embedding | text_chunk | document_id |
| --- | --- | --- |
| [1,2,3,4,5,6] | in this document blah... | 15 |
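
A minimal sketch of creating this table, assuming psycopg2 and a 1536-dimensional embedding model (the actual dimension depends on the model behind the API):

```python
import psycopg2

# Hypothetical connection string; in practice this points at the DSM Postgres instance.
conn = psycopg2.connect("dbname=vectordb user=vdk")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # enable pgvector
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS embeddings (
            embedding   vector(1536),  -- pgvector column sized to the model
            text_chunk  TEXT,
            document_id BIGINT
        )
        """
    )
conn.commit()
```

Keeping document_id as a plain column is what makes step 5 cheap: cleaning up an updated document is a single DELETE FROM embeddings WHERE document_id = ... before re-inserting its new chunks.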

Prerequisite reading:

Learning materials
