Private AI: Vector Database Ingestion


VEP: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-milestone-25-vector-database-ingestion

With the rise in popularity of LLMs and RAG, we see VDK as a core component in getting data where it needs to be.


Example problem scenario:

A company has a powerful private LLM chatbot.
However, they want it to be able to answer questions using the latest versions of their Confluence docs, Jira tickets, etc.
Retraining every night on the latest tickets/docs is not feasible.
Instead, they opt to use RAG to improve the chatbot's responses.

This leaves them with the question: how do we populate the data?

Steps they need to complete (a sketch of these steps follows the list):

  1. Read data from Confluence/Jira
  2. Chunk it into paragraphs (or something similar)
  3. Embed each chunk into vector space
  4. Save the vector and the paragraph in the vector database
  5. Remove old information. For example, if we scrape Jira every hour and write details to the vector database, we need to make sure we clean up all embeddings/chunks that were generated from old versions of a ticket.
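
To make these steps concrete, here is a minimal sketch in plain Python. It assumes a hypothetical embedding HTTP endpoint, a psycopg2 connection, and the table layout proposed below; step 1 (reading from Confluence/Jira) is stubbed out:

```python
# Minimal sketch of steps 2-5; the embedding endpoint and its
# request/response shape ({"texts": [...]} -> {"embeddings": [...]}) are assumptions.
import psycopg2
import requests

EMBEDDING_API = "http://embedding-host:8080/embed"  # hypothetical endpoint


def chunk(text: str) -> list[str]:
    # Step 2: naive paragraph chunking; a real job may use a smarter splitter.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(chunks: list[str]) -> list[list[float]]:
    # Step 3: ask the remote embedding model for one vector per chunk.
    resp = requests.post(EMBEDDING_API, json={"texts": chunks})
    resp.raise_for_status()
    return resp.json()["embeddings"]


def ingest_document(conn, document_id: int, text: str) -> None:
    chunks = chunk(text)
    vectors = embed(chunks)
    with conn.cursor() as cur:
        # Step 5: drop embeddings generated from older versions of this document.
        cur.execute("DELETE FROM embeddings WHERE document_id = %s", (document_id,))
        # Step 4: save each vector together with its source chunk.
        for vector, text_chunk in zip(vectors, chunks):
            pgvector_literal = "[" + ",".join(str(x) for x in vector) + "]"
            cur.execute(
                "INSERT INTO embeddings (embedding, text_chunk, document_id) "
                "VALUES (%s, %s, %s)",
                (pgvector_literal, text_chunk, document_id),
            )
    conn.commit()
```

Deleting by document_id before inserting is one simple way to satisfy step 5: each run replaces all chunks derived from the previous version of a document.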

Our goal

We want to template this.
We will build a data job in VDK which reads data from Confluence or Jira and writes it to a DSM Postgres instance with pgvector enabled. An embedding model will run on a different machine and will be exposed through an API.
We will make requests to that API to create embeddings for us.
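
As a rough sketch of what one step of such a data job could look like (the embedding API's request/response shape and the embedding_api_url property name are assumptions; fetching from Confluence/Jira is stubbed out):

```python
# 20_ingest_pages.py - a sketch of one VDK data job step (hypothetical file name).
import requests
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Stub for step 1: in the real job these would come from Confluence/Jira APIs.
    pages = [{"id": 15, "body": "in this document blah..."}]

    for page in pages:
        chunks = [p for p in page["body"].split("\n\n") if p.strip()]
        resp = requests.post(
            job_input.get_property("embedding_api_url"),  # configured per deployment
            json={"texts": chunks},  # assumed request shape
        )
        resp.raise_for_status()
        for vector, text_chunk in zip(resp.json()["embeddings"], chunks):
            # VDK's ingestion framework sends the payload to the configured
            # Postgres/pgvector target.
            job_input.send_object_for_ingestion(
                payload={
                    "embedding": "[" + ",".join(str(x) for x in vector) + "]",
                    "text_chunk": text_chunk,
                    "document_id": page["id"],
                },
                destination_table="embeddings",
            )
```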

Once this data job is running, we will create a template from it which we think customers will be able to adapt to meet their use cases.

Proposed database table solution

| embedding | text_chunk | document_id |
| --- | --- | --- |
| [1,2,3,4,5,6] | in this document blah... | 15 |
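
A minimal sketch of creating this table, assuming psycopg2 and a 1536-dimensional embedding model (the actual dimension depends on the model behind the API):

```python
import psycopg2

# Hypothetical connection string; in practice this points at the DSM Postgres instance.
conn = psycopg2.connect("dbname=vectordb user=vdk")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # enable pgvector
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS embeddings (
            embedding   vector(1536),  -- pgvector column sized to the model
            text_chunk  TEXT,
            document_id BIGINT
        )
        """
    )
conn.commit()
```

Keeping document_id as a plain column is what makes step 5 cheap: cleaning up an updated document is a single DELETE FROM embeddings WHERE document_id = ... before re-inserting its new chunks.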

Prerequisite reading:

Learning materials
