With the rise in popularity of LLMs and retrieval-augmented generation (RAG), we see VDK as a core component for getting data where it needs to be.
Example problem scenario:
A company has a powerful private LLM chatbot. However, they want it to answer questions using the latest versions of their Confluence docs, Jira tickets, etc. Retraining every night on the latest tickets/docs is not feasible, so they opt to use RAG to improve the chatbot's responses. This leaves them with the question: how do we populate the data?
Steps they need to complete:
- Read data from Confluence/Jira.
- Chunk it into paragraphs (or similar units).
- Embed each chunk into vector space.
- Save the vector and its paragraph in a vector database.
- Remove stale information. For example, if we scrape Jira every hour and write details to the vector database, we need to clean up all embeddings/chunks that were generated from older versions of each ticket (see the sketch after this list).
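A minimal sketch of the chunking and cleanup steps, assuming a `doc_chunks` table like the one proposed below and a psycopg2-style cursor. The blank-line splitting heuristic and the `%s::vector` cast are illustrative choices, not settled design:

```python
def chunk_paragraphs(text: str, min_len: int = 40) -> list[str]:
    # Naive chunking: split on blank lines and drop tiny fragments.
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) >= min_len]


def refresh_document(cursor, document_id: str,
                     paragraphs: list[str],
                     embeddings: list[list[float]]) -> None:
    # Delete-then-insert keeps the vector store consistent with the source:
    # embeddings produced from an older version of the document are removed
    # before the fresh ones are written.
    cursor.execute("DELETE FROM doc_chunks WHERE document_id = %s", (document_id,))
    for paragraph, embedding in zip(paragraphs, embeddings):
        cursor.execute(
            # pgvector accepts a '[1,2,3]' text literal, so str(embedding)
            # plus an explicit ::vector cast avoids a custom type adapter.
            "INSERT INTO doc_chunks (embedding, text_chunk, document_id) "
            "VALUES (%s::vector, %s, %s)",
            (str(embedding), paragraph, document_id),
        )
```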
Our goal
We want to template this. We will build a data job in VDK that reads data from Confluence or Jira and writes it to a DSM Postgres instance with pgvector enabled. An embedding model will run on a different machine, exposed through an API, and we will make requests to that API to create embeddings for us. Once this data job is running, we will turn it into a template that we expect customers will be able to adapt to their own use cases (see the sketch below).
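As a rough illustration, the embed-and-write part of such a data job could be a VDK step like the one below. `fetch_confluence_pages` is a hypothetical placeholder for whichever Confluence/Jira client we adopt, `chunk_paragraphs` is the helper from the earlier sketch, and the embedding service URL and its request/response shape are assumptions about the external API:

```python
import requests
from vdk.api.job_input import IJobInput

# Hypothetical endpoint for the embedding model running on another machine.
EMBEDDING_API_URL = "http://embedding-service.internal/embed"


def fetch_confluence_pages():
    # Placeholder for a real Confluence/Jira client; yields sample data here.
    yield ("DOC-15", "In this document blah...\n\nAnother paragraph of text.")


def run(job_input: IJobInput):
    for document_id, text in fetch_confluence_pages():
        for paragraph in chunk_paragraphs(text):
            # Ask the remote model server to embed this chunk; the JSON
            # field names are assumed, not a known API contract.
            resp = requests.post(EMBEDDING_API_URL, json={"input": paragraph})
            resp.raise_for_status()
            embedding = resp.json()["embedding"]

            # Hand the row to VDK's ingestion pipeline, targeting the
            # pgvector-backed table proposed below. How the vector column
            # is serialized depends on the configured ingestion plugin.
            job_input.send_object_for_ingestion(
                payload={
                    "embedding": str(embedding),
                    "text_chunk": paragraph,
                    "document_id": document_id,
                },
                destination_table="doc_chunks",
            )
```

Using `send_object_for_ingestion` rather than hand-written INSERTs lets VDK handle batching and delivery; the cleanup pass from the earlier sketch would run before this step for each changed document.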
Proposed database table solution:

| embedding | text chunk | document id |
| --- | --- | --- |
| [1,2,3,4,5,6] | in this document blah... | 15 |
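This maps onto a pgvector-enabled Postgres table. A minimal schema-setup step might look like the sketch below; the table and column names mirror the proposal above (with underscores), and the vector dimension of 384 is an assumption that must match the embedding model's output size:

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Enable the pgvector extension and create the chunk table.
    job_input.execute_query("CREATE EXTENSION IF NOT EXISTS vector")
    job_input.execute_query(
        """
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id BIGSERIAL PRIMARY KEY,
            embedding vector(384),  -- dimension must match the model's output
            text_chunk TEXT,
            document_id TEXT
        )
        """
    )
```

At query time the chatbot side can then retrieve the nearest chunks with pgvector's distance operators, e.g. `SELECT text_chunk FROM doc_chunks ORDER BY embedding <-> '[...]'::vector LIMIT 5`.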
Prerequisite reading: Learning materials
Tracked issues in vmware/versatile-data-kit (all currently open):
- #3018
- #3012
- #3010
- #2994
- #3003
- #3014
- #2997
- #2998