Vector Embedding Brainstorming #14
Shubham-Khichi started this conversation in Ideas
Replies: 1 comment 3 replies
PDF is a waste of space and hard to parse. Unless it is for documentation of polished products, PDFs are generally a bad idea. JSON, YAML, Markdown, etc. all have tradeoffs, but I am generally optimistic about lightweight markup.
You all know that I'm building DevDocs not only for developers but also for folks who want to get into LLM tech. One of the primary goals of our company's products is Simplicity Of Use: the "how can we build a product that even a grandma can use" kind of simplicity.
Every decision is made to keep DevDocs and our other products easy to use. Every feature is designed and refined so that after a few iterations it becomes super intuitive.
Vector databases, our third most requested feature, conflict with that notion of simplicity. Chunking, embedding, and setting up a vector DB, on top of configuring so many parameters, makes the process difficult. Putting those parameters in users' hands also means folks can play around, mess up a setting, and end up with data that the LLM no longer pulls properly.
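To make the parameter problem concrete, here is a minimal sketch of a fixed-size chunker with only two of those knobs, `chunk_size` and `overlap`. The function, the sample page text, and the settings are all made up for illustration and are not DevDocs code. It shows how the choice of values decides whether an explanation and its example stay in the same chunk.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical documentation page: one explanation followed by its example.
page = (
    "To authenticate, send your API key in the Authorization header. "
    "Example: curl -H 'Authorization: Bearer KEY' https://api.example.com/v1/items"
)


def keeps_page_together(chunks: list[str]) -> bool:
    """True if a single chunk holds both the start of the explanation
    and the end of the example."""
    return any("To authenticate" in c and "v1/items" in c for c in chunks)


if __name__ == "__main__":
    for size, overlap in [(40, 0), (80, 10), (200, 20)]:
        chunks = chunk_text(page, size, overlap)
        print(size, overlap, len(chunks), keeps_page_together(chunks))
```

With this sample text, only the largest setting keeps the explanation and the example in one chunk. A user who picks a smaller value gets retrieval results that silently drop half of the concept.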
So traditional vector storage methods will fail us. I am leaning towards using ColPali to bypass the complex structure of websites and context management, and to make sure the embeddings are done right.
Here is the thought process:
1. We will convert the website into PDF (Markdown and JSON formats will still be available if you want to create datasets for fine-tuning).
2. A user enters their vector storage API details in the UI once, and they stay saved for the lifetime of their account.
3. Provide a URL and let DevDocs spider and scrape it, load it into MCP, and create Markdown, JSON, and PDF versions.
4. With the PDF version, DevDocs uses a local vision model with ColPali's capabilities to embed each entire page of technical documentation into the vector storage, so that CONTEXT is not lost during retrieval.
5. Victory
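The steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: every name here (`VectorStoreConfig`, `PageRecord`, `InMemoryStore`, `embed_page_image`, `index_document`) is hypothetical, the store is an in-memory stand-in for whichever vector storage the user configures, and the embedding function returns dummy vectors in place of the local vision model.

```python
from dataclasses import dataclass, field


@dataclass
class VectorStoreConfig:
    """Hypothetical per-account settings, entered once in the UI and reused."""
    endpoint: str
    api_key: str


@dataclass
class PageRecord:
    """One record per rendered PDF page; the page is never split further."""
    source_url: str
    page_number: int
    vectors: list[list[float]]  # several vectors per page, not one per chunk


@dataclass
class InMemoryStore:
    """Stand-in for the user's vector storage (assumption, not a real client)."""
    config: VectorStoreConfig
    records: list[PageRecord] = field(default_factory=list)

    def add(self, record: PageRecord) -> None:
        self.records.append(record)


def embed_page_image(page_image: bytes) -> list[list[float]]:
    """Placeholder for the local vision model: returns one dummy 2-d vector
    per 4-byte slice so the sketch runs without any model installed."""
    slices = [page_image[i:i + 4] for i in range(0, len(page_image), 4)]
    return [[float(len(s)), float(sum(s) % 256)] for s in slices]


def index_document(url: str, page_images: list[bytes], store: InMemoryStore) -> int:
    """Embed every page as a whole and store one record per page."""
    for number, image in enumerate(page_images, start=1):
        store.add(PageRecord(url, number, embed_page_image(image)))
    return len(page_images)


if __name__ == "__main__":
    store = InMemoryStore(VectorStoreConfig("https://vectors.example.com", "YOUR_KEY"))
    count = index_document("https://example.com/docs", [b"page one", b"page two"], store)
    print(count, [r.page_number for r in store.records])
```

The point of the design is that the user never sees a chunk size or overlap setting. The only unit of indexing is the page.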
The reason for this approach is that technical documents are already structured by page, meaning that one concept is explained on one page. If we embed the whole page, the concept and its relevant examples are not lost.
What are your thoughts on this? Suggestions to enhance it, or a better solution, are always welcome.
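As far as I understand ColPali's published design, a page is represented by many patch vectors and scored against the query with ColBERT-style late interaction (MaxSim): each query vector takes its best match among the page's vectors, and those maxima are summed. The sketch below implements that scoring rule in plain Python with tiny made-up vectors. It is not the ColPali library, and the page names and numbers are invented for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: list[list[float]],
                 page_vectors: list[list[float]]) -> float:
    """Late-interaction score: for each query vector take the best-matching
    page vector, then sum those maxima over the whole query."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(query_vectors: list[list[float]],
               pages: dict[str, list[list[float]]]) -> list[str]:
    """Return page ids ordered from best to worst score."""
    return sorted(pages, key=lambda pid: maxsim_score(query_vectors, pages[pid]),
                  reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "auth-page": [[0.9, 0.1], [0.2, 0.8]],
        "intro-page": [[0.5, 0.5], [0.4, 0.3]],
    }
    print(rank_pages(query, pages))
```

Because the score is computed per page, the unit that comes back from retrieval is the whole page, so the concept and its examples arrive together.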