Vector Embedding Brainstorming #14

Shubham-Khichi · 2025-01-24T16:55:39Z

Shubham-Khichi
Jan 24, 2025
Maintainer

You all know that I'm building DevDocs not only for developers but also for folks who want to get into LLM tech. One of the primary goals of our company's products is Simplicity of Use: the "how can we build a product that even a grandma can use?" kind of simplicity.

Every decision is made to keep DevDocs and our other products easy to use. Every feature is designed and refined so that after a few iterations it becomes super intuitive.

Vector databases, our 3rd most requested feature, conflict with this notion of simplicity of use. Chunking, embedding, and setting up a vector DB are inherently complex, and configuring so many parameters on top of that makes the process difficult. Not to mention that putting these parameters in users' hands means someone will sometimes mess up a variable, and then the LLM no longer pulls the data properly.

So traditional vector storage methods will fail us. I am leaning towards using ColPali to sidestep the complex structure of websites and context management, and to make sure the embeddings come out right.
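For anyone unfamiliar with how ColPali-style retrieval ranks pages: each page is stored as many vectors (one per image patch), and a query is scored against a page by late interaction, i.e. each query vector takes its best-matching patch and those maxima are summed. Below is a minimal sketch of that scoring rule only. The two-dimensional vectors and the page names are made-up toy data, not real model output.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for every query vector, keep the similarity of
    its best-matching page vector, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data (hypothetical): two query-token vectors and two candidate pages,
# each page represented by several patch vectors.
query: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]
pages: Dict[str, List[Vector]] = {
    "page_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_b": [[0.1, 0.1], [0.3, 0.2]],
}

scores = {name: maxsim_score(query, vecs) for name, vecs in pages.items()}
best_page = max(scores, key=scores.get)
```

The point for DevDocs is that the unit being ranked is the whole page, so there is no chunk size or overlap for the user to tune.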

Here is the thought process:

1. We convert the website into PDF (Markdown and JSON formats will still be available if you want to create datasets for fine-tuning).
2. The user enters their vector storage API details in the UI, and these persist for the lifetime of their account.
3. The user provides a URL, and DevDocs spiders it, scrapes it, loads it into MCP, and creates Markdown, JSON, and PDF formats.
4. With the PDF version, DevDocs uses a local vision model with ColPali's capabilities to embed each entire page of technical documentation into the vector storage, so that CONTEXT is not lost during retrieval.
5. Victory
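The steps above can be sketched roughly as follows. This is only an outline of the flow under stated assumptions, not DevDocs code: `crawl`, `export_formats`, `embed_page`, `ingest`, and the in-memory `VectorStore` are all hypothetical names, the crawl returns canned pages, and the "vision model" is replaced by a hash-based placeholder so the sketch runs without any dependencies.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    url: str
    text: str


@dataclass
class VectorStore:
    """In-memory stand-in for the user's configured vector storage (hypothetical)."""
    records: Dict[str, List[float]] = field(default_factory=dict)

    def upsert(self, key: str, vector: List[float]) -> None:
        self.records[key] = vector


def crawl(start_url: str) -> List[Page]:
    # Placeholder: a real crawl would spider and scrape the site.
    return [
        Page(f"{start_url}/intro", "Intro: what the library does."),
        Page(f"{start_url}/api", "API: functions and examples."),
    ]


def export_formats(page: Page) -> Dict[str, str]:
    # Markdown and JSON stay available for dataset building; the PDF entry
    # is a placeholder string standing in for a rendered page.
    return {
        "md": f"# {page.url}\n\n{page.text}",
        "json": json.dumps({"url": page.url, "text": page.text}),
        "pdf": f"<rendered PDF of {page.url}>",
    }


def embed_page(rendered_pdf: str, dim: int = 8) -> List[float]:
    # Placeholder for the local vision model: a deterministic pseudo-embedding
    # derived from a hash, so one vector is produced per whole page.
    digest = hashlib.sha256(rendered_pdf.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def ingest(start_url: str, store: VectorStore) -> int:
    """Crawl, export all formats, and store one embedding per page."""
    pages = crawl(start_url)
    for page in pages:
        formats = export_formats(page)
        store.upsert(page.url, embed_page(formats["pdf"]))
    return len(pages)


store = VectorStore()
count = ingest("https://example.com/docs", store)
```

Note that the user-facing surface is just a URL plus the storage details; chunking and embedding parameters never appear.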

The reason for this approach is that technical documentation is already structured into pages: one concept is explained on one page. If we embed the whole page, the concept and its relevant examples stay together.
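A small illustration of that argument, using a made-up documentation page and a naive fixed-size chunker (both hypothetical, chosen only to show the failure mode): once the page is cut into chunks, no single chunk holds both the concept and its example, while the page as a unit does.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Naive chunker: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def units_with_both(units: List[str], concept: str, example: str) -> List[str]:
    """Return the retrieval units that contain both the concept and its example."""
    return [u for u in units if concept in u and example in u]


# Hypothetical one-concept documentation page.
page = (
    "Rate limiting: each API key may send 100 requests per minute. "
    "Example: call client.get('/status') and retry after the Retry-After delay."
)

chunks = fixed_size_chunks(page, 60)

# Chunk-level units: the concept and the example end up in different chunks.
chunk_hits = units_with_both(chunks, "Rate limiting", "Retry-After")

# Page-level unit: both are retrieved together.
page_hits = units_with_both([page], "Rate limiting", "Retry-After")
```

Smarter chunkers with overlap reduce this problem, but they bring back exactly the parameters we are trying to keep away from users.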

What are your thoughts on this? Suggestions for enhancements or a better solution are always welcome.

TomLucidor · 2025-02-28T02:03:08Z

TomLucidor
Feb 28, 2025

PDF is a waste of space and hard to parse. Unless it is for documentation of polished products, PDFs are generally a bad idea. JSON, YAML, Markdown, etc. all have tradeoffs, but I am generally optimistic about lightweight markup.

3 replies

Shubham-Khichi Mar 4, 2025
Maintainer Author

I agree with you on PDFs, since they were made for humans, not machines. But a lot of human-written data is stored in PDFs (books and other content), so there should be a feature to digest it. Luckily, OCR and embedding technology have improved a lot at parsing data from PDFs.

TomLucidor Mar 4, 2025

If you are talking about raw data ingestion/extraction, this makes more sense. Don't generate PDFs just to re-parse them.

Shubham-Khichi Mar 4, 2025
Maintainer Author

> If you are talking about raw data ingestion/extraction, this makes more sense.

Yes, raw data extraction and storage.
