Extend the vector database. #2

karthik18495 · 2024-10-01T21:44:04Z

GitHub Issue: Extend Vector Database with Public and Indico Pages for EIC Information

Issue Title:
Extend the Vector Database to Include Information from Public and Indico Pages for EIC

Description:

The Electron-Ion Collider (EIC) project would benefit from expanding the existing vector database to include data from public and Indico pages. This will enhance the system's ability to retrieve relevant documents and presentations for users. By incorporating these sources, we can provide a more comprehensive dataset for retrieval and improve the contextual quality of the responses.

This issue proposes:

- Collecting presentations and documents from EIC-related Indico events and public web pages.
- Preprocessing the collected documents (PDFs, text, HTML) and converting them into embeddings.
- Storing these embeddings along with metadata in the current vector database.

Use Case:

Users can query the extended vector database for specific EIC-related documents, presentations, or meeting notes, and discover both internal material and public information drawn from Indico and other public sources.

Tasks:

Data Collection:
- Set up scripts to fetch data from Indico using its API, extracting event details, presentations, and associated documents.
- Scrape public pages relevant to EIC (e.g., documentation pages, wikis) for documents, presentations, and other useful content.
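
A minimal sketch of the Indico fetch step, assuming the instance exposes Indico's HTTP export endpoint (`/export/categ/<id>.json`). The base URL, category ID, and the response fields read here (`results`, `startDate`, `url`) are assumptions to check against the target instance; authentication and the actual HTTP request are left out.

```python
from urllib.parse import urlencode


def build_category_export_url(base_url, category_id, date_from, date_to,
                              detail="contributions"):
    """Build an export URL for all events in one Indico category.

    Hypothetical helper: the path layout follows Indico's HTTP export API,
    but verify it against the instance you are targeting.
    """
    query = urlencode({"from": date_from, "to": date_to, "detail": detail})
    return f"{base_url.rstrip('/')}/export/categ/{category_id}.json?{query}"


def summarize_events(payload):
    """Reduce a decoded export response to the fields needed for ingestion.

    Assumes events are listed under "results"; missing fields fall back to
    empty strings instead of raising.
    """
    events = []
    for item in payload.get("results", []):
        start = item.get("startDate") or {}
        events.append({
            "id": str(item.get("id", "")),
            "title": item.get("title", ""),
            "date": start.get("date", ""),
            "url": item.get("url", ""),
        })
    return events
```

The URL can be passed to any HTTP client; the decoded JSON then goes through `summarize_events` before the attached files are downloaded.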
Preprocessing:
- Convert documents from multiple formats (PDFs, Word, HTML) into plain text, using libraries like PyMuPDF or pdfminer for PDFs and suitable parsers for the Word and HTML sources.
- Apply NLP preprocessing steps: tokenization, stop-word removal, lemmatization.
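
As an illustration of the HTML branch of this step, here is a minimal sketch using only the Python standard library. The chunking helper is an assumption added for illustration (the issue does not specify a chunk size or overlap), and all function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(part.split()) for part in parser.parts)


def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (sizes are assumptions)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Splitting on words keeps the sketch dependency-free; a token-based splitter matched to the embedding model would be the more precise choice.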
Vectorization:
- Use the existing transformer-based model (e.g., text-embedding-ada-002) to generate vector embeddings for the text data.
- Ensure that each embedding is stored with metadata, such as title, source (Indico/public), and date.
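
One way to pair each embedding with its metadata is sketched below. The embedding call is injected as `embed_fn`, a hypothetical callable standing in for whatever model client the project already uses; the record layout (`id`, `values`, `metadata`) and the `chunk_index` and `text` fields are assumptions, not the project's actual schema.

```python
import hashlib


def build_records(chunks, title, source, date, embed_fn):
    """Turn text chunks into records carrying a vector plus metadata.

    embed_fn is a placeholder: any callable mapping a string to a list of
    floats. IDs are derived from source, title, and chunk position, so
    re-running ingestion on the same document yields the same IDs.
    """
    if source not in ("indico", "public"):
        raise ValueError(f"unknown source: {source!r}")
    records = []
    for i, chunk in enumerate(chunks):
        key = f"{source}|{title}|{i}".encode("utf-8")
        records.append({
            "id": hashlib.sha1(key).hexdigest(),
            "values": embed_fn(chunk),
            "metadata": {
                "title": title,
                "source": source,
                "date": date,
                "chunk_index": i,
                "text": chunk,
            },
        })
    return records
```

Deterministic IDs mean a repeated ingestion run overwrites existing entries instead of duplicating them, provided the store treats writes as upserts.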
Indexing in Vector Database:
- Update the vector database schema to include these new data sources.
- Insert the new embeddings and metadata into the Pinecone index.
Testing and Validation:
- Test queries to ensure proper retrieval of relevant documents from both internal and newly added public sources.
- Validate accuracy and relevance of the results to ensure the system is functioning as expected.
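
The validation step could be made measurable with a small hit-rate check, sketched here over an in-memory index so it runs without a database. The document IDs, vectors, and the choice of hit rate at k as the metric are assumptions for illustration; in practice the ranking would come from the vector database's own query call.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the IDs of the k vectors in index most similar to query_vec."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def hit_rate_at_k(test_cases, index, k=3):
    """Fraction of (query_vector, expected_id) pairs found in the top k."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    hits = sum(1 for query_vec, expected in test_cases
               if expected in top_k(query_vec, index, k))
    return hits / len(test_cases)
```

A small hand-labelled set of queries, with expected documents from both the existing and the newly added sources, is enough to track this number across ingestion runs.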

Proposed Code Changes:

API Integration:
Extend the current codebase for ingestion to integrate with the Indico API for fetching relevant events and document data.
Vectorization Pipeline:
Modify the existing preprocessing and vectorization pipeline to handle documents from both public and Indico sources.
Database Update:
Adjust the database schema to accommodate new metadata fields such as source and event_date.
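
The new metadata fields could be modelled as below. This is a sketch under assumptions: the class name, the allowed `source` values, and the choice to store `event_date` as an ISO 8601 string (and to omit it when unknown, since vector stores commonly accept only flat, non-null metadata values) are illustrative rather than taken from the existing codebase.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed set of source labels, based on the two sources named in this issue.
ALLOWED_SOURCES = {"indico", "public"}


@dataclass(frozen=True)
class DocumentMetadata:
    """Hypothetical metadata record for one ingested document."""
    title: str
    source: str
    event_date: Optional[date] = None

    def __post_init__(self):
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"source must be one of {sorted(ALLOWED_SOURCES)}")

    def to_flat_dict(self):
        """Return a flat dict of simple values, leaving out unknown dates."""
        flat = {"title": self.title, "source": self.source}
        if self.event_date is not None:
            flat["event_date"] = self.event_date.isoformat()
        return flat
```

ISO date strings sort lexicographically in date order; if the store only supports range filters on numbers, a numeric timestamp field would be needed instead.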

References:

Indico API Documentation: https://docs.getindico.io/en/latest/
Pinecone Documentation: https://docs.pinecone.io/docs/

Priority:

Medium - Enhancing the vector database with these sources will greatly improve the overall retrieval quality and allow users to access a broader range of documents and presentations.

karthik18495 added enhancement New feature or request help wanted Extra attention is needed labels Oct 1, 2024

karthik18495 self-assigned this Oct 1, 2024

karthik18495 mentioned this issue Oct 2, 2024

Can build a docker image? #5

Open
