GitHub Issue: Extend Vector Database with Public and Indico Pages for EIC Information
Issue Title:
Extend the Vector Database to Include Information from Public and Indico Pages for EIC
Description:
The Electron-Ion Collider (EIC) project would benefit from expanding the existing vector database to include data from public and Indico pages. This will enhance the system's ability to retrieve relevant documents and presentations for users. By incorporating these sources, we can provide a more comprehensive dataset for retrieval and improve the contextual quality of the responses.
This issue proposes:
Collecting presentations and documents from EIC-related Indico events and public web pages.
Preprocessing the collected documents (PDFs, text, HTML) and converting them into embeddings.
Storing these embeddings along with metadata in the current vector database.
Use Case:
Users can query the extended vector database to retrieve specific EIC-related documents, presentations, or meeting notes, allowing them to discover both internal and public information from Indico and other public sources.
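As a rough illustration of this use case, the sketch below ranks stored chunks by cosine similarity against a query vector and optionally filters by origin. The in-memory list, the record layout, and the `search` helper are assumptions made for the example; the project's actual vector database and its query API are not described in this issue.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vector, top_k=3, source=None):
    """Return the top_k records most similar to query_vector.

    `records` is a hypothetical list of dicts with "embedding" and
    "metadata" keys; `source` optionally restricts results to one origin
    (for example "indico" or "public").
    """
    candidates = [
        r for r in records
        if source is None or r["metadata"]["source"] == source
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

A real deployment would delegate ranking to the vector database itself; the point here is only that a `source` metadata field lets a user restrict a query to Indico or public material.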
Tasks:
Data Collection:
Set up scripts to fetch data from Indico using its API, extracting event details, presentations, and associated documents.
Scrape public pages relevant to EIC (e.g., documentation pages, wikis) for documents, presentations, and other useful content.
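A minimal sketch of the Indico fetching step, using only the Python standard library. It assumes the target instance exposes Indico's HTTP export API under `/export/categ/<id>.json` and that each event in the response carries `id`, `title`, `startDate`, and `url` fields; both should be verified against the actual Indico server. The base URL, category ID, and helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_category_export_url(base_url, category_id, start, end,
                              detail="contributions"):
    """Build an export URL for all events of one category in a date range."""
    query = urlencode({"from": start, "to": end, "detail": detail})
    return f"{base_url.rstrip('/')}/export/categ/{category_id}.json?{query}"


def fetch_json(url, token=None):
    """GET a URL and decode the JSON body; `token` is an optional API token."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    with urlopen(Request(url, headers=headers), timeout=30) as response:
        return json.load(response)


def summarize_events(payload):
    """Reduce an export response to the fields the ingestion step needs."""
    events = []
    for event in payload.get("results", []):
        start = event.get("startDate") or {}
        events.append({
            "event_id": event.get("id"),
            "title": event.get("title", ""),
            "event_date": start.get("date"),  # assumed "YYYY-MM-DD" string
            "url": event.get("url"),
            "source": "indico",
        })
    return events
```

Usage would look like `summarize_events(fetch_json(build_category_export_url("https://indico.example.org", 42, "2024-01-01", "2024-01-31")))`, where the host and category 42 are placeholders.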
Preprocessing:
Convert documents from multiple formats (PDFs, Word, HTML) into plain text using libraries like PyMuPDF or pdfminer.
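The conversion step could be sketched as follows. HTML is handled with the standard library, and PDFs with PyMuPDF (imported as `fitz`), one of the two libraries named above. The function names are hypothetical, Word documents are not covered, and scanned PDFs without a text layer would need OCR, which this sketch does not attempt.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    _SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def collected_text(self):
        # Join all fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one normalized line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.collected_text()


def pdf_to_text(path):
    """Return the text layer of a PDF, one page after another."""
    import fitz  # PyMuPDF; imported lazily so HTML-only use needs no extra package

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)
```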
Vectorization:
Use an embedding model (e.g., text-embedding-ada-002) to generate vector embeddings for the text data.
Indexing in Vector Database:
Testing and Validation:
Proposed Code Changes:
API Integration:
Extend the current codebase for ingestion to integrate with the Indico API for fetching relevant events and document data.
Vectorization Pipeline:
Modify the existing preprocessing and vectorization pipeline to handle documents from both public and Indico sources.
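One possible shape for a source-aware pipeline is sketched below: text is split into overlapping word windows, each chunk is embedded through a pluggable `embed` callable, and every record is tagged with where it came from. The chunk sizes, the record layout, and all helper names are assumptions. `toy_embed` is only a deterministic stand-in so the example runs offline; a real pipeline would pass a function that calls the project's embedding model instead.

```python
import hashlib


def chunk_text(text, max_words=200, overlap=40):
    """Split text into word windows of max_words that overlap by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text, dim=8):
    """Placeholder embedding: hashed bag-of-words counts (NOT a real model)."""
    vector = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    return vector


def build_records(text, source, embed, event_date=None, url=None,
                  max_words=200, overlap=40):
    """Chunk a document and attach an embedding plus metadata to each chunk."""
    records = []
    for index, chunk in enumerate(chunk_text(text, max_words, overlap)):
        key = f"{source}|{url}|{index}".encode("utf-8")
        records.append({
            "id": hashlib.sha1(key).hexdigest(),
            "text": chunk,
            "embedding": embed(chunk),
            "metadata": {
                "source": source,          # e.g. "indico" or "public"
                "event_date": event_date,  # None for pages without an event
                "url": url,
                "chunk_index": index,
            },
        })
    return records
```

Keeping `embed` as a parameter means the same function can serve both Indico and public documents, with only the `source`, `event_date`, and `url` arguments changing per origin.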
Database Update:
Adjust the database schema to accommodate new metadata fields such as source and event_date.
References:
Priority:
Medium - Enhancing the vector database with these sources will greatly improve the overall retrieval quality and allow users to access a broader range of documents and presentations.