Extend the vector database. #2

Open
karthik18495 opened this issue Oct 1, 2024 · 0 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)


karthik18495 commented Oct 1, 2024

GitHub Issue: Extend Vector Database with Public and Indico Pages for EIC Information

Issue Title:
Extend the Vector Database to Include Information from Public and Indico Pages for EIC


Description:

The Electron-Ion Collider (EIC) project would benefit from expanding the existing vector database to include data from public web pages and Indico pages. Adding these sources gives the retrieval system a larger set of documents and presentations to draw on, which should improve the context available for responses.

This issue proposes:

  • Collecting presentations and documents from EIC-related Indico events and public web pages.
  • Preprocessing the collected documents (PDFs, text, HTML) and converting them into embeddings.
  • Storing these embeddings along with metadata in the current vector database.
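
The Indico part of the collection step could start from something like the sketch below. It assumes the target instance exposes Indico's HTTP export API under /export/categ/<id>.json and accepts an API token as a Bearer header. The host indico.example.org, the category ID, and the JSON field names read in list_attachments (folders, attachments, download_url) are assumptions and should be checked against a real response before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_export_url(base_url, category_id, start, end, detail="contributions"):
    """Build an export URL for the events of one category in a date range."""
    query = urlencode({"from": start, "to": end, "detail": detail})
    return f"{base_url.rstrip('/')}/export/categ/{category_id}.json?{query}"


def fetch_json(url, token=None):
    """Download and decode one export response (token is optional)."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def list_attachments(export_payload):
    """Flatten an export response into one record per attached file.

    The field names used here are assumed, not taken from the issue.
    """
    records = []
    for event in export_payload.get("results", []):
        # Look at material attached to the event and to each contribution.
        holders = [event] + event.get("contributions", [])
        for holder in holders:
            for folder in holder.get("folders", []):
                for attachment in folder.get("attachments", []):
                    records.append({
                        "event_id": event.get("id"),
                        "event_title": event.get("title"),
                        "event_date": (event.get("startDate") or {}).get("date"),
                        "title": attachment.get("title"),
                        "url": attachment.get("download_url"),
                        "source": "indico",
                    })
    return records
```

A caller would combine the pieces as list_attachments(fetch_json(build_export_url("https://indico.example.org", 42, "2024-01-01", "2024-12-31"))), then download each url for preprocessing.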

Use Case:

Users can query the extended vector database to retrieve specific EIC-related documents, presentations, or meeting notes, allowing them to discover both internal and public information from Indico and other public sources.
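
To make the use case concrete, here is a small in-memory stand-in for the query path: rank stored vectors by cosine similarity and optionally restrict results by the source metadata field. The record layout (id, values, metadata) and the search function are hypothetical; in the actual system the vector database would perform this ranking and filtering.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, records, top_k=3, source=None):
    """Return (id, score) pairs for the top_k most similar records.

    If source is given, only records whose metadata["source"] matches
    are considered, e.g. source="indico" or source="public".
    """
    candidates = [
        r for r in records
        if source is None or r["metadata"]["source"] == source
    ]
    scored = [(cosine_similarity(query_vector, r["values"]), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(r["id"], score) for score, r in scored[:top_k]]
```

Filtering on source lets a user ask for Indico material only, public pages only, or both, without maintaining separate indexes.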


Tasks:

  1. Data Collection:

    • Set up scripts to fetch data from Indico using its API, extracting event details, presentations, and associated documents.
    • Scrape public pages relevant to EIC (e.g., documentation pages, wikis) for documents, presentations, and other useful content.
  2. Preprocessing:

    • Convert documents from multiple formats (PDF, Word, HTML) into plain text, using libraries like PyMuPDF or pdfminer for the PDFs.
    • Apply NLP preprocessing steps: tokenization, stop-word removal, lemmatization.
  3. Vectorization:

    • Use the existing transformer-based model (e.g., text-embedding-ada-002) to generate vector embeddings for the text data.
    • Ensure that each embedding is stored with metadata, such as title, source (Indico/public), and date.
  4. Indexing in Vector Database:

    • Update the vector database schema to include these new data sources.
    • Insert the new embeddings and metadata into the Pinecone database.
  5. Testing and Validation:

    • Run test queries to confirm that relevant documents are retrieved from both the internal sources and the newly added public sources.
    • Validate accuracy and relevance of the results to ensure the system is functioning as expected.
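
One possible way to put a number on the validation task is recall@k over a small hand-labelled query set. The sketch below is only an illustration: the test-case layout (query, relevant_ids) and the retrieve callable are hypothetical, and the issue does not prescribe a specific metric.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def evaluate(test_cases, retrieve, k=5):
    """Average recall@k over labelled test cases.

    test_cases: list of {"query": str, "relevant_ids": [str, ...]}
    retrieve:   callable mapping a query string to a ranked list of ids
    """
    scores = [
        recall_at_k(retrieve(case["query"]), case["relevant_ids"], k)
        for case in test_cases
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Running the same labelled queries before and after the new sources are indexed would show whether the added documents are actually being retrieved.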

Proposed Code Changes:

  • API Integration:
    Extend the current codebase for ingestion to integrate with the Indico API for fetching relevant events and document data.

  • Vectorization Pipeline:
    Modify the existing preprocessing and vectorization pipeline to handle documents from both public and Indico sources.

  • Database Update:
    Adjust the database schema to accommodate new metadata fields such as source and event_date.
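
As an illustration of the schema change, the sketch below models one stored chunk with the new source and event_date fields and converts it to the id/values/metadata dictionary shape that Pinecone upserts accept. The DocumentChunk class and its other field names are hypothetical. The date is stored as an ISO string and omitted when unknown, on the assumption that metadata values must be strings, numbers, booleans, or lists of strings.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ALLOWED_SOURCES = ("indico", "public")


@dataclass
class DocumentChunk:
    chunk_id: str
    embedding: List[float]
    title: str
    source: str                       # "indico" or "public"
    url: str
    event_date: Optional[date] = None  # only known for Indico events

    def to_vector(self):
        """Return a dict with id, values, and metadata for one upsert entry."""
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        metadata = {"title": self.title, "source": self.source, "url": self.url}
        if self.event_date is not None:
            # Store dates as ISO strings; skip the key when there is no date.
            metadata["event_date"] = self.event_date.isoformat()
        return {"id": self.chunk_id, "values": self.embedding, "metadata": metadata}
```

A batch of chunks could then be written with a call along the lines of index.upsert(vectors=[chunk.to_vector() for chunk in chunks]), where index is the existing Pinecone index handle.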



Priority:

Medium - Enhancing the vector database with these sources will greatly improve the overall retrieval quality and allow users to access a broader range of documents and presentations.

@karthik18495 karthik18495 added enhancement New feature or request help wanted Extra attention is needed labels Oct 1, 2024
@karthik18495 karthik18495 self-assigned this Oct 1, 2024
Projects
Status: Todo