prudhvi1709/iitmdocs

Document Embedding with Weaviate Cloud and OpenAI

This project embeds all files from the src/ directory into Weaviate Cloud using OpenAI's text-embedding-3-small model for semantic search and retrieval. It includes both a web frontend for direct searching and a Cloudflare Worker that exposes an API for AI-powered question answering.

Overview

The embedding system processes the 25 Markdown files of academic program documentation in src/ and stores them in Weaviate Cloud with vector embeddings generated by OpenAI's text-embedding-3-small model. A companion Cloudflare Worker provides semantic document search and AI-powered question answering backed by Weaviate and OpenAI's GPT-4o-mini.

Files Structure

├── src/                          # Source documents to embed (25 .md files)
├── embed_files.py               # Main embedding script (with inline dependencies for uv)
├── run_tests.py                # Test runner for search quality validation
├── weaviate-search.js          # Concise frontend search client (114 lines)
├── search.html                 # Web search interface (iframe-ready)
├── iframe-test.html            # Demo page showing iframe integration
├── iframe-demo.html            # Additional iframe demo
├── qa-interface.html           # Q&A interface using the Cloudflare Worker
├── cloudflare-worker.js        # Cloudflare Worker for semantic Q&A
├── wrangler.toml              # Cloudflare configuration
├── serve.py                   # HTTP server for local development
└── README.md                  # This documentation

Prerequisites

  1. Weaviate Cloud Account: Sign up at Weaviate Cloud
  2. OpenAI API Key: Get your API key from OpenAI Platform
  3. Python 3.8+: Ensure Python is installed on your system

Setup Instructions

1. Quick Start with uv (Recommended)

The script includes inline dependencies and can be run directly with uv:

uv run embed_files.py
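The inline metadata follows PEP 723, which is why uv can resolve dependencies without a separate requirements file. The header of embed_files.py looks roughly like this (the exact package list is an assumption based on the libraries the script uses):

```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "weaviate-client",
#     "openai",
#     "python-dotenv",
# ]
# ///
```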

2. Configure Environment Variables

Copy the example environment file and fill in your credentials:

cp .env.example .env

Edit the .env file with your actual credentials:

# Weaviate Cloud Configuration
WEAVIATE_URL=https://your-cluster-name.c0.asia-southeast1.gcp.weaviate.cloud
WEAVIATE_API_KEY=your_weaviate_api_key_here

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

Important: The .env file is ignored by git to prevent accidental secret commits.

3. Run the Embedding Script

Execute the main script to embed all documents:

With uv (recommended):

uv run embed_files.py

If you haven't set environment variables, the script will prompt you for:

  • Weaviate Cloud URL
  • Weaviate API Key
  • OpenAI API Key

How It Works

1. Document Processing

  • Scans all files in the src/ directory
  • Reads file content and metadata (filename, path, size, extension)
  • Generates SHA256 hash for duplicate detection
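A minimal sketch of that scan step (function and field names here are illustrative, not necessarily the ones used in embed_files.py):

```python
import hashlib
from pathlib import Path

def scan_documents(src_dir):
    """Collect content and metadata for every Markdown file under src_dir."""
    docs = []
    for path in sorted(Path(src_dir).glob("*.md")):
        content = path.read_text(encoding="utf-8")
        docs.append({
            "filename": path.name,
            "filepath": str(path),
            "content": content,
            "file_size": path.stat().st_size,
            "file_extension": path.suffix,
            # SHA256 of the content, used later for duplicate detection
            "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        })
    return docs
```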

2. Weaviate Schema

The script creates a Document collection with the following properties:

  • filename: Name of the source file
  • filepath: Full path to the source file
  • content: Complete file content
  • file_size: File size in bytes
  • content_hash: SHA256 hash for duplicate detection
  • file_extension: File extension (.md)
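Expressed in Weaviate's REST schema format, the collection looks roughly like this (a sketch for orientation; the script itself creates the collection through the Python client):

```python
# Sketch of the Document collection definition in Weaviate's REST schema
# format. Only the "content" property is vectorized; the rest is stored
# metadata used for display and duplicate detection.
DOCUMENT_SCHEMA = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"model": "text-embedding-3-small"},
    },
    "properties": [
        {"name": "filename", "dataType": ["text"]},
        {"name": "filepath", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "file_size", "dataType": ["int"]},
        {"name": "content_hash", "dataType": ["text"]},
        {"name": "file_extension", "dataType": ["text"]},
    ],
}
```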

3. Vector Embeddings

  • Uses OpenAI's text-embedding-3-small model (1536 dimensions)
  • Automatically generates embeddings for the content field
  • Enables semantic search capabilities

4. Duplicate Handling

  • Checks content hash before insertion
  • Skips files with identical content to avoid duplicates
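The skip logic amounts to checking each file's hash against those already stored. A simplified in-memory version (the real script compares against hashes already present in Weaviate):

```python
def filter_new_documents(docs, existing_hashes):
    """Return only documents whose content hash has not been seen before."""
    seen = set(existing_hashes)
    new_docs = []
    for doc in docs:
        if doc["content_hash"] in seen:
            continue  # identical content already embedded; skip it
        seen.add(doc["content_hash"])
        new_docs.append(doc)
    return new_docs
```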

Embedding Results

Successfully embedded all 25 documents with the following results:

  • Total Documents: 25/25 (100% success rate)
  • Model Used: OpenAI text-embedding-3-small (1536 dimensions)
  • Vector Database: Weaviate Cloud
  • Processing Time: ~20 seconds for all documents
  • Duplicate Detection: SHA256 content hashing implemented
  • Error Rate: 0% (all files processed successfully)

Document Statistics

  • File Types: All Markdown (.md) files
  • Content: Academic program documentation
  • Average File Size: ~2.5KB per document
  • Total Content: ~60KB of text embedded
  • Estimated Embedding Cost: well under $0.01 (≈15K tokens at $0.02 per 1M tokens)

Source Documents

The src/ directory contains 25 academic program documentation files:

  • Academic Documents for students.md
  • Academic aspects.md
  • Admission to the programme.md
  • Alumni Details.md
  • Apprenticeship in the BS level.md
  • Changes in project grading.md
  • Course registration - steps involved.md
  • Courses in the programme.md
  • Credit Clearing Capability.md
  • Credit Transfer.md
  • Design of certificates for the 4 levels of the program.md
  • Direct Entry into Diploma programme.md
  • Eligibility Criteria Prize.md
  • Fees for the entire programme.md
  • Flexibility.md
  • Highlights of the programme.md
  • Learner Life Cycle.md
  • Learning paths available.md
  • New Rules for Foundation & Diploma Level Completion.md
  • Non Academic Rules.md
  • Pathways to get admission to Masters.md
  • Re Entry after Diploma.md
  • Software and Hardware Requirements.md
  • Timeline for original certificate.md
  • intro.md

Verifying the Embedded Documents

After running the embedding script, you can verify that all documents were successfully embedded using curl commands:

Check Total Document Count

curl -s -X POST \
  'https://your-cluster-url.weaviate.network/v1/graphql' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
      "query": "{ Aggregate { Document { meta { count } } } }"
  }' | python3 -m json.tool

List All Document Filenames

curl -s -X POST \
  'https://your-cluster-url.weaviate.network/v1/graphql' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
      "query": "{ Get { Document { filename } } }"
  }' | python3 -c "import sys, json; data=json.load(sys.stdin); docs=data['data']['Get']['Document']; print('\\n'.join([f'{i+1:2}. {doc[\"filename\"]}' for i, doc in enumerate(docs)]))"

View Sample Documents

curl -s -X GET \
  'https://your-cluster-url.weaviate.network/v1/objects?class=Document&limit=3' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' | python3 -m json.tool

Querying the Embedded Documents

After embedding, you can query the documents using Weaviate's GraphQL API or Python client. Here are some examples:

Semantic Search Example

import weaviate

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-weaviate-url",
    auth_credentials=weaviate.AuthApiKey("your-api-key"),
    headers={"X-OpenAI-Api-Key": "your-openai-key"}
)

collection = client.collections.get("Document")

# Semantic search
response = collection.query.near_text(
    query="admission requirements",
    limit=5
)

for obj in response.objects:
    print(f"File: {obj.properties['filename']}")
    print(f"Content snippet: {obj.properties['content'][:200]}...")
    print("---")

client.close()

Hybrid Search Example

# Combine semantic and keyword search
response = collection.query.hybrid(
    query="course registration process",
    limit=3
)

Features

  • Automatic Schema Creation: Creates Weaviate schema if it doesn't exist
  • Duplicate Detection: Uses content hashing to avoid duplicate embeddings
  • Error Handling: Comprehensive error handling and logging
  • Batch Processing: Efficiently processes all files in the directory
  • Metadata Preservation: Stores file metadata alongside content
  • OpenAI Integration: Uses latest text-embedding-3-small model
  • Environment Configuration: Flexible credential management
  • uv Support: PEP 723 inline dependencies for modern Python tooling
  • Full Document View: Expandable previews with complete document content
  • Concise Codebase: Optimized JavaScript client (114 lines)

Troubleshooting

Common Issues

  1. Connection Errors

    • Verify your Weaviate Cloud URL and API key
    • Check network connectivity
  2. OpenAI API Errors

    • Ensure your OpenAI API key is valid and has sufficient credits
    • Check rate limits if processing many files
  3. File Reading Errors

    • Ensure all files in src/ are readable
    • Check file encoding (UTF-8 expected)

Logging

The script provides detailed logging output. Check the console for:

  • Connection status
  • Processing progress
  • Success/failure counts
  • Error details

Cost Considerations

OpenAI API Costs

  • Model: text-embedding-3-small
  • Pricing: ~$0.00002 per 1K tokens
  • Estimated cost for 25 documents: well under $0.01 (~15K tokens total)
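A back-of-the-envelope calculation for ~60KB of text, assuming the common heuristic of roughly 4 characters per token:

```python
# Rough embedding cost for ~60KB of text at text-embedding-3-small
# pricing ($0.02 per 1M tokens). The 4-chars-per-token ratio is a
# heuristic, not an exact tokenizer count.
total_chars = 60 * 1024
tokens = total_chars / 4               # ≈ 15,360 tokens
cost = tokens * 0.02 / 1_000_000
print(f"~{tokens:.0f} tokens, ~${cost:.4f}")  # → ~15360 tokens, ~$0.0003
```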

Weaviate Cloud Costs

  • Free tier available for development
  • Usage-based pricing for production

Security Notes

  • Never commit API keys to version control - web-interface keys are entered by users at runtime
  • Local storage only - keys live in browser localStorage and are sent only to Weaviate and OpenAI, never to this project's server
  • Environment variables - use the .env file for embedding-script credentials
  • Key rotation - consider rotating API keys regularly
  • Minimal permissions - restrict API key permissions where possible
  • Cluster URL - the Weaviate cluster URL is not a secret, but access is still gated by the API key

Testing and Quality Assurance

The project includes comprehensive test suites to validate search quality and document retrieval:

Running Quality Tests

Execute the test suite to verify search functionality:

uv run run_tests.py

Test Results

All tests passing with excellent performance:

  • Total Tests: 10/10 passed (100% pass rate)
  • Keyword Accuracy: 95% average score
  • File Relevance: 100% accuracy
  • Average Response Time: ~676ms per query
  • Search Coverage: Tests for all major document categories

Test Categories

The test suite covers:

  • Basic Information Retrieval: Admission, fees, requirements
  • Academic Policies: Registration, prerequisites, certificates
  • Program Structure: Learning paths, flexibility, re-entry
  • Technical Requirements: Software, hardware specifications
  • Complex Queries: Multi-concept searches and comparisons
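A sketch of how a keyword-accuracy score like the one reported above can be computed (the shape of the test cases here is an assumption, not necessarily the exact structure used in run_tests.py):

```python
def keyword_score(result_text, expected_keywords):
    """Fraction of expected keywords found in the retrieved text."""
    text = result_text.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Hypothetical test case: query plus keywords expected in the top result.
case = {
    "query": "What are the admission requirements?",
    "expected_keywords": ["admission", "qualifier", "eligibility"],
}
```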


Web Frontend

A complete web interface is included for searching the embedded documents:

🌐 Running the Web Interface

# Option 1: Use included development server
python3 serve.py

# Option 2: Use Python's built-in server
python3 -m http.server 8000

# Open in browser: http://localhost:8000/search.html

📱 Features

  • 🔍 Semantic Search: AI-powered document search using OpenAI embeddings
  • 📱 Responsive Design: Works on desktop, tablet, and mobile devices
  • 🖼️ Iframe Ready: Optimized for embedding in other websites
  • ⚡ Real-time Results: Instant search with relevance scoring
  • 🎨 Professional UI: Clean Bootstrap-only design with full document preview
  • 💾 Local Storage: API keys stored securely in browser

🔧 API Requirements

The frontend requires both API keys because:

  • Weaviate API Key: Authenticates with your Weaviate Cloud cluster
  • OpenAI API Key: Required for Weaviate to vectorize search queries in real-time

Why OpenAI key? When you search, Weaviate uses OpenAI's embedding model to convert your search text into vectors that can be compared with the stored document vectors.

🖼️ Embedding in Your Website

<iframe src="http://your-domain.com/search.html" 
        width="100%" 
        height="800" 
        frameborder="0"
        title="Academic Document Search">
</iframe>

Test the iframe integration: Open iframe-test.html to see how it looks embedded.

🤖 Embeddable Q&A Chatbot

Host qa-interface.html on your site and embed it using:

<iframe src="qa-interface.html" width="400" height="400" style="border:0" title="IITM Q&A"></iframe>

See index.html for a live example and additional guidance.

🛠️ JavaScript API

You can also use the JavaScript client directly:

// Include the library
<script src="weaviate-search.js"></script>

// Initialize and search (concise API)
const searcher = new WeaviateSearch();
searcher.setCredentials('your-weaviate-key');
searcher.setOpenAIKey('your-openai-key');

const results = await searcher.searchDocuments('admission requirements');
console.log(results);

Cloudflare Worker for Semantic Q&A

The project includes a Cloudflare Worker that provides semantic document search and AI-powered question answering using Weaviate and OpenAI's GPT-4o-mini.

Worker Features

  • Cross-Origin Support: Handles CORS for web applications
  • Semantic Search: Uses Weaviate vector database for document retrieval
  • Streaming Responses: Real-time text/event-stream output
  • AI-Powered Answers: GPT-4o-mini generates contextual responses
  • Configurable Results: Customizable number of documents (ndocs)

API Endpoint

POST /answer

Request:

{
  "q": "What are the admission requirements?",
  "ndocs": 5
}

ndocs is optional and defaults to 5.

Response: text/event-stream

data: {"type": "document", "relevance": 0.95, "text": "content...", "link": "https://github.com/..."}

data: {"type": "document", "relevance": 0.87, "text": "content...", "link": "https://github.com/..."}

data: {"type": "chunk", "text": "Based on the documents, the admission requirements are..."}

data: {"type": "chunk", "text": " You need to have completed..."}
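A client can process that stream line by line. A minimal sketch of the parsing side, independent of any particular HTTP library:

```python
import json

def parse_sse_events(lines):
    """Yield decoded JSON payloads from 'data: {...}' lines of an event stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

def collect_answer(lines):
    """Split events into retrieved documents and a concatenated answer."""
    docs, answer = [], []
    for event in parse_sse_events(lines):
        if event["type"] == "document":
            docs.append(event)
        elif event["type"] == "chunk":
            answer.append(event["text"])
    return docs, "".join(answer)
```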

Worker Setup Instructions

1. Install Wrangler CLI

npm install -g wrangler

2. Configure Environment

Copy the example environment file:

cp .dev.vars.example .dev.vars

Edit .dev.vars with your actual API keys:

WEAVIATE_URL=https://your-cluster.c0.asia-southeast1.gcp.weaviate.cloud
WEAVIATE_API_KEY=your_weaviate_api_key
OPENAI_API_KEY=your_openai_api_key

3. Deploy

Automated deployment:

./deploy-worker.sh

Manual deployment:

# Set production secrets
wrangler secret put WEAVIATE_URL
wrangler secret put WEAVIATE_API_KEY  
wrangler secret put OPENAI_API_KEY

# Deploy
wrangler deploy

4. Local Development

wrangler dev

Your worker will be available at http://localhost:8787

Worker Testing

Test the worker with curl:

curl -X POST https://your-worker-url/answer \
  -H 'Content-Type: application/json' \
  -d '{"q": "What are the admission requirements?", "ndocs": 3}'

Worker Environment Variables

The worker reads the following environment variables:

  • WEAVIATE_URL: Your Weaviate cluster URL
  • WEAVIATE_API_KEY: Weaviate API key
  • OPENAI_API_KEY: OpenAI API key for embeddings and chat completion

Worker Configuration

The worker is configured via wrangler.toml:

  • Compatibility Date: 2024-01-01
  • Node.js Compatibility: Enabled for fetch streaming
  • Memory: Standard (128MB)
  • CPU Time: Standard (10ms)

Worker Usage Examples

JavaScript/Fetch

const response = await fetch('https://your-worker-url/answer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    q: 'How do I register for courses?',
    ndocs: 3
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  
  const chunk = decoder.decode(value);
  // Process server-sent events
  console.log(chunk);
}

Worker Response Format

Document Events

{
  "type": "document",
  "relevance": 0.95,
  "text": "Full document content...",
  "link": "https://github.com/user/repo/blob/main/src/filename.md"
}

Answer Chunks

{
  "type": "chunk", 
  "text": "Partial answer text..."
}

Worker Error Handling

The worker returns appropriate HTTP status codes:

  • 200: Success with streaming response
  • 400: Bad request (missing "q" parameter)
  • 404: Invalid endpoint
  • 500: Internal server error

Worker Security Notes

  • API keys are stored as Cloudflare secrets (encrypted)
  • CORS is enabled for cross-origin requests
  • No user authentication required
  • Rate limiting depends on your Cloudflare plan

Worker Troubleshooting

Common Issues

  1. Worker not responding: Check deployment status with wrangler deployments list
  2. API errors: Verify environment variables are set correctly
  3. CORS issues: Ensure preflight OPTIONS requests are handled
  4. Streaming problems: Check for proper content-type headers

Debug Commands

# View worker logs
wrangler tail

# Check environment variables
wrangler secret list

# Test locally
wrangler dev --local

Next Steps

After embedding, consider:

  1. Building a search interface
  2. Implementing retrieval-augmented generation (RAG)
  3. Adding more document types
  4. Setting up automated re-embedding workflows
  5. Implementing user authentication for queries
  6. Expanding test coverage for edge cases

Support

For issues related to this project, please open an issue on the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

IIT Madras academic program document search using AI-powered semantic embeddings
