prudhvi1709/iitmdocs

Document Embedding with Weaviate Cloud and OpenAI

This project embeds all files from the src/ directory into Weaviate Cloud using OpenAI's text-embedding-3-small model for semantic search and retrieval. It includes both a web frontend for direct searching and a Cloudflare Worker that exposes an API for AI-powered question answering.

Overview

The embedding system processes the 25 Markdown files of academic program documentation in src/ and stores them in Weaviate Cloud with vector embeddings generated by OpenAI's text-embedding-3-small model. A companion Cloudflare Worker provides semantic document search and AI-powered question answering backed by Weaviate and OpenAI's GPT-4o-mini.

Files Structure

├── src/                          # Source documents to embed (25 .md files)
├── embed_files.py               # Main embedding script (with inline dependencies for uv)
├── run_tests.py                # Test runner for search quality validation
├── weaviate-search.js          # Concise frontend search client (114 lines)
├── search.html                 # Web search interface (iframe-ready)
├── iframe-test.html            # Demo page showing iframe integration
├── iframe-demo.html            # Additional iframe demo
├── qa-interface.html           # Q&A interface using the Cloudflare Worker
├── cloudflare-worker.js        # Cloudflare Worker for semantic Q&A
├── wrangler.toml              # Cloudflare configuration
├── serve.py                   # HTTP server for local development
└── README.md                  # This documentation

Prerequisites

  1. Weaviate Cloud Account: Sign up at Weaviate Cloud
  2. OpenAI API Key: Get your API key from OpenAI Platform
  3. Python 3.8+: Ensure Python is installed on your system

Setup Instructions

1. Quick Start with uv (Recommended)

The script includes inline dependencies and can be run directly with uv:

uv run embed_files.py
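The inline metadata follows PEP 723, which is why uv can resolve dependencies without a separate requirements file. The header of embed_files.py looks roughly like this (the exact package list is an assumption based on the libraries the script uses):

```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "weaviate-client",
#     "openai",
#     "python-dotenv",
# ]
# ///
```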

2. Configure Environment Variables

Copy the example environment file and fill in your credentials:

cp .env.example .env

Edit the .env file with your actual credentials:

# Weaviate Cloud Configuration
WEAVIATE_URL=https://your-cluster-name.c0.asia-southeast1.gcp.weaviate.cloud
WEAVIATE_API_KEY=your_weaviate_api_key_here

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

Important: The .env file is ignored by git to prevent accidental secret commits.

3. Run the Embedding Script

Execute the main script to embed all documents:

With uv (recommended):

uv run embed_files.py

If you haven't set environment variables, the script will prompt you for:

  • Weaviate Cloud URL
  • Weaviate API Key
  • OpenAI API Key

How It Works

1. Document Processing

  • Scans all files in the src/ directory
  • Reads file content and metadata (filename, path, size, extension)
  • Generates SHA256 hash for duplicate detection
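A minimal sketch of that scan step (function and field names here are illustrative, not necessarily the ones used in embed_files.py):

```python
import hashlib
from pathlib import Path

def scan_documents(src_dir):
    """Collect content and metadata for every Markdown file under src_dir."""
    docs = []
    for path in sorted(Path(src_dir).glob("*.md")):
        content = path.read_text(encoding="utf-8")
        docs.append({
            "filename": path.name,
            "filepath": str(path),
            "content": content,
            "file_size": path.stat().st_size,
            "file_extension": path.suffix,
            # SHA256 of the content, used later for duplicate detection
            "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        })
    return docs
```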

2. Weaviate Schema

The script creates a Document collection with the following properties:

  • filename: Name of the source file
  • filepath: Full path to the source file
  • content: Complete file content
  • file_size: File size in bytes
  • content_hash: SHA256 hash for duplicate detection
  • file_extension: File extension (.md)
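Expressed in Weaviate's REST schema format, the collection looks roughly like this (a sketch for orientation; the script itself creates the collection through the Python client):

```python
# Sketch of the Document collection definition in Weaviate's REST schema
# format. Only the "content" property is vectorized; the rest is stored
# metadata used for display and duplicate detection.
DOCUMENT_SCHEMA = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"model": "text-embedding-3-small"},
    },
    "properties": [
        {"name": "filename", "dataType": ["text"]},
        {"name": "filepath", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "file_size", "dataType": ["int"]},
        {"name": "content_hash", "dataType": ["text"]},
        {"name": "file_extension", "dataType": ["text"]},
    ],
}
```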

3. Vector Embeddings

  • Uses OpenAI's text-embedding-3-small model (1536 dimensions)
  • Automatically generates embeddings for the content field
  • Enables semantic search capabilities

4. Duplicate Handling

  • Checks content hash before insertion
  • Skips files with identical content to avoid duplicates
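The skip logic amounts to checking each file's hash against those already stored. A simplified in-memory version (the real script compares against hashes already present in Weaviate):

```python
def filter_new_documents(docs, existing_hashes):
    """Return only documents whose content hash has not been seen before."""
    seen = set(existing_hashes)
    new_docs = []
    for doc in docs:
        if doc["content_hash"] in seen:
            continue  # identical content already embedded; skip it
        seen.add(doc["content_hash"])
        new_docs.append(doc)
    return new_docs
```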

Embedding Results

Successfully embedded all 25 documents with the following results:

  • Total Documents: 25/25 (100% success rate)
  • Model Used: OpenAI text-embedding-3-small (1536 dimensions)
  • Vector Database: Weaviate Cloud
  • Processing Time: ~20 seconds for all documents
  • Duplicate Detection: SHA256 content hashing implemented
  • Error Rate: 0% (all files processed successfully)

Document Statistics

  • File Types: All Markdown (.md) files
  • Content: Academic program documentation
  • Average File Size: ~2.5KB per document
  • Total Content: ~60KB of text embedded
  • Estimated Embedding Cost: well under $0.01 (≈15K tokens at $0.02 per 1M tokens)

Source Documents

The src/ directory contains 25 academic program documentation files:

  • Academic Documents for students.md
  • Academic aspects.md
  • Admission to the programme.md
  • Alumni Details.md
  • Apprenticeship in the BS level.md
  • Changes in project grading.md
  • Course registration - steps involved.md
  • Courses in the programme.md
  • Credit Clearing Capability.md
  • Credit Transfer.md
  • Design of certificates for the 4 levels of the program.md
  • Direct Entry into Diploma programme.md
  • Eligibility Criteria Prize.md
  • Fees for the entire programme.md
  • Flexibility.md
  • Highlights of the programme.md
  • Learner Life Cycle.md
  • Learning paths available.md
  • New Rules for Foundation & Diploma Level Completion.md
  • Non Academic Rules.md
  • Pathways to get admission to Masters.md
  • Re Entry after Diploma.md
  • Software and Hardware Requirements.md
  • Timeline for original certificate.md
  • intro.md

Verifying the Embedded Documents

After running the embedding script, you can verify that all documents were successfully embedded using curl commands:

Check Total Document Count

curl -s -X POST \
  'https://your-cluster-url.weaviate.network/v1/graphql' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
      "query": "{ Aggregate { Document { meta { count } } } }"
  }' | python3 -m json.tool

List All Document Filenames

curl -s -X POST \
  'https://your-cluster-url.weaviate.network/v1/graphql' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
      "query": "{ Get { Document { filename } } }"
  }' | python3 -c "import sys, json; data=json.load(sys.stdin); docs=data['data']['Get']['Document']; print('\\n'.join([f'{i+1:2}. {doc[\"filename\"]}' for i, doc in enumerate(docs)]))"

View Sample Documents

curl -s -X GET \
  'https://your-cluster-url.weaviate.network/v1/objects?class=Document&limit=3' \
  -H 'Authorization: Bearer YOUR_WEAVIATE_API_KEY' \
  -H 'Content-Type: application/json' | python3 -m json.tool

Querying the Embedded Documents

After embedding, you can query the documents using Weaviate's GraphQL API or Python client. Here are some examples:

Semantic Search Example

import weaviate

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-weaviate-url",
    auth_credentials=weaviate.AuthApiKey("your-api-key"),
    headers={"X-OpenAI-Api-Key": "your-openai-key"}
)

collection = client.collections.get("Document")

# Semantic search
response = collection.query.near_text(
    query="admission requirements",
    limit=5
)

for obj in response.objects:
    print(f"File: {obj.properties['filename']}")
    print(f"Content snippet: {obj.properties['content'][:200]}...")
    print("---")

client.close()

Hybrid Search Example

# Combine semantic and keyword search
response = collection.query.hybrid(
    query="course registration process",
    limit=3
)

Features

  • Automatic Schema Creation: Creates Weaviate schema if it doesn't exist
  • Duplicate Detection: Uses content hashing to avoid duplicate embeddings
  • Error Handling: Comprehensive error handling and logging
  • Batch Processing: Efficiently processes all files in the directory
  • Metadata Preservation: Stores file metadata alongside content
  • OpenAI Integration: Uses latest text-embedding-3-small model
  • Environment Configuration: Flexible credential management
  • uv Support: PEP 723 inline dependencies for modern Python tooling
  • Full Document View: Expandable previews with complete document content
  • Concise Codebase: Optimized JavaScript client (114 lines)

Troubleshooting

Common Issues

  1. Connection Errors

    • Verify your Weaviate Cloud URL and API key
    • Check network connectivity
  2. OpenAI API Errors

    • Ensure your OpenAI API key is valid and has sufficient credits
    • Check rate limits if processing many files
  3. File Reading Errors

    • Ensure all files in src/ are readable
    • Check file encoding (UTF-8 expected)

Logging

The script provides detailed logging output. Check the console for:

  • Connection status
  • Processing progress
  • Success/failure counts
  • Error details

Cost Considerations

OpenAI API Costs

  • Model: text-embedding-3-small
  • Pricing: ~$0.00002 per 1K tokens
  • Estimated cost for 25 documents: well under $0.01 (~15K tokens total)
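A back-of-the-envelope calculation for ~60KB of text, assuming the common heuristic of roughly 4 characters per token:

```python
# Rough embedding cost for ~60KB of text at text-embedding-3-small
# pricing ($0.02 per 1M tokens). The 4-chars-per-token ratio is a
# heuristic, not an exact tokenizer count.
total_chars = 60 * 1024
tokens = total_chars / 4               # ≈ 15,360 tokens
cost = tokens * 0.02 / 1_000_000
print(f"~{tokens:.0f} tokens, ~${cost:.4f}")  # → ~15360 tokens, ~$0.0003
```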

Weaviate Cloud Costs

  • Free tier available for development
  • Usage-based pricing for production

Security Notes

  • Never commit API keys to version control - web-interface keys are entered by users at runtime
  • Local storage only - keys live in browser localStorage and are sent only to Weaviate and OpenAI, never to this project's server
  • Environment variables - use the .env file for embedding-script credentials
  • Key rotation - consider rotating API keys regularly
  • Minimal permissions - restrict API key permissions where possible
  • Cluster URL - the Weaviate cluster URL is not a secret, but access is still gated by the API key

Testing and Quality Assurance

The project includes comprehensive test suites to validate search quality and document retrieval:

Running Quality Tests

Execute the test suite to verify search functionality:

uv run run_tests.py

Test Results

All tests passing with excellent performance:

  • Total Tests: 10/10 passed (100% pass rate)
  • Keyword Accuracy: 95% average score
  • File Relevance: 100% accuracy
  • Average Response Time: ~676ms per query
  • Search Coverage: Tests for all major document categories

Test Categories

The test suite covers:

  • Basic Information Retrieval: Admission, fees, requirements
  • Academic Policies: Registration, prerequisites, certificates
  • Program Structure: Learning paths, flexibility, re-entry
  • Technical Requirements: Software, hardware specifications
  • Complex Queries: Multi-concept searches and comparisons
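A sketch of how a keyword-accuracy score like the one reported above can be computed (the shape of the test cases here is an assumption, not necessarily the exact structure used in run_tests.py):

```python
def keyword_score(result_text, expected_keywords):
    """Fraction of expected keywords found in the retrieved text."""
    text = result_text.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Hypothetical test case: query plus keywords expected in the top result.
case = {
    "query": "What are the admission requirements?",
    "expected_keywords": ["admission", "qualifier", "eligibility"],
}
```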


Web Frontend

A complete web interface is included for searching the embedded documents:

🌐 Running the Web Interface

# Option 1: Use included development server
python3 serve.py

# Option 2: Use Python's built-in server
python3 -m http.server 8000

# Open in browser: http://localhost:8000/search.html

📱 Features

  • 🔍 Semantic Search: AI-powered document search using OpenAI embeddings
  • 📱 Responsive Design: Works on desktop, tablet, and mobile devices
  • 🖼️ Iframe Ready: Optimized for embedding in other websites
  • ⚡ Real-time Results: Instant search with relevance scoring
  • 🎨 Professional UI: Clean Bootstrap-only design with full document preview
  • 💾 Local Storage: API keys stored securely in browser

🔧 API Requirements

The frontend requires both API keys because:

  • Weaviate API Key: Authenticates with your Weaviate Cloud cluster
  • OpenAI API Key: Required for Weaviate to vectorize search queries in real-time

Why OpenAI key? When you search, Weaviate uses OpenAI's embedding model to convert your search text into vectors that can be compared with the stored document vectors.

🖼️ Embedding in Your Website

<iframe src="http://your-domain.com/search.html" 
        width="100%" 
        height="800" 
        frameborder="0"
        title="Academic Document Search">
</iframe>

Test the iframe integration: Open iframe-test.html to see how it looks embedded.

🤖 Embeddable Q&A Chatbot

Host qa-interface.html on your site and embed it using:

<iframe src="qa-interface.html" width="400" height="400" style="border:0" title="IITM Q&A"></iframe>

See index.html for a live example and additional guidance.

🛠️ JavaScript API

You can also use the JavaScript client directly:

// Include the library
<script src="weaviate-search.js"></script>

// Initialize and search (concise API)
const searcher = new WeaviateSearch();
searcher.setCredentials('your-weaviate-key');
searcher.setOpenAIKey('your-openai-key');

const results = await searcher.searchDocuments('admission requirements');
console.log(results);

Cloudflare Worker for Semantic Q&A

The project includes a Cloudflare Worker that provides semantic document search and AI-powered question answering using Weaviate and OpenAI's GPT-4o-mini.

Worker Features

  • Cross-Origin Support: Handles CORS for web applications
  • Semantic Search: Uses Weaviate vector database for document retrieval
  • Streaming Responses: Real-time text/event-stream output
  • AI-Powered Answers: GPT-4o-mini generates contextual responses
  • Configurable Results: Customizable number of documents (ndocs)

API Endpoint

POST /answer

Request:

{
  "q": "What are the admission requirements?",
  "ndocs": 5
}

ndocs is optional and defaults to 5.

Response: text/event-stream

data: {"type": "document", "relevance": 0.95, "text": "content...", "link": "https://github.com/..."}

data: {"type": "document", "relevance": 0.87, "text": "content...", "link": "https://github.com/..."}

data: {"type": "chunk", "text": "Based on the documents, the admission requirements are..."}

data: {"type": "chunk", "text": " You need to have completed..."}
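A client can process that stream line by line. A minimal sketch of the parsing side, independent of any particular HTTP library:

```python
import json

def parse_sse_events(lines):
    """Yield decoded JSON payloads from 'data: {...}' lines of an event stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

def collect_answer(lines):
    """Split events into retrieved documents and a concatenated answer."""
    docs, answer = [], []
    for event in parse_sse_events(lines):
        if event["type"] == "document":
            docs.append(event)
        elif event["type"] == "chunk":
            answer.append(event["text"])
    return docs, "".join(answer)
```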

Worker Setup Instructions

1. Install Wrangler CLI

npm install -g wrangler

2. Configure Environment

Copy the example environment file:

cp .dev.vars.example .dev.vars

Edit .dev.vars with your actual API keys:

WEAVIATE_URL=https://your-cluster.c0.asia-southeast1.gcp.weaviate.cloud
WEAVIATE_API_KEY=your_weaviate_api_key
OPENAI_API_KEY=your_openai_api_key

3. Deploy

Automated deployment:

./deploy-worker.sh

Manual deployment:

# Set production secrets
wrangler secret put WEAVIATE_URL
wrangler secret put WEAVIATE_API_KEY  
wrangler secret put OPENAI_API_KEY

# Deploy
wrangler deploy

4. Local Development

wrangler dev

Your worker will be available at http://localhost:8787

Worker Testing

Test the worker with curl:

curl -X POST https://your-worker-url/answer \
  -H 'Content-Type: application/json' \
  -d '{"q": "What are the admission requirements?", "ndocs": 3}'

Worker Environment Variables

The worker reads the following environment variables:

  • WEAVIATE_URL: Your Weaviate cluster URL
  • WEAVIATE_API_KEY: Weaviate API key
  • OPENAI_API_KEY: OpenAI API key for embeddings and chat completion

Worker Configuration

The worker is configured via wrangler.toml:

  • Compatibility Date: 2024-01-01
  • Node.js Compatibility: Enabled for fetch streaming
  • Memory: Standard (128MB)
  • CPU Time: Standard (10ms)

Worker Usage Examples

JavaScript/Fetch

const response = await fetch('https://your-worker-url/answer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    q: 'How do I register for courses?',
    ndocs: 3
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  
  const chunk = decoder.decode(value);
  // Process server-sent events
  console.log(chunk);
}

Worker Response Format

Document Events

{
  "type": "document",
  "relevance": 0.95,
  "text": "Full document content...",
  "link": "https://github.com/user/repo/blob/main/src/filename.md"
}

Answer Chunks

{
  "type": "chunk", 
  "text": "Partial answer text..."
}

Worker Error Handling

The worker returns appropriate HTTP status codes:

  • 200: Success with streaming response
  • 400: Bad request (missing "q" parameter)
  • 404: Invalid endpoint
  • 500: Internal server error

Worker Security Notes

  • API keys are stored as Cloudflare secrets (encrypted)
  • CORS is enabled for cross-origin requests
  • No user authentication required
  • Rate limiting depends on your Cloudflare plan

Worker Troubleshooting

Common Issues

  1. Worker not responding: Check deployment status with wrangler deployments list
  2. API errors: Verify environment variables are set correctly
  3. CORS issues: Ensure preflight OPTIONS requests are handled
  4. Streaming problems: Check for proper content-type headers

Debug Commands

# View worker logs
wrangler tail

# Check environment variables
wrangler secret list

# Test locally
wrangler dev --local

Next Steps

After embedding, consider:

  1. Building a search interface
  2. Implementing retrieval-augmented generation (RAG)
  3. Adding more document types
  4. Setting up automated re-embedding workflows
  5. Implementing user authentication for queries
  6. Expanding test coverage for edge cases

Support

For issues related to this project, please open an issue on the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

IIT Madras academic program document search using AI-powered semantic embeddings
