A custom-built Retrieval-Augmented Generation (RAG) system designed specifically for processing and querying PDF documents with advanced support for visual content like tables, charts, and mathematical formulas. Built entirely from scratch without using frameworks like LangChain, this system demonstrates core RAG principles and advanced vision processing capabilities.
- Framework-Free Architecture: Built from the ground up using only core libraries
- Vision-Enhanced Processing: GPT-4 Vision API integration for analyzing charts, tables, and diagrams
- Intelligent Text Chunking: Smart overlapping chunking with sentence boundary detection
- FAISS Vector Search: High-performance similarity search for document retrieval
- Multi-Modal Queries: Handles both text and visual content queries
- RESTful API: Complete FastAPI backend with interactive documentation
- Web Interface: Modern Streamlit UI for easy document interaction
- Comprehensive Testing: Built-in test suite to verify system functionality
The system follows a modular architecture with clear separation of concerns:
PDF Input → Text Extraction → Vision Analysis → Chunking → Embedding → Vector Storage → Retrieval → Generation → Response
- PDF Processing Pipeline: Extracts both text and visual content from PDFs
- Vision Processing: Uses GPT-4 Vision to analyze images, tables, and charts
- Text Chunking: Intelligently splits documents into overlapping chunks
- Embedding Generation: Creates vector representations using OpenAI embeddings
- Vector Storage: FAISS-based high-performance similarity search
- Query Processing: Retrieves relevant chunks and generates responses
- API Layer: RESTful endpoints for programmatic access
- Web Interface: User-friendly Streamlit application
| File | Purpose | Key Functionality |
|---|---|---|
| `config.py` | Configuration Management | Centralizes all system settings, API keys, and parameters |
| `main.py` | Command Line Interface | Entry point for direct PDF ingestion and querying |
| `rag_pipeline.py` | Main Orchestration | Coordinates the entire RAG workflow from ingestion to response |
| File | Purpose | Key Functionality |
|---|---|---|
| `pdf_extractor.py` | PDF Content Extraction | Extracts text and converts pages to images for vision processing |
| `vision_processor.py` | Visual Content Analysis | Uses GPT-4 Vision API to analyze charts, tables, and diagrams |
| `chunker.py` | Text Chunking | Intelligently splits documents into overlapping chunks |
| File | Purpose | Key Functionality |
|---|---|---|
| `embedder.py` | Embedding Generation | Creates vector representations using OpenAI's embedding models |
| `vector_store.py` | Vector Database | FAISS-based storage and similarity search for document chunks |
| `retriever.py` | Document Retrieval | Finds the most relevant document chunks for user queries |
| `generator.py` | Response Generation | Uses GPT models to generate natural language responses |
| File | Purpose | Key Functionality |
|---|---|---|
| `api.py` | REST API Backend | FastAPI server with endpoints for ingestion, querying, and management |
| `streamlit_app.py` | Web Interface | Interactive Streamlit application for document upload and querying |
| `chunk_viewer.py` | Data Visualization | Tools for viewing, searching, and analyzing processed chunks |
| File | Purpose | Key Functionality |
|---|---|---|
| `test_system.py` | System Testing | Comprehensive test suite to verify all components work correctly |
| `requirements.txt` | Dependencies | Lists all required Python packages and system dependencies |
| File | Purpose | Key Functionality |
|---|---|---|
| `.env` | Environment Variables | Stores API keys and configuration (create this file) |
| `start_api.sh` | API Startup Script | Starts the FastAPI backend server |
| `start_streamlit.sh` | UI Startup Script | Starts the Streamlit web interface |
| `start_both.sh` | Combined Startup | Starts both the API and UI simultaneously |
```mermaid
graph TD
    A[PDF Upload] --> B[PDF Extractor]
    B --> C[Text Pages]
    B --> D[Image Pages]
    C --> E[Text Chunker]
    D --> F[Vision Processor]
    F --> G[GPT-4 Vision Analysis]
    G --> H[Analyzed Visual Content]
    E --> I[Text Chunks]
    H --> I
    I --> J[Embedder]
    J --> K[OpenAI Embeddings]
    K --> L[Embedded Chunks]
    L --> M[Vector Store]
    M --> N[FAISS Index]
    N --> O[Persistent Storage]
    P[User Query] --> Q[Retriever]
    Q --> R[Query Embedding]
    R --> S[Similarity Search]
    S --> T[Relevant Chunks]
    T --> U[Context Formatter]
    U --> V[Generator]
    V --> W[GPT Response]
    W --> X[Final Answer]
    O -.-> Q
```
1. Document Ingestion:
   - PDF uploaded through API or command line
   - Text extracted using PyPDF2
   - Pages converted to high-resolution images
   - Images analyzed using GPT-4 Vision API
   - Both text and visual content chunked intelligently

2. Vector Processing:
   - Text chunks converted to embeddings using OpenAI's `text-embedding-3-small`
   - Embeddings stored in a FAISS index for fast similarity search
   - Metadata preserved for source attribution

3. Query Processing:
   - User query embedded using the same embedding model
   - FAISS searches for the most similar document chunks
   - Context assembled from retrieved chunks
   - GPT-4 generates a natural language response

4. Multi-Modal Support:
   - Visual queries detected and prioritized
   - Vision-analyzed content included in retrieval
   - Mathematical formulas and diagrams processed
   - Tables and charts converted to searchable text
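The "chunked intelligently" step above refers to overlapping, sentence-boundary-aware splitting. The sketch below is a minimal illustration of that idea, not the actual `chunker.py`; the size and overlap values simply mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults shown in the configuration section further down.

```python
# Minimal sketch of overlapping, sentence-boundary-aware chunking.
# Illustrative only; not the actual chunker.py implementation.
import re

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current.strip())
            # Seed the next chunk with the tail of the previous one for overlap.
            current = current[-overlap:]
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Example: ~6000 characters of text yields a handful of ~1000-character chunks.
print(len(chunk_text("This is a sentence. " * 300)))
```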
```bash
# System dependencies (macOS/Linux)
# Install poppler for PDF processing
brew install poppler tesseract

# Install Python dependencies
pip install -r requirements.txt
```

1. Create Environment File:

   ```bash
   cp .env.example .env
   ```

2. Add Your API Key:

   ```bash
   # Edit .env file
   OPENAI_API_KEY=sk-your-openai-api-key-here
   ```
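`config.py` reads the key with `os.getenv`, which works if the variable is exported in your shell or loaded from `.env` at startup. A minimal sketch of the latter, assuming python-dotenv is available (whether this repo actually uses python-dotenv is an assumption):

```python
# Sketch of loading .env before reading config values.
# Assumes python-dotenv is installed; the repo may load .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into the environment
print("Key present:", bool(os.getenv("OPENAI_API_KEY")))
```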
```bash
# Ingest a PDF
python main.py --ingest document.pdf

# Query interactively
python main.py --interactive

# Single query
python main.py --query "What is this document about?"
```

```bash
# Start both API and Streamlit
./start_both.sh

# Or start individually:
python api.py                    # API on http://localhost:8000
streamlit run streamlit_app.py   # UI on http://localhost:8501
```

```bash
python api.py
# Access at http://localhost:8000
# Interactive docs at http://localhost:8000/docs
```

```bash
# Ingest without vision analysis (faster)
python main.py --ingest research_paper.pdf --no-vision

# Query with custom result count
python main.py --query "What are the main findings?" --top-k 10

# Interactive mode for multiple queries
python main.py --interactive
```

```python
import requests
# Ingest a PDF
with open('document.pdf', 'rb') as f:
response = requests.post('http://localhost:8000/ingest',
files={'file': f},
params={'use_vision': True})
# Query the system
query_response = requests.post('http://localhost:8000/query',
json={'question': 'What is the methodology?',
'top_k': 5})
result = query_response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")- Document Upload: Drag-and-drop PDF files
- Vision Toggle: Enable/disable visual content analysis
- Query Interface: Natural language questions
- Source Display: Shows which pages and content types were used
- Query History: Track previous questions and answers
- System Settings: Monitor configuration and reset system
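A stripped-down version of such a front end might look like the sketch below. It is illustrative only (the real `streamlit_app.py` has more features); the endpoint URLs and payloads simply mirror the API examples above.

```python
# Illustrative mini front end, not the actual streamlit_app.py.
# Assumes the FastAPI backend from api.py is running on localhost:8000.
import requests
import streamlit as st

st.title("PDF RAG Demo")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
use_vision = st.checkbox("Analyze visual content (tables, charts)", value=True)

if uploaded and st.button("Ingest"):
    resp = requests.post(
        "http://localhost:8000/ingest",
        files={"file": (uploaded.name, uploaded.getvalue(), "application/pdf")},
        params={"use_vision": use_vision},
    )
    st.write(resp.json())

question = st.text_input("Ask a question about the document")
if question and st.button("Query"):
    resp = requests.post(
        "http://localhost:8000/query",
        json={"question": question, "top_k": 5},
    )
    result = resp.json()
    st.markdown(result.get("answer", ""))
    st.caption(f"Sources: {result.get('sources', [])}")
```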
All configuration is centralized in `config.py`:

```python
class Config:
    # OpenAI Configuration
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    EMBEDDING_MODEL = "text-embedding-3-small"  # 1536 dimensions
    CHAT_MODEL = "gpt-4o-mini"                  # Fast responses
    VISION_MODEL = "gpt-4o-mini"                # Image analysis

    # RAG Configuration
    CHUNK_SIZE = 1000       # Characters per chunk
    CHUNK_OVERLAP = 200     # Overlap between chunks
    TOP_K_RESULTS = 5       # Default results to retrieve

    # Processing Settings
    DPI = 300                        # PDF to image resolution
    TEMP_DIR = "temp_processing"     # Temporary file location
```

Run the comprehensive test suite:
```bash
python test_system.py
```

This tests:
- API connectivity and status
- All REST endpoints functionality
- Chunk processing and retrieval
- Embedding generation and storage
- Query processing and response generation
- Mathematical Formulas: OCR and formula recognition
- Data Visualization: Chart and graph analysis
- Table Extraction: Structured data from tabular content
- Diagram Understanding: Flow charts and process diagrams
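As a rough illustration of how a rendered page image might be sent to the vision model: the sketch below uses the OpenAI Python SDK's image-input message format, but the exact prompt, file handling, and structure used by `vision_processor.py` are assumptions.

```python
# Rough sketch of a vision call on one rendered PDF page (illustrative;
# the actual prompt and flow in vision_processor.py may differ).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_page(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe any tables, charts, formulas, or diagrams "
                         "on this page as searchable text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# "page_5.png" is a hypothetical file name for a page rendered at 300 DPI.
print(describe_page("page_5.png"))
```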
- Page-Specific Queries: "What is on page 5?"
- Visual Queries: "What does the chart show?"
- Context-Aware: Prioritizes relevant content types
- Similarity Scoring: Confidence scores for retrieved chunks
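Under the hood, retrieval with similarity scores boils down to embedding the query and searching the FAISS index. The self-contained sketch below uses `text-embedding-3-small` and a normalized inner-product index; these are reasonable but assumed choices, and the real `embedder.py`/`vector_store.py`/`retriever.py` may differ in details.

```python
# Self-contained sketch of embed-and-search retrieval with similarity scores.
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype="float32")
    faiss.normalize_L2(vecs)  # normalized vectors: inner product == cosine similarity
    return vecs

chunks = [
    "Revenue grew 12% year over year.",
    "Figure 3 shows accuracy versus training epochs.",
    "The methodology section describes a survey of 500 users.",
]

index = faiss.IndexFlatIP(1536)  # text-embedding-3-small produces 1536 dimensions
index.add(embed(chunks))

query_vec = embed(["What does the chart show?"])
scores, ids = index.search(query_vec, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```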
- Error Handling: Comprehensive error management
- Logging: Detailed processing logs
- Cleanup: Automatic temporary file management
- Persistence: Save/load vector indices
- Concurrent Processing: Batch operations for efficiency
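Persistence likely amounts to writing the FAISS index plus its chunk metadata to disk. A minimal sketch, where the file names and metadata format are assumptions rather than the actual `vector_store.py` behavior:

```python
# Minimal persistence sketch (illustrative; file names and metadata format
# are assumptions, not the actual vector_store.py behavior).
import json
import faiss

def save(index: faiss.Index, metadata: list[dict], path: str = "index") -> None:
    faiss.write_index(index, f"{path}.faiss")
    with open(f"{path}.meta.json", "w") as f:
        json.dump(metadata, f)

def load(path: str = "index") -> tuple[faiss.Index, list[dict]]:
    index = faiss.read_index(f"{path}.faiss")
    with open(f"{path}.meta.json") as f:
        metadata = json.load(f)
    return index, metadata
```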
- `POST /ingest` - Upload and process PDF files
- `POST /query` - Ask questions about documents
- `GET /status` - System status and configuration
- `GET /chunks` - View processed document chunks
- `GET /chunks/{id}` - View specific chunk details
- `GET /chunks/statistics` - Chunk processing statistics
- `GET /chunks/search` - Search chunks by text content
- `GET /embeddings` - View embedding data and metadata
- `DELETE /reset` - Clear current index and reset system
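In addition to the ingestion and query calls shown earlier, the inspection endpoints can be exercised the same way. Their response payloads are not documented here, so this sketch just prints whatever JSON comes back:

```python
# Exercising the inspection endpoints; response shapes are not documented
# above, so the raw JSON is printed as-is.
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/status").json())             # system status and configuration
print(requests.get(f"{BASE}/chunks").json())             # processed document chunks
print(requests.get(f"{BASE}/chunks/statistics").json())  # chunk processing statistics
```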
1. API Key Missing:

   ```bash
   # Create .env file with your OpenAI API key
   echo "OPENAI_API_KEY=sk-your-key-here" > .env
   ```

2. PDF Processing Errors:

   ```bash
   # Install system dependencies
   brew install poppler tesseract                # macOS
   sudo apt install poppler-utils tesseract-ocr  # Ubuntu
   ```

3. Memory Issues:
   - Reduce chunk size in config
   - Process documents in batches
   - Use the `--no-vision` flag for faster processing

4. Vision Analysis Failing:
   - Check OpenAI API quotas
   - Verify image quality (300 DPI recommended)
   - Some complex diagrams may need manual review
- Embedding Generation: ~100 chunks/second
- Vision Processing: ~1-2 seconds per page
- Query Response: <2 seconds for typical queries
- Memory Usage: ~8MB per 1000 chunks stored
- Storage: FAISS index + metadata ~50% of original PDF size
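The memory figure is consistent with a back-of-envelope estimate for float32 embeddings of 1536 dimensions (the `text-embedding-3-small` size noted in the configuration): roughly 6 KB of raw vector data per chunk, with metadata accounting for the rest.

```python
# Back-of-envelope check of the ~8 MB per 1000 chunks figure,
# assuming float32 vectors of 1536 dimensions plus metadata overhead.
dims, bytes_per_float, chunks = 1536, 4, 1000
vector_mb = dims * bytes_per_float * chunks / 1e6
print(f"Raw vectors: {vector_mb:.1f} MB per {chunks} chunks")  # ~6.1 MB; metadata adds the rest
```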
- Multi-PDF indexing and cross-document search
- Custom embedding models (local inference)
- Advanced query routing (keyword vs semantic)
- Document summarization and key phrase extraction
- Batch processing for large document collections
- Integration with document management systems
This project is built for educational and research purposes. The architecture demonstrates core RAG principles and can be adapted for various document processing applications.
Built with ❤️ using Python, FastAPI, Streamlit, OpenAI API, and FAISS