Tabular-RAG System 📚

A custom-built Retrieval-Augmented Generation (RAG) system designed specifically for processing and querying PDF documents with advanced support for visual content like tables, charts, and mathematical formulas. Built entirely from scratch without using frameworks like LangChain, this system demonstrates core RAG principles and advanced vision processing capabilities.

✨ Key Features

  • Framework-Free Architecture: Built from the ground up using only core libraries
  • Vision-Enhanced Processing: GPT-4 Vision API integration for analyzing charts, tables, and diagrams
  • Intelligent Text Chunking: Smart overlapping chunking with sentence boundary detection
  • FAISS Vector Search: High-performance similarity search for document retrieval
  • Multi-Modal Queries: Handles both text and visual content queries
  • RESTful API: Complete FastAPI backend with interactive documentation
  • Web Interface: Modern Streamlit UI for easy document interaction
  • Comprehensive Testing: Built-in test suite to verify system functionality

๐Ÿ—๏ธ Architecture Overview

The system follows a modular architecture with clear separation of concerns:

PDF Input → Text Extraction → Vision Analysis → Chunking → Embedding → Vector Storage → Retrieval → Generation → Response

Core Components

  1. PDF Processing Pipeline: Extracts both text and visual content from PDFs
  2. Vision Processing: Uses GPT-4 Vision to analyze images, tables, and charts
  3. Text Chunking: Intelligently splits documents into overlapping chunks
  4. Embedding Generation: Creates vector representations using OpenAI embeddings
  5. Vector Storage: FAISS-based high-performance similarity search
  6. Query Processing: Retrieves relevant chunks and generates responses
  7. API Layer: RESTful endpoints for programmatic access
  8. Web Interface: User-friendly Streamlit application
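
The chunking step (3) can be sketched as follows. This is a minimal illustration of overlapping chunking with sentence-boundary snapping, not the actual `chunker.py` implementation; the default sizes mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` values from `config.py`.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Snap the cut to the last sentence boundary inside the window,
            # so chunks end on complete sentences where possible.
            boundary = text.rfind(". ", start, end)
            if boundary > start:
                end = boundary + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create the overlap
    return chunks
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence that straddles a cut is still fully contained in at least one chunk.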

๐Ÿ“ Project Structure

Core System Files

| File | Purpose | Key Functionality |
| --- | --- | --- |
| config.py | Configuration Management | Centralizes all system settings, API keys, and parameters |
| main.py | Command Line Interface | Entry point for direct PDF ingestion and querying |
| rag_pipeline.py | Main Orchestration | Coordinates the entire RAG workflow from ingestion to response |

Document Processing

| File | Purpose | Key Functionality |
| --- | --- | --- |
| pdf_extractor.py | PDF Content Extraction | Extracts text and converts pages to images for vision processing |
| vision_processor.py | Visual Content Analysis | Uses GPT-4 Vision API to analyze charts, tables, and diagrams |
| chunker.py | Text Chunking | Intelligently splits documents into overlapping chunks |

AI & Vector Processing

| File | Purpose | Key Functionality |
| --- | --- | --- |
| embedder.py | Embedding Generation | Creates vector representations using OpenAI's embedding models |
| vector_store.py | Vector Database | FAISS-based storage and similarity search for document chunks |
| retriever.py | Document Retrieval | Finds the most relevant document chunks for user queries |
| generator.py | Response Generation | Uses GPT models to generate natural language responses |

API & Web Interface

| File | Purpose | Key Functionality |
| --- | --- | --- |
| api.py | REST API Backend | FastAPI server with endpoints for ingestion, querying, and management |
| streamlit_app.py | Web Interface | Interactive Streamlit application for document upload and querying |
| chunk_viewer.py | Data Visualization | Tools for viewing, searching, and analyzing processed chunks |

Utility & Testing

| File | Purpose | Key Functionality |
| --- | --- | --- |
| test_system.py | System Testing | Comprehensive test suite to verify all components work correctly |
| requirements.txt | Dependencies | Lists all required Python packages and system dependencies |

Configuration & Scripts

| File | Purpose | Key Functionality |
| --- | --- | --- |
| .env | Environment Variables | Stores API keys and configuration (create this file) |
| start_api.sh | API Startup Script | Starts the FastAPI backend server |
| start_streamlit.sh | UI Startup Script | Starts the Streamlit web interface |
| start_both.sh | Combined Startup | Starts both API and UI simultaneously |

🔄 System Flow Diagram

graph TD
    A[PDF Upload] --> B[PDF Extractor]
    B --> C[Text Pages]
    B --> D[Image Pages]

    C --> E[Text Chunker]
    D --> F[Vision Processor]

    F --> G[GPT-4 Vision Analysis]
    G --> H[Analyzed Visual Content]

    E --> I[Text Chunks]
    H --> I

    I --> J[Embedder]
    J --> K[OpenAI Embeddings]
    K --> L[Embedded Chunks]

    L --> M[Vector Store]
    M --> N[FAISS Index]
    N --> O[Persistent Storage]

    P[User Query] --> Q[Retriever]
    Q --> R[Query Embedding]
    R --> S[Similarity Search]
    S --> T[Relevant Chunks]

    T --> U[Context Formatter]
    U --> V[Generator]
    V --> W[GPT Response]
    W --> X[Final Answer]

    O -.-> Q

Detailed Flow Explanation

  1. Document Ingestion:

    • PDF uploaded through API or command line
    • Text extracted using PyPDF2
    • Pages converted to high-resolution images
    • Images analyzed using GPT-4 Vision API
    • Both text and visual content chunked intelligently
  2. Vector Processing:

    • Text chunks converted to embeddings using OpenAI's text-embedding-3-small
    • Embeddings stored in FAISS index for fast similarity search
    • Metadata preserved for source attribution
  3. Query Processing:

    • User query embedded using same embedding model
    • FAISS searches for most similar document chunks
    • Context assembled from retrieved chunks
    • GPT-4 generates natural language response
  4. Multi-Modal Support:

    • Visual queries detected and prioritized
    • Vision-analyzed content included in retrieval
    • Mathematical formulas and diagrams processed
    • Tables and charts converted to searchable text
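
The embed, store, and search loop at the heart of steps 2 and 3 can be shown offline with a numpy-only sketch. Random unit vectors stand in for `text-embedding-3-small` embeddings (1536 dimensions), and a plain inner product replicates what a FAISS flat index computes; the real pipeline's index type and scoring are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1536  # text-embedding-3-small dimensionality

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# "Embed" five chunks; the query is a slightly perturbed copy of chunk 2,
# standing in for a question about that chunk's content.
chunk_vecs = normalize(rng.standard_normal((5, dim)))
query_vec = normalize(chunk_vecs[2] + 0.01 * rng.standard_normal(dim))

# On unit vectors, cosine similarity equals the inner product, which is
# what a FAISS IndexFlatIP search returns.
scores = chunk_vecs @ query_vec
top_k = np.argsort(-scores)[:3]  # indices of the 3 most similar chunks
```

Swapping the random vectors for real embeddings and the matrix product for a FAISS index search changes the performance, not the logic.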

🚀 Installation & Setup

Prerequisites

# System dependencies: poppler for PDF rendering, tesseract for OCR
brew install poppler tesseract                # macOS
sudo apt install poppler-utils tesseract-ocr  # Ubuntu/Debian

# Install Python dependencies
pip install -r requirements.txt

Configuration

  1. Create Environment File:

    cp .env.example .env
  2. Add Your API Key:

    # Edit .env file
    OPENAI_API_KEY=sk-your-openai-api-key-here

Running the System

Option 1: Command Line (Direct Usage)

# Ingest a PDF
python main.py --ingest document.pdf

# Query interactively
python main.py --interactive

# Single query
python main.py --query "What is this document about?"

Option 2: Web Interface (Recommended)

# Start both API and Streamlit
./start_both.sh

# Or start individually:
python api.py                    # API on http://localhost:8000
streamlit run streamlit_app.py   # UI on http://localhost:8501

Option 3: API Only

python api.py
# Access at http://localhost:8000
# Interactive docs at http://localhost:8000/docs

📖 Usage Examples

Command Line Interface

# Basic ingestion with vision analysis (the default)
python main.py --ingest research_paper.pdf

# Faster ingestion, skipping vision analysis
python main.py --ingest research_paper.pdf --no-vision

# Query with custom result count
python main.py --query "What are the main findings?" --top-k 10

# Interactive mode for multiple queries
python main.py --interactive

REST API Usage

import requests

# Ingest a PDF
with open('document.pdf', 'rb') as f:
    response = requests.post('http://localhost:8000/ingest',
                           files={'file': f},
                           params={'use_vision': True})

# Query the system
query_response = requests.post('http://localhost:8000/query',
                              json={'question': 'What is the methodology?',
                                    'top_k': 5})

result = query_response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

Web Interface Features

  • Document Upload: Drag-and-drop PDF files
  • Vision Toggle: Enable/disable visual content analysis
  • Query Interface: Natural language questions
  • Source Display: Shows which pages and content types were used
  • Query History: Track previous questions and answers
  • System Settings: Monitor configuration and reset system

🔧 Configuration Options

All configuration is centralized in config.py:

class Config:
    # OpenAI Configuration
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    EMBEDDING_MODEL = "text-embedding-3-small"  # 1536 dimensions
    CHAT_MODEL = "gpt-4o-mini"                  # Fast responses
    VISION_MODEL = "gpt-4o-mini"                # Image analysis

    # RAG Configuration
    CHUNK_SIZE = 1000              # Characters per chunk
    CHUNK_OVERLAP = 200            # Overlap between chunks
    TOP_K_RESULTS = 5              # Default results to retrieve

    # Processing Settings
    DPI = 300                      # PDF to image resolution
    TEMP_DIR = "temp_processing"   # Temporary file location

🧪 Testing

Run the comprehensive test suite:

python test_system.py

This tests:

  • API connectivity and status
  • All REST endpoints functionality
  • Chunk processing and retrieval
  • Embedding generation and storage
  • Query processing and response generation

🎯 Advanced Features

Vision Processing

  • Mathematical Formulas: OCR and formula recognition
  • Data Visualization: Chart and graph analysis
  • Table Extraction: Structured data from tabular content
  • Diagram Understanding: Flow charts and process diagrams
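
A rendered PDF page is typically shipped to a GPT-4 Vision style model as a base64 data URL inside a chat message. The sketch below shows that packaging using OpenAI's `image_url` content-part convention; the prompt text and function name are illustrative, and `vision_processor.py` may structure its request differently.

```python
import base64

def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    """Package a rendered PDF page for a vision chat request.

    Follows the OpenAI chat.completions image_url content-part shape;
    the prompt is a hypothetical example.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }

msg = build_vision_message(b"\x89PNG...", "Describe any tables, charts, or formulas on this page.")
```

The returned dict goes into the `messages` list of a chat completion call; the model's text reply is then chunked and embedded alongside the extracted page text.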

Smart Retrieval

  • Page-Specific Queries: "What is on page 5?"
  • Visual Queries: "What does the chart show?"
  • Context-Aware: Prioritizes relevant content types
  • Similarity Scoring: Confidence scores for retrieved chunks

Production Ready

  • Error Handling: Comprehensive error management
  • Logging: Detailed processing logs
  • Cleanup: Automatic temporary file management
  • Persistence: Save/load vector indices
  • Concurrent Processing: Batch operations for efficiency

๐Ÿค API Endpoints

Core Endpoints

  • POST /ingest - Upload and process PDF files
  • POST /query - Ask questions about documents
  • GET /status - System status and configuration

Data Management

  • GET /chunks - View processed document chunks
  • GET /chunks/{id} - View specific chunk details
  • GET /chunks/statistics - Chunk processing statistics
  • GET /chunks/search - Search chunks by text content
  • GET /embeddings - View embedding data and metadata

System Control

  • DELETE /reset - Clear current index and reset system

๐Ÿ” Troubleshooting

Common Issues

  1. API Key Missing:

    # Create .env file with your OpenAI API key
    echo "OPENAI_API_KEY=sk-your-key-here" > .env
  2. PDF Processing Errors:

    # Install system dependencies
    brew install poppler tesseract  # macOS
    sudo apt install poppler-utils tesseract-ocr  # Ubuntu
  3. Memory Issues:

    • Reduce chunk size in config
    • Process documents in batches
    • Use --no-vision flag for faster processing
  4. Vision Analysis Failing:

    • Check OpenAI API quotas
    • Verify image quality (300 DPI recommended)
    • Some complex diagrams may need manual review

📈 Performance Considerations

  • Embedding Generation: ~100 chunks/second
  • Vision Processing: ~1-2 seconds per page
  • Query Response: <2 seconds for typical queries
  • Memory Usage: ~8MB per 1000 chunks stored
  • Storage: FAISS index + metadata ~50% of original PDF size
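
A quick arithmetic check of the memory figure above, assuming the embeddings are stored as float32 (the usual FAISS default):

```python
# text-embedding-3-small produces 1536-dimensional vectors; FAISS flat
# indices store them as float32 (4 bytes per dimension).
dims = 1536
bytes_per_float = 4
vector_bytes = dims * bytes_per_float     # 6144 bytes per chunk
index_mb = 1000 * vector_bytes / 1024**2  # ~5.9 MB per 1000 chunks
```

Raw vectors alone account for roughly 6 MB per 1000 chunks; the remainder of the quoted ~8 MB would be chunk text and metadata kept alongside the index.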

🔮 Future Enhancements

  • Multi-PDF indexing and cross-document search
  • Custom embedding models (local inference)
  • Advanced query routing (keyword vs semantic)
  • Document summarization and key phrase extraction
  • Batch processing for large document collections
  • Integration with document management systems

๐Ÿ“ License

This project is built for educational and research purposes. The architecture demonstrates core RAG principles and can be adapted for various document processing applications.


Built with ❤️ using Python, FastAPI, Streamlit, OpenAI API, and FAISS
