PDF-based RAG System Demo

A proof-of-concept implementation of Retrieval-Augmented Generation (RAG) focused on processing PDF documents with both text and image extraction capabilities.

Overview

This project demonstrates a complete RAG pipeline that:

Ingests PDF documents extracting both text and images
Chunks text content for optimal retrieval
Generates vector embeddings for semantic search
Extracts and processes images from PDFs
Captions images using Vision Language Models (VLM)
Stores all data in PostgreSQL for efficient retrieval

Features

PDF Processing: Extract text and images from PDF documents using pdfplumber and PyMuPDF
Semantic Chunking: Break text into meaningful chunks for better retrieval
Vector Embeddings: Generate embeddings for text and image captions
Image Understanding: Caption images using OpenAI's vision models
OCR Integration: Extract text from images when available
Web Interface: Browse and search through ingested documents

Project Structure

.
├── apps/
│   ├── rag/                     # Python backend for RAG processing
│   │   ├── src/rag/             # Core RAG implementation
│   │   │   ├── ingest_pdf.py    # PDF processing pipeline
│   │   │   ├── db.py            # Database connection handling
│   │   │   ├── embeddings.py    # Vector embedding generation
│   │   │   ├── chunking.py      # Text chunking algorithms
│   │   │   └── caption_vlm.py   # Vision model integration
│   │   ├── samples/             # Sample documents and generated thumbnails
│   │   │   └── images/          # Extracted image thumbnails
│   │   ├── tests/               # Test suite
│   │   └── .env.example         # Environment configuration template
│   └── web/                     # Frontend application
├── infra/                       # Infrastructure configuration
├── docs/                        # Documentation
└── Makefile                     # Build and deployment scripts

Setup

Prerequisites

Python 3.12+
PostgreSQL database
OpenAI API key

Configuration

Copy the example environment file and configure your settings:

cp apps/rag/.env.example apps/rag/.env

Required environment variables:

DATABASE_URL=postgresql://rag:ragpw@localhost:5433/ragdb
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
OPENAI_EMBED_MODEL=text-embedding-3-small
VLM_MODEL=gpt-4o

Database Setup

The project includes a Docker Compose setup for PostgreSQL with pgvector extension:

# Start the PostgreSQL database with vector extension support
cd infra
docker-compose up -d

This will:

Start a PostgreSQL 16 instance with pgvector extension at port 5433
Create the required database schema via initialization scripts
Launch Adminer database UI available at http://localhost:8080

The database schema includes tables for:

documents - Document metadata
chunks - Text chunks with embeddings
figures - Extracted images with captions and embeddings

Usage

Ingesting a PDF

cd apps/rag
python -m rag.ingest_pdf path/to/your/document.pdf

This will:

Extract text from the PDF and split into chunks
Generate vector embeddings for each chunk
Extract images from the PDF
Generate captions and OCR for images using VLM
Store all information in the database

Web Interface

The web interface allows browsing and searching through ingested documents.

cd apps/web
npm run dev

How It Works

PDF Ingestion Process

Document Registration: Each PDF is registered with a unique ID
Text Extraction: Text is extracted and chunked for optimal retrieval
Image Extraction: Images are extracted directly from the PDF
Image Understanding: Images are processed by a Vision Language Model to:
- Generate short captions
- Generate detailed descriptions
- Extract text visible in the image (OCR)
- Identify entities and generate tags
Vector Storage: All text and image captions are converted to vector embeddings

Retrieval Process

Queries are processed through a similar pipeline:

Convert query to vector embedding
Find similar content (text chunks or images) using vector similarity
Return relevant information with proper context

Technical Details

Embedding Model: OpenAI's text-embedding-3-small
Vision Model: OpenAI's GPT-4o for image understanding
PDF Processing: Combination of pdfplumber (text) and PyMuPDF (images)
Database: PostgreSQL with vector extensions

Contributing

This is a proof-of-concept demonstration. Contributions are welcome to improve:

Performance optimizations
Support for additional document types
Enhanced chunking strategies
Improved image understanding

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
apps		apps
docs		docs
infra		infra
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

PDF-based RAG System Demo

Overview

Features

Project Structure

Setup

Prerequisites

Configuration

Database Setup

Usage

Ingesting a PDF

Web Interface

How It Works

PDF Ingestion Process

Retrieval Process

Technical Details

Contributing

License

About

Uh oh!

Releases

Packages

Languages

Uh oh!

Uh oh!

go-fireball/rag-multimodal-demo

Folders and files

Latest commit

History

Repository files navigation

PDF-based RAG System Demo

Overview

Features

Project Structure

Setup

Prerequisites

Configuration

Database Setup

Usage

Ingesting a PDF

Web Interface

How It Works

PDF Ingestion Process

Retrieval Process

Technical Details

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages