A high-performance Multimodal Retrieval-Augmented Generation (RAG) system that combines visual document understanding with conversational AI. Built with HPC-ColPali for efficient visual embeddings and Janus-Pro for multimodal reasoning.
- Visual Document Understanding: Process PDFs with advanced visual comprehension
- HPC-ColPali Optimization: Hierarchical Patch Clustering for about 3x faster retrieval (120 ms down to 40 ms in the benchmark table)
- Multimodal Reasoning: Janus-Pro model for contextual Q&A
- Efficient Storage: Qdrant vector database with HNSW indexing
- High Accuracy: Multi-image context synthesis for comprehensive answers
flowchart TD
A[PDF Upload] --> B[Pages to Images]
B --> C[ColPali Patch Encoder]
C --> D{HPC Compression?}
D -->|Yes| E[K-Means Quantization\nK=256 centroids]
D -->|No| F[Mean Pooling]
E --> G[Attention-Guided Pruning\nkeep p=60%]
F --> H[Qdrant Vector Store]
G --> H
I[User Query] --> J[Query Patch Embedding]
J --> K[HNSW Search]
K --> L[Top-k Multi-Image Context]
L --> M[Janus-Pro LLM]
M --> N[Streaming Response]
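The compression branch of the flowchart can be sketched in plain Python. This is a minimal illustration, not the code in src/hpc_colpali.py: the function names (`quantize`, `prune_by_attention`, `mean_pool`, `compress_page`) are hypothetical, the codebook is assumed to be already trained, and per-patch attention weights are assumed to be available from the encoder.

```python
from typing import List, Sequence, Tuple, Union

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patches: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each patch embedding with the index of its nearest centroid.

    With K=256 centroids, every index fits in a single byte.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(patch, codebook[k]))
        for patch in patches
    ]


def prune_by_attention(
    codes: Sequence[int], attention: Sequence[float], keep_fraction: float = 0.6
) -> List[Tuple[int, int]]:
    """Keep the highest-attention patches, returned as (patch_index, code) pairs."""
    n_keep = max(1, round(len(codes) * keep_fraction))
    ranked = sorted(range(len(codes)), key=lambda i: attention[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original patch order
    return [(i, codes[i]) for i in kept]


def mean_pool(patches: Sequence[Vector]) -> List[float]:
    """Fallback branch: average all patch embeddings into one vector."""
    return [sum(column) / len(patches) for column in zip(*patches)]


def compress_page(
    patches: Sequence[Vector],
    attention: Sequence[float],
    codebook: Sequence[Vector],
    use_hpc: bool = True,
) -> Union[List[Tuple[int, int]], List[float]]:
    """Follow the flowchart: quantize then prune if HPC is on, else mean-pool."""
    if not use_hpc:
        return mean_pool(patches)
    return prune_by_attention(quantize(patches, codebook), attention)
```

Storing the patch index alongside each code is a design choice made here so that pruned pages can still be mapped back to image regions; the actual storage layout in this project may differ.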
This implementation is based on the HPC-ColPali research paper, "Hierarchical Patch Clustering for Efficient Visual Retrieval".
| Method | Latency (ms) | Memory (GB) | nDCG@5 |
|---|---|---|---|
| ColPali (float32) | 120 | 2.56 | 0.845 |
| HPC-ColPali | 40 | 0.80 | 0.847 |
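As a quick check on the table above, the relative gains can be computed directly from its figures. The snippet only restates the reported numbers; it does not reproduce the benchmark.

```python
# Figures copied from the benchmark table above.
baseline = {"latency_ms": 120, "memory_gb": 2.56, "ndcg_at_5": 0.845}
hpc = {"latency_ms": 40, "memory_gb": 0.80, "ndcg_at_5": 0.847}

latency_speedup = baseline["latency_ms"] / hpc["latency_ms"]
memory_reduction = baseline["memory_gb"] / hpc["memory_gb"]
ndcg_delta = hpc["ndcg_at_5"] - baseline["ndcg_at_5"]

print(f"Latency speedup:  {latency_speedup:.1f}x")
print(f"Memory reduction: {memory_reduction:.1f}x")
print(f"nDCG@5 change:    {ndcg_delta:+.3f}")
```

This gives a 3.0x latency speedup and a 3.2x memory reduction, with nDCG@5 essentially unchanged (+0.002).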
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Qdrant vector database
# Clone the repository
git clone https://github.com/yourusername/multimodal-rag-hpc-colpali.git
cd multimodal-rag-hpc-colpali
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Start Qdrant (using Docker)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

streamlit run streamlit_app.py

- Upload your PDF documents through the sidebar
- Wait for automatic processing and embedding generation
- HPC codebook is built automatically for optimal performance
- Ask questions about your uploaded documents
- Get contextual answers with multi-image reasoning
- Enjoy real-time streaming responses
Customize your setup via config.yaml.

The main components are:
| Component | Description | File |
|---|---|---|
| Embedder | HPC-ColPali visual embeddings | src/embedder.py |
| Vector Store | Qdrant integration with optimization | src/vector_store.py |
| RAG Pipeline | End-to-end retrieval and generation | src/rag_pipeline.py |
| HPC Module | Hierarchical patch clustering | src/hpc_colpali.py |
| Utilities | Helper functions and optimizations | src/utils.py |
- Streamlit App: streamlit_app.py, an interactive web interface with document upload and chat
- Image Preprocessing: PDF pages → PIL Images
- Patch Extraction: Vision transformer patch encoding
- Clustering: K-means quantization of patch embeddings against a codebook of K=256 centroids
- Pruning: Attention-guided patch selection that keeps p=60% of patches
- Storage: Compressed vectors in Qdrant
- Retrieval: HNSW similarity search with decompression
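The clustering and decompression steps above can be illustrated with a small, dependency-free K-means sketch. It uses plain Lloyd iterations on toy data; the names `build_codebook`, `encode`, and `decode` are hypothetical, and the project itself presumably relies on an optimized library routine and K=256 rather than this loop.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_distance(vec, centroids[i]))


def build_codebook(
    patches: Sequence[Vector], k: int, iterations: int = 20, seed: int = 0
) -> List[List[float]]:
    """Train k centroids with Lloyd's algorithm, seeded from random patches."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(patches), k)]
    for _ in range(iterations):
        buckets: List[List[Vector]] = [[] for _ in range(k)]
        for patch in patches:
            buckets[nearest_centroid(patch, centroids)].append(patch)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def encode(patches: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Compress patches to centroid indices (the form stored in the vector DB)."""
    return [nearest_centroid(p, centroids) for p in patches]


def decode(codes: Sequence[int], centroids: Sequence[Vector]) -> List[Vector]:
    """Decompress at retrieval time by looking each code up in the codebook."""
    return [centroids[c] for c in codes]


if __name__ == "__main__":
    toy_patches = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    codebook = build_codebook(toy_patches, k=2)
    codes = encode(toy_patches, codebook)
    print(codes, decode(codes, codebook))
```

Decoding is lossy: each patch is reconstructed as its centroid, so retrieval quality depends on how well the codebook covers the embedding space.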
- Batch Processing: Optimized batch sizes for GPU memory
- Memory Management: Automatic garbage collection and cache clearing
- Mixed Precision: BFloat16 for faster inference
- Streaming: Real-time response generation
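The batch processing and memory management points above might look roughly like the following. This is a sketch under assumptions: `embed_fn` stands in for whatever callable turns a batch of page images into embeddings, and the default batch size of 4 is a placeholder, not a value taken from this project.

```python
import gc
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
E = TypeVar("E")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    pages: Sequence[T],
    embed_fn: Callable[[List[T]], List[E]],
    batch_size: int = 4,
) -> List[E]:
    """Embed pages batch by batch so only one batch is in flight at a time."""
    embeddings: List[E] = []
    for batch in batched(pages, batch_size):
        embeddings.extend(embed_fn(batch))
        # Release Python-side garbage between batches. On a CUDA setup one
        # would typically also call torch.cuda.empty_cache() here.
        gc.collect()
    return embeddings
```

A smaller batch size lowers peak GPU memory at the cost of throughput, so the right value depends on the GPU and page resolution.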
- HPC-ColPali Research: Hierarchical Patch Clustering Paper
- ColPali Framework: Visual document understanding foundation
- Janus-Pro Model: Multimodal reasoning capabilities
- Qdrant: High-performance vector database