selvatharrun/Multimodal-RAG-Application

🚀 Multimodal RAG Application with FastAPI & Streamlit


A Multimodal Retrieval-Augmented Generation (RAG) application that combines a FastAPI backend with a Streamlit frontend. It supports multiple AI models, several OCR engines, and document processing for chat and knowledge-article generation.

✨ Features

🤖 Multi-Model AI Support

  • Azure OpenAI (GPT-4, GPT-3.5-turbo)
  • Google Gemini (gemini-1.5-flash, gemini-pro)
  • Claude (claude-3-sonnet via AWS Bedrock)
  • Qwen & Nvidia (Local models via Ollama)

πŸ–ΌοΈ Advanced OCR & Vision

  • Tesseract OCR (Free, local processing)
  • Florence-2 (Microsoft's vision model)
  • Google Vision API (Cloud-based accuracy)
  • OpenAI GPT-4 Vision (Intelligent understanding)
  • Claude Vision (Advanced document analysis)

πŸ” Intelligent Search

  • BM25 (Traditional keyword search)
  • Qdrant Embeddings (Semantic vector search)
  • Reciprocal Rank Fusion (Hybrid approach)
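
Reciprocal Rank Fusion, the hybrid step listed above, can be sketched in a few lines. This is a minimal illustration of the general technique, not the repository's own code: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists: one from BM25, one from vector search.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, so a keyword match and a semantic match reinforce each other without any score normalization.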

📱 User Interface

  • Streamlit Web App with multiple pages
  • Document Chat interface
  • Image Analysis powered by Florence-2
  • Knowledge Object generation
  • Real-time OCR processing

πŸ—οΈ Architecture

┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
│   Streamlit UI      │ ── │   FastAPI Backend    │ ── │   AI Models & OCR   │
│  (Port 8501)        │    │  (Ports 8000/8001)   │    │  (External APIs)    │
└─────────────────────┘    └──────────────────────┘    └─────────────────────┘

  • Port 8000: Main API for Knowledge Object generation
  • Port 8001: Chat API for document interaction and OCR
  • Port 8501: Streamlit web interface

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Windows OS (current configuration)
  • Git

1. Clone and Setup

git clone https://github.com/selvatharrun/Multimodal-RAG-Application.git
cd Multimodal-RAG-Application

# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

2. Install Tesseract OCR

# Using Windows Package Manager (Recommended)
winget install --id UB-Mannheim.TesseractOCR

# Or download from: https://github.com/UB-Mannheim/tesseract/releases

3. Configure API Keys

# Copy template and add your API keys
cp florence2/config.properties.template florence2/config.properties
# Edit config.properties with your actual API keys
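
Before starting the servers, it can help to confirm that the keys were actually filled in. The sketch below parses simple `key=value` properties text and reports empty or missing entries. The key names are hypothetical placeholders (use the names from config.properties.template), and the parser ignores line continuations and other advanced .properties features.

```python
def load_properties(text):
    """Parse simple `key=value` lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not props.get(key)]


# Hypothetical key names, for illustration only.
sample = """
# Azure OpenAI
AZURE_OPENAI_API_KEY=abc123
AZURE_OPENAI_ENDPOINT=
"""
props = load_properties(sample)
required = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "GOOGLE_API_KEY"]
print(missing_keys(props, required))
```

To check the real file, pass `open("florence2/config.properties").read()` to `load_properties` instead of the sample string.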

4. Run the Application

# Quick start (all servers)
.\run_streamlit.bat

# Or manually:
# Terminal 1: python florence2/main.py
# Terminal 2: python florence2/chatapi.py  
# Terminal 3: streamlit run florence2/mainpage.py

5. Access the Application

With the servers running locally (the host localhost is an assumption for a default local run; the ports are those listed under Architecture):

  • Streamlit web interface: http://localhost:8501
  • Main API: http://localhost:8000
  • Chat API: http://localhost:8001

📖 Documentation

| Document | Description |
|----------|-------------|
| COMPLETE_DOCUMENTATION.md | 📚 Complete setup, API docs, troubleshooting |
| TESSERACT_SETUP.md | 🔧 Tesseract OCR installation guide |
| config.properties.template | ⚙️ Configuration template for API keys |

🔑 API Key Setup

Azure OpenAI

  1. Visit Azure Portal
  2. Create/access Azure OpenAI resource
  3. Copy Key, Endpoint, and Deployment Name

Google Gemini

  1. Go to Google AI Studio
  2. Create new API key

AWS Claude

  1. Access AWS Console
  2. Create IAM access keys
  3. Enable Bedrock service

📡 API Endpoints

Main API (Port 8000)

  • POST /upload-file/ - Generate Knowledge Objects from documents

Chat API (Port 8001)

  • POST /extract_text/ - Extract text using various OCR methods
  • POST /search_and_respond/ - Chat with documents using RAG

🎯 Use Cases

  • Document Analysis: Extract insights from PDFs, DOCX, PPTX
  • Knowledge Management: Generate structured knowledge articles
  • Visual Understanding: Analyze images and charts with AI
  • Interactive Chat: Q&A with document content
  • Multi-format Processing: Handle text, images, and mixed content

πŸ› οΈ Technology Stack

  • Backend: FastAPI, Python 3.11+
  • Frontend: Streamlit
  • AI/ML: LangChain, Transformers, PyTorch
  • OCR: Tesseract, Florence-2, Cloud APIs
  • Search: Qdrant, BM25S, Embeddings
  • Models: Azure OpenAI, Google Gemini, Claude, Qwen

πŸ› Troubleshooting

Common Issues

  1. Connection Errors: Ensure all three servers are running on the correct ports (8000, 8001, 8501)
  2. API Key Errors: Verify keys in config.properties
  3. Import Errors: Check virtual environment and dependencies
  4. Tesseract Errors: Verify installation and path configuration

See COMPLETE_DOCUMENTATION.md for detailed troubleshooting.
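
The connection and Tesseract checks can be scripted. The sketch below tests whether each documented port accepts TCP connections and whether a `tesseract` executable is on PATH. The host 127.0.0.1 is an assumption for a local run, and a failed PATH lookup does not prove Tesseract is missing, since the application may be configured with an explicit path.

```python
import shutil
import socket

# Ports as documented in the Architecture section.
PORTS = {"Main API": 8000, "Chat API": 8001, "Streamlit UI": 8501}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(host="127.0.0.1"):
    """Collect a name -> bool report for the servers and Tesseract."""
    report = {name: port_is_open(host, port) for name, port in PORTS.items()}
    report["Tesseract on PATH"] = shutil.which("tesseract") is not None
    return report


if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(f"{name}: {'OK' if ok else 'NOT FOUND'}")
```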

πŸ“ Project Structure

project-root/
├── florence2/                    # Main application
│   ├── main.py                   # FastAPI server (8000)
│   ├── chatapi.py                # FastAPI server (8001)
│   ├── mainpage.py               # Streamlit main page
│   ├── config.properties         # API configuration
│   ├── pages/                    # Streamlit pages
│   └── API/                      # Backend modules
├── venv/                         # Virtual environment
├── requirements.txt              # Dependencies
├── COMPLETE_DOCUMENTATION.md     # Full documentation
└── README.md                     # This file

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Microsoft Florence-2 for vision capabilities
  • OpenAI for language models
  • Google for Gemini models
  • Anthropic for Claude
  • Tesseract for OCR functionality
  • LangChain for RAG framework

📞 Support

  • 📖 Check COMPLETE_DOCUMENTATION.md for detailed guides
  • 🐛 Report issues on GitHub
  • 💬 Join discussions in the repository

🌟 Star this repository if you find it useful! 🌟
