A specialized Q&A AI agent for large documents (800k+ words), built on Retrieval-Augmented Generation (RAG). The system efficiently processes, indexes, and queries massive documents to provide accurate, contextual answers.
- Large Document Support: Handle documents of 800k+ words efficiently
- Multiple Format Support: PDF, DOCX, TXT, and Markdown files
- Advanced RAG Pipeline: Combines retrieval and generation for accurate answers
- Vector Database Options: FAISS, ChromaDB, and Pinecone support
- Conversational Mode: Maintains context across multiple queries
- Web Interface: Beautiful Streamlit UI for easy interaction
- REST API: FastAPI-based API for integration
- CLI Tool: Command-line interface for batch processing
- Scalable Architecture: Modular design for easy extension
- Python 3.8+
- OpenAI API key
- 8GB+ RAM recommended for large documents
- 2GB+ disk space for vector indexes
- Clone the repository:
git clone <repository-url>
cd doc-reader
- Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your API keys and configurations
- Create necessary directories:
mkdir -p documents indexes logs
Edit the .env file with your settings:
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional - Vector Database
VECTOR_DB_TYPE=faiss # Options: faiss, chroma, pinecone
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=5
# Model Settings
EMBEDDING_MODEL=text-embedding-ada-002
CHAT_MODEL=gpt-4-turbo-preview
TEMPERATURE=0.1
MAX_TOKENS=4000
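These settings are typically read once at startup. A minimal sketch of how they might be loaded, assuming the common python-dotenv pattern (the project's actual loader is not shown here):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]          # required; raises if missing
VECTOR_DB_TYPE = os.getenv("VECTOR_DB_TYPE", "faiss")  # optional, with defaults
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", "5"))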
Start the API server:
python -m src.api
In another terminal, start the Streamlit app:
streamlit run src/streamlit_app.py
Open your browser to http://localhost:8501
Add documents to the index:
python -m src.cli add path/to/document1.pdf path/to/document2.docx
Query the documents:
python -m src.cli query "What are the main findings in the research?"
Interactive mode:
python -m src.cli interactive --conversational
from src.rag_engine import RAGEngine
# Initialize the RAG engine
rag = RAGEngine(index_name="my_documents")
# Add documents
rag.add_documents([
    "path/to/large_document.pdf",
    "path/to/research_paper.docx"
])
# Query the documents
result = rag.query("What are the key conclusions?")
print(result.answer)
# Access source documents
for doc in result.source_documents:
    print(f"Source: {doc.metadata['filename']}")
    print(f"Content: {doc.page_content[:200]}...")
from src.rag_engine import RAGEngine
# Create specialized index for academic papers
rag = RAGEngine(index_name="academic_papers")
# Add multiple research papers
papers = [
    "papers/machine_learning_survey_2024.pdf",
    "papers/deep_learning_advances.pdf",
    "papers/nlp_transformers_review.pdf"
]
rag.add_documents(papers)
# Ask research questions
questions = [
    "What are the latest advances in transformer models?",
    "How do different ML approaches compare in performance?",
    "What are the main challenges in current NLP research?"
]
for question in questions:
    result = rag.query(question)
    print(f"Q: {question}")
    print(f"A: {result.answer}\n")
from src.rag_engine import ConversationalRAG
# Use conversational mode for complex legal queries
legal_rag = ConversationalRAG(index_name="legal_docs")
# Add legal documents
legal_rag.add_documents([
    "contracts/service_agreement_800k_words.pdf",
    "regulations/compliance_manual.docx"
])
# Interactive legal consultation
result1 = legal_rag.conversational_query(
    "What are the termination clauses in the service agreement?"
)
result2 = legal_rag.conversational_query(
    "How do these clauses relate to the compliance requirements?"
)
# CLI example for technical docs
python -m src.cli add \
manuals/software_manual_v2.pdf \
docs/api_documentation.md \
guides/troubleshooting_guide.docx
# Query with high precision
python -m src.cli query \
"How do I configure the authentication module?" \
--top-k 3 \
--include-sources \
--include-scores
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     Document     │     │   Vector Store   │     │    LLM Engine    │
│    Processor     │────▶│  (FAISS/Chroma)  │────▶│     (GPT-4)      │
└──────────────────┘     └──────────────────┘     └──────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Text Chunks    │     │    Embeddings    │     │    Contextual    │
│   + Metadata     │     │   + Similarity   │     │     Answers      │
└──────────────────┘     └──────────────────┘     └──────────────────┘
- Document Processor: Extracts and chunks text from various formats
- Vector Store Manager: Handles embedding storage and similarity search
- RAG Engine: Orchestrates retrieval and generation
- API Layer: FastAPI for REST endpoints
- UI Layer: Streamlit for web interface
- CLI: Command-line tools for batch operations
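To make the orchestration concrete, here is a conceptual sketch of what a single query pass through these components looks like. The helper names (embeddings, vector_store, llm) and their methods are illustrative assumptions, not the engine's real internals:

# Conceptual sketch of a RAG query pass; names below are assumptions.
def answer_question(question: str, top_k: int = 5) -> str:
    q_vec = embeddings.embed_query(question)             # 1. embed the question
    chunks = vector_store.similarity_search_by_vector(   # 2. retrieve top-k chunks
        q_vec, k=top_k
    )
    context = "\n\n".join(c.page_content for c in chunks)  # 3. assemble context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.invoke(prompt)                            # 4. generate the answer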
from src.document_processor import DocumentProcessor
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create custom text splitter for code documents
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=300,
    separators=["\nclass ", "\ndef ", "\n\n", "\n", " "],
    length_function=len
)
processor = DocumentProcessor()
processor.text_splitter = code_splitter
# Use different vector stores for different document types
academic_rag = RAGEngine(index_name="academic") # Uses FAISS
legal_rag = RAGEngine(index_name="legal") # Uses ChromaDB
# Configure in .env:
# VECTOR_DB_TYPE=faiss # or chroma, pinecone
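How the engine maps VECTOR_DB_TYPE to a backend is not shown in this README; a plausible factory, using LangChain's stock vector stores, might look like this:

import os
from langchain.vectorstores import FAISS, Chroma

def make_vector_store(texts, embeddings):
    """Illustrative backend selection driven by VECTOR_DB_TYPE."""
    db_type = os.getenv("VECTOR_DB_TYPE", "faiss")
    if db_type == "faiss":
        return FAISS.from_texts(texts, embeddings)
    if db_type == "chroma":
        return Chroma.from_texts(texts, embeddings)
    raise ValueError(f"Unsupported VECTOR_DB_TYPE: {db_type}")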
from langchain.prompts import PromptTemplate
custom_prompt = PromptTemplate(
    template="""You are a legal expert assistant. Use the provided context to answer legal questions accurately.

Context: {context}

Question: {question}

Please provide a detailed legal analysis with relevant citations from the context.

Answer:""",
    input_variables=["context", "question"]
)
rag.prompt_template = custom_prompt
- Increase chunk overlap for better context:
CHUNK_SIZE=1500
CHUNK_OVERLAP=300
- Use hierarchical chunking (sketched after this list):
# Process the document into sections first, then chunks
processor.hierarchical_chunking = True
- Optimize vector search:
TOP_K_RESULTS=10 # Retrieve more candidates
- Use efficient vector store:
VECTOR_DB_TYPE=faiss # Fastest for large datasets
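The hierarchical chunking option above splits on coarse section boundaries before producing the final overlapping chunks, so no chunk straddles two sections. A minimal sketch of the idea; the separator choices and sizes here are assumptions:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Pass 1: split the document into large sections.
section_splitter = RecursiveCharacterTextSplitter(
    chunk_size=8000, chunk_overlap=0, separators=["\n# ", "\n## ", "\n\n"]
)
# Pass 2: split each section into overlapping retrieval-sized chunks.
chunk_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=300
)

def hierarchical_chunks(text):
    for section in section_splitter.split_text(text):
        yield from chunk_splitter.split_text(section)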
# Process documents in batches
from src.utils import chunk_list
large_doc_list = ["doc1.pdf", "doc2.pdf", ...]
for batch in chunk_list(large_doc_list, batch_size=5):
    rag.add_documents(batch)
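src.utils.chunk_list is not reproduced in this README; a typical implementation is a simple slicing generator:

def chunk_list(items, batch_size):
    """Yield successive batch_size-sized slices of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]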
Run the test suite:
# Run all tests
pytest tests/
# Run specific test categories
pytest tests/ -k "test_document_processor"
pytest tests/ -k "test_rag_engine"
# Run with coverage
pytest tests/ --cov=src --cov-report=html
Start the API server and visit:
- Swagger UI:
http://localhost:8000/docs
- ReDoc:
http://localhost:8000/redoc
- POST /upload-documents: Upload and process documents
- POST /query: Query the document index
- POST /conversational-query: Query with conversation context
- GET /index-stats: Get index statistics
- DELETE /index: Clear the index
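A quick way to exercise the API from the command line. The JSON and form field names below are assumptions; check the Swagger UI for the exact request schemas:

# Upload a document (field name "files" is an assumption)
curl -X POST http://localhost:8000/upload-documents \
  -F "files=@path/to/document.pdf"

# Ask a question (field name "question" is an assumption)
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key conclusions?"}'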
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "-m", "src.api"]
- Use production vector database (Pinecone, Weaviate)
- Implement rate limiting and authentication (see the sketch after this list)
- Scale with load balancer for high traffic
- Monitor API performance and costs
- Backup vector indexes regularly
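For the rate-limiting item above, one option is the slowapi package (not a dependency of this project); a minimal sketch of wiring it into the FastAPI app:

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # rate-limit per client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")  # at most 10 queries per minute per IP
async def query(request: Request):
    ...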
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Q: "Import errors when running the application"
# Install missing dependencies
pip install -r requirements.txt
# Check Python version
python --version # Should be 3.8+
Q: "Out of memory when processing large documents"
# Reduce chunk size and batch processing
CHUNK_SIZE=500
# Process documents one at a time
Q: "API server not starting"
# Check if port is available
lsof -i :8000
# Use different port
python -m src.api --port 8001
Q: "Poor answer quality"
# Increase context retrieval
TOP_K_RESULTS=10
# Adjust chunk overlap
CHUNK_OVERLAP=400
# Try different models
CHAT_MODEL=gpt-4-turbo-preview
- Create an issue on GitHub
- Check the documentation
- Review the test cases for usage examples
Built with ❤️ for efficient large document processing