local llama version of the hybrid search RAG system
Madhuvod committed Dec 5, 2024
1 parent 367e2cf commit 5edd3dd
Showing 4 changed files with 361 additions and 0 deletions.
1 change: 1 addition & 0 deletions rag_tutorials/local_hybrid_search_rag/.gitignore
@@ -0,0 +1 @@
.flashrank_cache
128 changes: 128 additions & 0 deletions rag_tutorials/local_hybrid_search_rag/README.md
@@ -0,0 +1,128 @@
# Local LLM Hybrid Search-RAG Assistant 🤖

A document Q&A application that combines hybrid search-based Retrieval-Augmented Generation (RAG) with locally running LLMs. Built with RAGLite for document processing and retrieval and Streamlit for an intuitive chat interface, it grounds answers in your documents and falls back to the local model's general knowledge when no relevant context is found.

## Demo

## Quick Start

For a quick first run, use these verified model configurations:
```bash
# LLM Model
bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf@4096

# Embedder Model
lm-kit/bge-m3-gguf/bge-m3-Q4_K_M.gguf@1024
```
These models offer a good balance of performance and resource usage, and have been verified to work well together even on a MacBook Air M2 with 8GB RAM.

## Features

- **Local LLM Integration**:
- Uses llama-cpp-python models for local inference
- Supports various quantization formats (Q4_K_M recommended)
- Configurable context window sizes

- **Document Processing**:
- PDF document upload and processing
- Automatic text chunking and embedding
- Hybrid search combining semantic and keyword matching
  - Reranking for better context selection (see the retrieval sketch after this list)

- **Multi-Model Integration**:
- Local LLM for text generation (e.g., Llama-3.2-3B-Instruct)
- Local embeddings using BGE models
- FlashRank for local reranking
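
Concretely, retrieval is a three-step pipeline: hybrid search over the stored chunks, retrieval of the matching chunks, and local reranking. The minimal sketch below mirrors `perform_search` in `local_main.py`; it assumes `config` is an already initialized `RAGLiteConfig` (see Model Setup below) and uses an illustrative question.

```python
from raglite import hybrid_search, retrieve_chunks, rerank_chunks

query = "What does the uploaded report conclude?"  # illustrative question

# Hybrid search blends keyword and semantic matching and returns chunk ids with scores.
chunk_ids, _ = hybrid_search(query, num_results=10, config=config)

# Fetch the matching chunks, then rerank them locally (FlashRank via the configured reranker).
chunks = retrieve_chunks(chunk_ids, config=config)
top_chunks = rerank_chunks(query, chunks, config=config)
```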

## Prerequisites

1. **Install spaCy Model**:
```bash
pip install https://github.com/explosion/spacy-models/releases/download/xx_sent_ud_sm-3.7.0/xx_sent_ud_sm-3.7.0-py3-none-any.whl
```

2. **Install Accelerated llama-cpp-python** (Optional but recommended):
```bash
# Configure installation variables
LLAMA_CPP_PYTHON_VERSION=0.3.2
PYTHON_VERSION=310 # 3.10, 3.11, 3.12
ACCELERATOR=metal # For Mac
# ACCELERATOR=cu121 # For NVIDIA GPU
PLATFORM=macosx_11_0_arm64 # For Mac
# PLATFORM=linux_x86_64 # For Linux
# PLATFORM=win_amd64 # For Windows

# Install accelerated version
pip install "https://github.com/abetlen/llama-cpp-python/releases/download/v$LLAMA_CPP_PYTHON_VERSION-$ACCELERATOR/llama_cpp_python-$LLAMA_CPP_PYTHON_VERSION-cp$PYTHON_VERSION-cp$PYTHON_VERSION-$PLATFORM.whl"
```

3. **Install Dependencies**:
```bash
pip install -r requirements.txt
```

## Model Setup

RAGLite extends LiteLLM with support for llama.cpp models via llama-cpp-python. A llama.cpp model (e.g., from bartowski's collection on Hugging Face) is selected with an identifier of the form `llama-cpp-python/<hugging_face_repo_id>/<filename>@<n_ctx>`, where `n_ctx` is an optional parameter that specifies the context size of the model. In this app you enter only the `<hugging_face_repo_id>/<filename>@<n_ctx>` part; the `llama-cpp-python/` prefix is added automatically when the configuration is saved (see the sketch at the end of this section).

1. **LLM Model Path Format**:
```
<hugging_face_repo_id>/<filename>@<context_length>
```
Example:
```
bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf@4096
```

2. **Embedder Model Path Format**:
```
<hugging_face_repo_id>/<filename>@<context_length>
```
Example:
```
lm-kit/bge-m3-gguf/bge-m3-Q4_K_M.gguf@1024
```
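
When you click "Save Configuration", the app assembles these paths into a `RAGLiteConfig`. The sketch below mirrors `initialize_config` in `local_main.py`; the database URL is a placeholder.

```python
from raglite import RAGLiteConfig
from rerankers import Reranker

config = RAGLiteConfig(
    db_url="postgresql://user:pass@ep-xyz.region.aws.neon.tech/dbname",  # placeholder URL
    llm="llama-cpp-python/bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf@4096",
    embedder="llama-cpp-python/lm-kit/bge-m3-gguf/bge-m3-Q4_K_M.gguf@1024",
    embedder_normalize=True,
    chunk_max_size=512,
    reranker=Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank"),
)
```

The same config object is then reused for document insertion, search, and answer generation.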

## Database Setup

The application stores processed documents and their embeddings in a SQL database:

- **PostgreSQL** (Recommended):
- Create a free serverless PostgreSQL database at [Neon](https://neon.tech) in a few clicks
- Get instant provisioning and scale-to-zero capability
- Connection string format: `postgresql://user:pass@ep-xyz.region.aws.neon.tech/dbname`


## How to Run

1. **Start the Application**:
```bash
streamlit run local_main.py
```

2. **Configure the Application**:
- Enter LLM model path
- Enter embedder model path
- Set database URL
- Click "Save Configuration"

3. **Upload Documents**:
- Upload PDF files through the interface
- Wait for processing completion

4. **Start Chatting**:
- Ask questions about your documents
- Get responses using local LLM
- Fallback to general knowledge when needed
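
Both response paths in step 4 go through RAGLite's `rag` function, as in `local_main.py`: with `search=hybrid_search` when relevant chunks are found, and with `search=None` for the general-knowledge fallback. A minimal sketch, assuming `config` is the saved `RAGLiteConfig` and using shortened system prompts:

```python
from raglite import rag, hybrid_search

question = "What are the key findings?"  # illustrative question

# Answer from the uploaded documents: hybrid retrieval feeds context into the local LLM.
doc_answer = "".join(rag(
    prompt=question,
    system_prompt="Answer using only the retrieved context.",
    search=hybrid_search,
    messages=[],
    max_contexts=5,
    config=config,
))

# Fallback: no retrieval, so the local LLM answers from its general knowledge.
fallback_answer = "".join(rag(
    prompt=question,
    system_prompt="You are a helpful AI assistant.",
    search=None,
    messages=[],
    max_tokens=1024,
    temperature=0.7,
    config=config,
))
```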

## Notes

- A context window of 4096 tokens is recommended for most use cases
- Q4_K_M quantization offers a good balance of speed and quality
- The BGE-M3 embedder (1024-dimensional) pairs well with this setup
- Local models require sufficient RAM and CPU/GPU resources
- Metal acceleration available for Mac, CUDA for NVIDIA GPUs

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
217 changes: 217 additions & 0 deletions rag_tutorials/local_hybrid_search_rag/local_main.py
@@ -0,0 +1,217 @@
import os
import logging
import streamlit as st
from raglite import RAGLiteConfig, insert_document, hybrid_search, retrieve_chunks, rerank_chunks, rag
from rerankers import Reranker
from typing import List, Dict, Any
from pathlib import Path
import time
import warnings

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
warnings.filterwarnings("ignore", message=".*torch.classes.*")

RAG_SYSTEM_PROMPT = """
You are a friendly and knowledgeable assistant that provides complete and insightful answers.
Answer the user's question using only the context below.
When responding, you MUST NOT reference the existence of the context, directly or indirectly.
Instead, you MUST treat the context as if its contents are entirely part of your working memory.
""".strip()

def initialize_config(settings: Dict[str, Any]) -> RAGLiteConfig:
try:
return RAGLiteConfig(
db_url=settings["DBUrl"],
llm=f"llama-cpp-python/{settings['LLMPath']}",
embedder=f"llama-cpp-python/{settings['EmbedderPath']}",
embedder_normalize=True,
chunk_max_size=512,
reranker=Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank")
)
except Exception as e:
raise ValueError(f"Configuration error: {e}")

def process_document(file_path: str) -> bool:
try:
if not st.session_state.get('my_config'):
raise ValueError("Configuration not initialized")
insert_document(Path(file_path), config=st.session_state.my_config)
return True
except Exception as e:
logger.error(f"Error processing document: {str(e)}")
return False

def perform_search(query: str) -> List[dict]:
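    # Hybrid (keyword + semantic) search returns chunk ids and scores; fetch the chunks, then rerank them locally with FlashRank.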
try:
chunk_ids, scores = hybrid_search(query, num_results=10, config=st.session_state.my_config)
if not chunk_ids:
return []
chunks = retrieve_chunks(chunk_ids, config=st.session_state.my_config)
return rerank_chunks(query, chunks, config=st.session_state.my_config)
except Exception as e:
logger.error(f"Search error: {str(e)}")
return []

def handle_fallback(query: str) -> str:
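    # Used when retrieval finds nothing relevant: call rag() with search=None so the local LLM answers from general knowledge.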
try:
system_prompt = """You are a helpful AI assistant. When you don't know something,
be honest about it. Provide clear, concise, and accurate responses."""

response_stream = rag(
prompt=query,
system_prompt=system_prompt,
search=None,
messages=[],
max_tokens=1024,
temperature=0.7,
config=st.session_state.my_config
)

full_response = ""
for chunk in response_stream:
full_response += chunk

if not full_response.strip():
return "I apologize, but I couldn't generate a response. Please try rephrasing your question."

return full_response

except Exception as e:
logger.error(f"Fallback error: {str(e)}")
return "I apologize, but I encountered an error while processing your request. Please try again."

def main():
st.set_page_config(page_title="Local LLM-Powered Hybrid Search-RAG Assistant", layout="wide")

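    # Initialize session state: chat history, documents-loaded flag, and the saved RAGLite config.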
for state_var in ['chat_history', 'documents_loaded', 'my_config']:
if state_var not in st.session_state:
st.session_state[state_var] = [] if state_var == 'chat_history' else False if state_var == 'documents_loaded' else None

with st.sidebar:
st.title("Configuration")

llm_path = st.text_input(
"LLM Model Path",
value=st.session_state.get('llm_path', ''),
placeholder="TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf@4096",
help="Path to your local LLM model in GGUF format"
)

embedder_path = st.text_input(
"Embedder Model Path",
value=st.session_state.get('embedder_path', ''),
placeholder="lm-kit/bge-m3-gguf/bge-m3-Q4_K_M.gguf@1024",
help="Path to your local embedding model in GGUF format"
)

db_url = st.text_input(
"Database URL",
value=st.session_state.get('db_url', ''),
placeholder="postgresql://user:pass@host:port/db",
help="Database connection URL"
)

if st.button("Save Configuration"):
try:
if not all([llm_path, embedder_path, db_url]):
st.error("All fields are required!")
return

settings = {
"LLMPath": llm_path,
"EmbedderPath": embedder_path,
"DBUrl": db_url
}

st.session_state.my_config = initialize_config(settings)
st.success("Configuration saved successfully!")

except Exception as e:
st.error(f"Configuration error: {str(e)}")

st.title("Local LLM-Powered Hybrid Search-RAG Assistant")

if st.session_state.my_config:
uploaded_files = st.file_uploader(
"Upload PDF documents",
type=["pdf"],
accept_multiple_files=True,
key="pdf_uploader"
)

if uploaded_files:
success = False
for uploaded_file in uploaded_files:
with st.spinner(f"Processing {uploaded_file.name}..."):
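                    # insert_document expects a filesystem path, so persist the upload to a temporary file first.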
temp_path = f"temp_{uploaded_file.name}"
with open(temp_path, "wb") as f:
f.write(uploaded_file.getvalue())

if process_document(temp_path):
st.success(f"Successfully processed: {uploaded_file.name}")
success = True
else:
st.error(f"Failed to process: {uploaded_file.name}")
os.remove(temp_path)

if success:
st.session_state.documents_loaded = True
st.success("Documents are ready! You can now ask questions about them.")

if st.session_state.documents_loaded:
for msg in st.session_state.chat_history:
with st.chat_message("user"): st.write(msg[0])
with st.chat_message("assistant"): st.write(msg[1])

user_input = st.chat_input("Ask a question about the documents...")
if user_input:
with st.chat_message("user"): st.write(user_input)
with st.chat_message("assistant"):
message_placeholder = st.empty()
try:
reranked_chunks = perform_search(query=user_input)
                    if not reranked_chunks:
logger.info("No relevant documents found. Falling back to local LLM.")
with st.spinner("Using general knowledge to answer..."):
full_response = handle_fallback(user_input)
if full_response.startswith("I apologize"):
st.warning("No relevant documents found and fallback failed.")
else:
st.info("Answering from general knowledge.")
else:
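                        # Flatten stored (user, assistant) pairs into alternating role/content messages for rag().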
formatted_messages = [
{"role": "user" if i % 2 == 0 else "assistant", "content": msg}
for i, msg in enumerate([m for pair in st.session_state.chat_history for m in pair])
if msg
]

response_stream = rag(
prompt=user_input,
system_prompt=RAG_SYSTEM_PROMPT,
search=hybrid_search,
messages=formatted_messages,
max_contexts=5,
config=st.session_state.my_config
)

full_response = ""
for chunk in response_stream:
full_response += chunk
message_placeholder.markdown(full_response + "▌")

message_placeholder.markdown(full_response)
st.session_state.chat_history.append((user_input, full_response))

except Exception as e:
logger.error(f"Error: {str(e)}")
st.error(f"Error: {str(e)}")
else:
st.info(
"Please configure your model paths and upload documents to get started."
if not st.session_state.my_config
else "Please upload some documents to get started."
)

if __name__ == "__main__":
main()
15 changes: 15 additions & 0 deletions rag_tutorials/local_hybrid_search_rag/requirements.txt
@@ -0,0 +1,15 @@
raglite==0.2.1
llama-cpp-python>=0.2.56
sentence-transformers>=2.5.1
pydantic==2.10.1
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.9
pypdf>=3.0.0
python-dotenv>=1.0.0
rerankers==0.6.0
spacy>=3.7.0
streamlit>=1.31.0
flashrank==0.2.9
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.66.0
