A powerful, adaptive Retrieval-Augmented Generation (RAG) system built with a decoupled microservices architecture. This project separates core AI/ML inference tasks from business logic, creating a scalable, maintainable, and developer-friendly solution.
- Project Philosophy
- Key Features
- How It Works: The RAG Pipeline
- Architecture Overview
- Technology Stack
- Getting Started on Windows
- Project Structure
- Contributing
- License
- Contact
The core idea is to decouple AI from logic. In many projects, ML inference code is tangled with application logic (data processing, routing, state management). This project avoids that by splitting the system into two main services:
- RAG API (The Orchestrator): A lightweight service that manages the RAG pipeline, handles business logic, and communicates with the UI. It knows what to do.
- Inference API (The Muscle): A dedicated, heavy-duty service that runs all the demanding ML models (LLMs, Embedders, Rerankers). It knows how to do it.
This separation provides immense benefits in scalability, independent development, and easier maintenance.
- Fully Decoupled Services: Scale, develop, and deploy the UI, logic, and ML services independently.
- Real-Time Streaming: Delivers responses token-by-token for a dynamic and interactive user experience.
- Adaptive Context Strategy: Intelligently decides how to use retrieved information based on confidence scores to prevent hallucinations.
- Advanced RAG Pipeline: Incorporates HyDE, Hybrid Search, and Cross-Encoder Reranking for highly accurate and relevant results.
- Clean API Design: Uses Pydantic for data validation and a shared data models package for type-safe communication between services.
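The last point, the shared data models package, can be illustrated with a minimal sketch. The class and field names below are assumptions for illustration, not the actual contents of `packages/shared-models`:

```python
# Hypothetical sketch of shared request/response schemas, similar in spirit
# to packages/shared-models; real names and fields may differ.
from pydantic import BaseModel, Field


class AskRequest(BaseModel):
    """Payload the Gradio client sends to the RAG API."""
    question: str = Field(..., min_length=1)


class RerankRequest(BaseModel):
    """Payload the RAG API sends to the Inference API for reranking."""
    query: str
    documents: list[str]


class RerankResponse(BaseModel):
    """Relevance scores returned by the Inference API."""
    scores: list[float]
```

Because both services import the same models, a schema change is caught by Pydantic validation on either side instead of surfacing as a silent contract mismatch.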
This engine employs a multi-stage process to generate the most accurate answers possible.
- Query Expansion with HyDE: The user's query is first sent to a small LLM to generate a hypothetical document. This enriches the query's semantic meaning, leading to better search results.
- Hybrid Search: The system performs two types of search in parallel:
- Dense Search (FAISS): Finds documents that are semantically similar in meaning.
- Sparse Search (BM25): Finds documents that match specific keywords.
- Cross-Encoder Reranking: The results from both searches are passed to a powerful cross-encoder model. It directly compares the original query against each retrieved document to calculate a precise relevance score, filtering out noise and promoting the best context.
- Adaptive Context Strategy: Based on the top reranker score, the engine makes a smart decision:
- High Confidence: Provide a rich, detailed context to the LLM.
- Medium Confidence: Summarize the retrieved documents to distill key facts.
- Low Confidence: Reject the context entirely to avoid making things up (hallucinating).
- Final Generation: The curated context and original query are sent to the primary LLM to generate the final, streamed answer.
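The adaptive decision can be pictured as a simple threshold check on the top reranker score. This is a minimal sketch only; the thresholds and the `summarize` helper are assumptions, not values taken from the codebase:

```python
# Illustrative sketch of the adaptive context strategy; thresholds and the
# summarize helper are hypothetical, not taken from the actual engine.
HIGH_CONFIDENCE = 0.7   # assumed cut-off for "use the context as-is"
LOW_CONFIDENCE = 0.3    # assumed cut-off below which context is rejected


def build_context(reranked: list[tuple[str, float]], summarize) -> str:
    """Decide how much retrieved context to pass to the LLM."""
    if not reranked:
        return ""  # nothing retrieved: answer from the model alone
    top_score = reranked[0][1]
    docs = [doc for doc, _ in reranked]
    if top_score >= HIGH_CONFIDENCE:
        return "\n\n".join(docs)   # rich, detailed context
    if top_score >= LOW_CONFIDENCE:
        return summarize(docs)     # distill key facts
    return ""                      # reject context to avoid hallucination
```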
The microservices pattern ensures a clear separation of concerns. The user interacts with a web client, which communicates with the RAG API (orchestrator). This API, in turn, offloads all heavy ML computations to the dedicated Inference API.
```mermaid
flowchart TD
subgraph UI["User Interface"]
U(("User"))
GD(["Gradio Web UI"])
end
subgraph RAG["RAG API @8000"]
RagApp(["FastAPI Orchestrator"])
end
subgraph INF["Inference API @8001"]
IA(["FastAPI ML Endpoints"])
Models(["LLM, Reranker, Embedder"])
DB[("Vector/Keyword DB\nfaiss.index, bm25.index")]
end
U -- Sends Question --> GD
GD -- HTTP POST /ask --> RagApp
RagApp -- "<b>1. Generate HyDE</b>" --> IA
RagApp -- "<b>2. Hybrid Search</b>" --> IA
RagApp -- "<b>3. Rerank Chunks</b>" --> IA
RagApp -- "<b>4. Decide Strategy</b>" --> RagApp
RagApp -- "<b>5. Generate Final Answer</b>" --> IA
IA -- Manages --> Models
IA -- Manages --> DB
IA -. Streaming Tokens .-> RagApp
RagApp -. Streaming JSON .-> GD
GD -. Renders Text .-> U
style RagApp fill:#bbdefb,stroke:#1976d2,stroke-width:2px
style IA fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
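In code, this orchestration amounts to the RAG API calling the Inference API over HTTP for steps 1-3 and 5 and streaming the result back. A minimal sketch with HTTPX, assuming hypothetical endpoint names and payload shapes:

```python
# Sketch of the orchestration flow in the RAG API; endpoint names and payload
# shapes are assumptions for illustration, not the service's real routes.
import httpx

INFERENCE_API = "http://localhost:8001"


async def answer(question: str):
    """Run the pipeline against the Inference API and stream answer tokens."""
    async with httpx.AsyncClient(base_url=INFERENCE_API, timeout=None) as client:
        # 1. HyDE: expand the query into a hypothetical document.
        hyde = (await client.post("/hyde", json={"query": question})).json()["document"]

        # 2. Hybrid search: dense (FAISS) + sparse (BM25) retrieval.
        chunks = (await client.post("/search", json={"query": hyde})).json()["chunks"]

        # 3. Rerank the retrieved chunks against the original question.
        scored = (await client.post(
            "/rerank", json={"query": question, "documents": chunks}
        )).json()["results"]

        # 4. Decide the context strategy locally from the top score (adaptive step).
        context = "\n\n".join(doc for doc, _ in scored[:5])

        # 5. Stream the final answer token by token back to the caller.
        async with client.stream(
            "POST", "/generate", json={"query": question, "context": context}
        ) as response:
            async for token in response.aiter_text():
                yield token
```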
| Category | Technology / Library | Purpose |
|---|---|---|
| Backend & API | FastAPI, Uvicorn | Building high-performance, asynchronous APIs for both services. |
| AI / Machine Learning | llama-cpp-python, Sentence Transformers, mxbai-rerank, Transformers | Running the LLM, generating embeddings, and reranking documents. |
| Vector & Keyword Search | Faiss, rank_bm25 | Performing efficient similarity and keyword-based retrieval. |
| Frontend / UI | Gradio | Creating a rapid, interactive web interface. |
| Data & Configuration | Pydantic, pydantic-settings | Data validation, type safety, and environment configuration. |
| Communication | HTTPX | Asynchronous HTTP client for inter-service communication. |
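As an illustration of the configuration approach, pydantic-settings can load the variables defined in the `.env` file described in the setup below. This sketch shows the pattern only and is not the services' actual Settings class:

```python
# Illustrative pydantic-settings class; the real services define their own
# Settings classes. Field names mirror the .env variables documented below.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_model_path: str
    hyde_model_path: str
    base_context_file: str
    complete_context_file: str


# Values are read from the environment / .env file at instantiation time,
# so this raises a clear validation error if a required path is missing.
settings = Settings()
```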
This guide provides a detailed process to ensure a smooth setup on a Windows machine with an NVIDIA GPU.
- An NVIDIA GPU with CUDA support.
- Git: Download here.
- Miniconda: Download here.
- Python 3.11 (will be installed via Conda).
Correctly installing the C++ compiler and CUDA Toolkit is the most critical step to avoid compilation errors.
llama-cpp-python must be built from source with a C++ compiler if a pre-built wheel isn't used.
- Download the Visual Studio 2022 Community installer from the official website.
- Run the installer. On the Workloads tab, select "Desktop development with C++".
- CRITICAL STEP: Go to the Individual components tab. Search for and select an older C++ toolset for maximum compatibility. A good choice is "MSVC v143 - VS 2022 C++ x64/x86 build tools (v14.38...)". This ensures that even if CUDA doesn't support the absolute latest compiler, a compatible one is available, preventing common build failures.
- Click Install and wait for the process to complete.
CRITICAL NOTE: Proceed to the next steps only after Visual Studio is fully installed.
- Check your NVIDIA driver's supported CUDA version by opening PowerShell and running `nvidia-smi`.
- Download the matching CUDA Toolkit from the NVIDIA Developer website.
- Run the installer. When prompted, choose the Custom (Advanced) installation.
- Ensure that "Visual Studio Integration" is checked. This is crucial for the toolkit to find your C++ compiler.
- Rebooting the system is required after the installation.
- Clone the Repository:

  ```bash
  git clone https://github.com/KareemSayed1232/Decoupled-Adaptive-Rag-Engine.git
  cd Decoupled-Adaptive-Rag-Engine
  ```

- Create and Activate Conda Environment:

  ```bash
  conda create -n rag_env python=3.11 -y
  conda activate rag_env
  ```
These packages need to be installed carefully to enable GPU acceleration.
- Go to the PyTorch website and find the installation command for your specific system (select Pip, your OS, and your CUDA version).
- Run the command with your CUDA version:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/[YOUR CUDA VERSION]
  # For example, if you have CUDA 12.4:
  # pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  ```

- Verify:

  ```bash
  python -c "import torch; print(f'PyTorch CUDA available: {torch.cuda.is_available()}')"
  ```

  You should see:

  ```
  PyTorch CUDA available: True
  ```
Always try Option 1 first. It's the fastest and easiest method.
Option 1: Install Pre-built Wheel
The library maintainers provide pre-built wheels for multiple CUDA versions.
It is recommended to follow the official instructions and use the recommended CUDA versions to ensure compatibility and avoid errors.
- Install the package, telling pip to look for the correct pre-built wheel. Replace `cu124` below with your CUDA version (e.g., `cu124`, `cu125`).

  ```bash
  # Example for CUDA 12.4
  pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
  ```
- If this command succeeds without errors, you are done with this step.
Option 2: Compile from Source (Fallback Method)
Use this method only if Option 1 fails or if a pre-built wheel is not available for your CUDA version. This uses the Visual Studio compiler we set up in Step 1.
- Set the necessary environment variables in a PowerShell terminal.

  ```powershell
  # These flags tell the installer to build with CUDA support
  $env:CMAKE_ARGS="-DGGML_CUDA=on"
  $env:FORCE_CMAKE="1"
  ```

- Run the installation. This will take several minutes to compile.

  ```powershell
  pip install --force-reinstall --no-cache-dir llama-cpp-python
  ```
After completing either option, verify the installation:

```bash
python -c "from llama_cpp import llama_supports_gpu_offload; print(f'llama.cpp GPU offload enabled: {llama_supports_gpu_offload()}')"
```

You should see:

```
llama.cpp GPU offload enabled: True
```

If the output is `llama.cpp GPU offload enabled: False`, fall back to Option 2.
- Create the `.env` file: Create a copy of the example template.

  ```powershell
  # In PowerShell
  Copy-Item .env.example .env
  ```

- Edit the `.env` file: Open the new `.env` file and update the paths to your local models and data.

  Required Model Paths: These paths must be updated to point to the location of your downloaded GGUF model files on your local machine.

  | Variable | Description | Example Value |
  |---|---|---|
  | `LLM_MODEL_PATH` | Path to the main Large Language Model file. | `data/models/guff/Qwen3-8B-Q5_K_M.gguf` |
  | `HYDE_MODEL_PATH` | Path to the smaller LLM used for HyDE. | `data/models/guff/Phi-3-mini-4k-instruct-Q4_K_M.gguf` |

  Required Data Paths:

  | Variable | Description | Example Value |
  |---|---|---|
  | `BASE_CONTEXT_FILE` | Path to your introductory context file. | `data/base_context.txt` |
  | `COMPLETE_CONTEXT_FILE` | Path to your full knowledge base file. | `data/complete_context.txt` |

- Install Project Python Packages:

  ```bash
  # Install the shared data models package first
  pip install ./packages/shared-models

  # Install dependencies for each service
  pip install -r services/inference_api/requirements.txt
  pip install -r services/rag_api/requirements.txt
  pip install -r clients/gradio-demo/requirements.txt
  ```

- Build Search Indexes: This script processes your documents and creates the search artifacts.

  ```powershell
  # This environment variable can prevent a common error on Windows
  $env:KMP_DUPLICATE_LIB_OK="TRUE"

  # Run the build script
  python scripts/build_index.py
  ```

  This will create `faiss.index` and `bm25.index` in `services/inference_api/artifacts/`.
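For reference, the index build boils down to embedding the document chunks for FAISS and tokenizing them for BM25. The sketch below is a simplified, hypothetical version of what `scripts/build_index.py` does; the chunking, embedding model, and persistence details may differ from the real script:

```python
# Simplified, hypothetical sketch of index building (FAISS + BM25);
# the real scripts/build_index.py may differ in chunking, models, and details.
import pickle

import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = ["chunk one ...", "chunk two ..."]  # normally produced from your data files

# Dense index: embed the chunks and store the vectors in FAISS.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model for illustration
embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "services/inference_api/artifacts/faiss.index")

# Sparse index: tokenize the chunks and persist a BM25 model.
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
with open("services/inference_api/artifacts/bm25.index", "wb") as f:
    pickle.dump(bm25, f)
```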
Run each service in its own terminal. The `rag_env` conda environment must be activated in all three.
| Terminal | Service | Commands |
|---|---|---|
| 1 | Inference API | `conda activate rag_env`<br>`$env:KMP_DUPLICATE_LIB_OK="TRUE"`<br>`cd services/inference_api`<br>`uvicorn src.main:app --port 8001` |
| 2 | RAG API | `conda activate rag_env`<br>`cd services/rag_api`<br>`uvicorn src.main:app --port 8000` |
| 3 | Gradio UI | `conda activate rag_env`<br>`cd clients/gradio-demo`<br>`python app.py` |
Once all services are running, open your browser and navigate to the local URL provided by Gradio (usually http://127.0.0.1:7860).
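If you want to smoke-test the backend without the UI, you can call the RAG API directly. The payload shape below is an assumption for illustration; adjust it to match the shared-models schema:

```python
# Quick smoke test of the RAG API without the Gradio UI; the /ask payload
# shape is an assumption -- adjust it to match the shared-models schema.
import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:8000/ask",
    json={"question": "What does this project do?"},
) as response:
    response.raise_for_status()
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)
```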
Detailed project tree:
```
.
├── clients/
│   └── gradio-demo/        # Frontend UI service
├── data/                   # Source documents and models (not in Git)
├── packages/
│   └── shared-models/      # Shared Pydantic models for type-safe APIs
├── scripts/
│   └── build_index.py      # Processes data and creates search indexes
├── services/
│   ├── inference_api/      # Handles all ML model inference
│   └── rag_api/            # Orchestrates the RAG business logic
├── .env.example            # Environment variable template
├── .gitignore
├── LICENSE
└── README.md
```
Contributions are welcome! Please feel free to fork the project, create a feature branch, and open a pull request.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/NewFeature`)
- Commit your Changes (`git commit -m 'Add some NewFeature'`)
- Push to the Branch (`git push origin feature/NewFeature`)
- Open a Pull Request
Distributed under the MIT License. See the LICENSE file for more information.
Kareem Sayed - LinkedIn - kareemsaid1232@gmail.com
Project Link: https://github.com/KareemSayed1232/Decoupled-Adaptive-Rag-Engine