A powerful, adaptive Retrieval-Augmented Generation (RAG) system built with a decoupled microservices architecture. This project separates core AI/ML inference tasks from business logic, creating a scalable, maintainable, and developer-friendly solution.
- Project Philosophy
- Key Features
- How It Works: The RAG Pipeline
- Architecture Overview
- Technology Stack
- Getting Started on Windows
- Project Structure
- Contributing
- License
- Contact
The core idea is to decouple AI from logic. In many projects, ML inference code is tangled with application logic (data processing, routing, state management). This project avoids that by splitting the system into two main services:
- RAG API (The Orchestrator): A lightweight service that manages the RAG pipeline, handles business logic, and communicates with the UI. It knows what to do.
- Inference API (The Muscle): A dedicated, heavy-duty service that runs all the demanding ML models (LLMs, Embedders, Rerankers). It knows how to do it.
This separation provides immense benefits in scalability, independent development, and easier maintenance.
- Fully Decoupled Services: Scale, develop, and deploy the UI, logic, and ML services independently.
- Real-Time Streaming: Delivers responses token-by-token for a dynamic and interactive user experience.
- Adaptive Context Strategy: Intelligently decides how to use retrieved information based on confidence scores to prevent hallucinations.
- Advanced RAG Pipeline: Incorporates HyDE, Hybrid Search, and Cross-Encoder Reranking for highly accurate and relevant results.
- Clean API Design: Uses Pydantic for data validation and a shared data models package for type-safe communication between services.
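The last point, the shared data models package, can be illustrated with a minimal sketch. The class and field names below are assumptions for illustration, not the actual contents of `packages/shared-models`:

```python
# Hypothetical sketch of shared request/response schemas, similar in spirit
# to packages/shared-models; real names and fields may differ.
from pydantic import BaseModel, Field


class AskRequest(BaseModel):
    """Payload the Gradio client sends to the RAG API."""
    question: str = Field(..., min_length=1)


class RerankRequest(BaseModel):
    """Payload the RAG API sends to the Inference API for reranking."""
    query: str
    documents: list[str]


class RerankResponse(BaseModel):
    """Relevance scores returned by the Inference API."""
    scores: list[float]
```

Because both services import the same models, a schema change is caught by Pydantic validation on either side instead of surfacing as a silent contract mismatch.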
This engine employs a multi-stage process to generate the most accurate answers possible.
- Query Expansion with HyDE: The user's query is first sent to a small LLM to generate a hypothetical document. This enriches the query's semantic meaning, leading to better search results.
- Hybrid Search: The system performs two types of search in parallel:
- Dense Search (FAISS): Finds documents that are semantically similar in meaning.
- Sparse Search (BM25): Finds documents that match specific keywords.
- Cross-Encoder Reranking: The results from both searches are passed to a powerful cross-encoder model. It directly compares the original query against each retrieved document to calculate a precise relevance score, filtering out noise and promoting the best context.
- Adaptive Context Strategy: Based on the top reranker score, the engine makes a smart decision:
- High Confidence: Provide a rich, detailed context to the LLM.
- Medium Confidence: Summarize the retrieved documents to distill key facts.
- Low Confidence: Reject the context entirely to avoid making things up (hallucinating).
- Final Generation: The curated context and original query are sent to the primary LLM to generate the final, streamed answer.
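The adaptive decision can be pictured as a simple threshold check on the top reranker score. This is a minimal sketch only; the thresholds and the `summarize` helper are assumptions, not values taken from the codebase:

```python
# Illustrative sketch of the adaptive context strategy; thresholds and the
# summarize helper are hypothetical, not taken from the actual engine.
HIGH_CONFIDENCE = 0.7   # assumed cut-off for "use the context as-is"
LOW_CONFIDENCE = 0.3    # assumed cut-off below which context is rejected


def build_context(reranked: list[tuple[str, float]], summarize) -> str:
    """Decide how much retrieved context to pass to the LLM."""
    if not reranked:
        return ""  # nothing retrieved: answer from the model alone
    top_score = reranked[0][1]
    docs = [doc for doc, _ in reranked]
    if top_score >= HIGH_CONFIDENCE:
        return "\n\n".join(docs)   # rich, detailed context
    if top_score >= LOW_CONFIDENCE:
        return summarize(docs)     # distill key facts
    return ""                      # reject context to avoid hallucination
```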
The microservices pattern ensures a clear separation of concerns. The user interacts with a web client, which communicates with the RAG API (orchestrator). This API, in turn, offloads all heavy ML computations to the dedicated Inference API.
```mermaid
flowchart TD
subgraph UI["User Interface"]
U(("User"))
GD(["Gradio Web UI"])
end
subgraph RAG["RAG API @8000"]
RagApp(["FastAPI Orchestrator"])
end
subgraph INF["Inference API @8001"]
IA(["FastAPI ML Endpoints"])
Models(["LLM, Reranker, Embedder"])
DB[("Vector/Keyword DB\nfaiss.index, bm25.index")]
end
U -- Sends Question --> GD
GD -- HTTP POST /ask --> RagApp
RagApp -- "<b>1. Generate HyDE</b>" --> IA
RagApp -- "<b>2. Hybrid Search</b>" --> IA
RagApp -- "<b>3. Rerank Chunks</b>" --> IA
RagApp -- "<b>4. Decide Strategy</b>" --> RagApp
RagApp -- "<b>5. Generate Final Answer</b>" --> IA
IA -- Manages --> Models
IA -- Manages --> DB
IA -. Streaming Tokens .-> RagApp
RagApp -. Streaming JSON .-> GD
GD -. Renders Text .-> U
style RagApp fill:#bbdefb,stroke:#1976d2,stroke-width:2px
style IA fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
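In code, this orchestration amounts to the RAG API calling the Inference API over HTTP for steps 1-3 and 5 and streaming the result back. A minimal sketch with HTTPX, assuming hypothetical endpoint names and payload shapes:

```python
# Sketch of the orchestration flow in the RAG API; endpoint names and payload
# shapes are assumptions for illustration, not the service's real routes.
import httpx

INFERENCE_API = "http://localhost:8001"


async def answer(question: str):
    """Run the pipeline against the Inference API and stream answer tokens."""
    async with httpx.AsyncClient(base_url=INFERENCE_API, timeout=None) as client:
        # 1. HyDE: expand the query into a hypothetical document.
        hyde = (await client.post("/hyde", json={"query": question})).json()["document"]

        # 2. Hybrid search: dense (FAISS) + sparse (BM25) retrieval.
        chunks = (await client.post("/search", json={"query": hyde})).json()["chunks"]

        # 3. Rerank the retrieved chunks against the original question.
        scored = (await client.post(
            "/rerank", json={"query": question, "documents": chunks}
        )).json()["results"]

        # 4. Decide the context strategy locally from the top score (adaptive step).
        context = "\n\n".join(doc for doc, _ in scored[:5])

        # 5. Stream the final answer token by token back to the caller.
        async with client.stream(
            "POST", "/generate", json={"query": question, "context": context}
        ) as response:
            async for token in response.aiter_text():
                yield token
```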
| Category | Technology / Library | Purpose |
|---|---|---|
| Backend & API | FastAPI, Uvicorn | Building high-performance, asynchronous APIs for both services. |
| AI / Machine Learning | llama-cpp-python, Sentence Transformers, mxbai-rerank, Transformers | Running the LLM, generating embeddings, and reranking documents. |
| Vector & Keyword Search | Faiss, rank_bm25 | Performing efficient similarity and keyword-based retrieval. |
| Frontend / UI | Gradio | Creating a rapid, interactive web interface. |
| Data & Configuration | Pydantic, pydantic-settings | Data validation, type safety, and environment configuration. |
| Communication | HTTPX | Asynchronous HTTP client for inter-service communication. |
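As an illustration of the configuration approach, pydantic-settings can load the variables defined in the `.env` file described in the setup below. This sketch shows the pattern only and is not the services' actual Settings class:

```python
# Illustrative pydantic-settings class; the real services define their own
# Settings classes. Field names mirror the .env variables documented below.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_model_path: str
    hyde_model_path: str
    base_context_file: str
    complete_context_file: str


# Values are read from the environment / .env file at instantiation time,
# so this raises a clear validation error if a required path is missing.
settings = Settings()
```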
This guide provides a detailed process to ensure a smooth setup on a Windows machine with an NVIDIA GPU.
- An NVIDIA GPU with CUDA support.
- Git: Download here.
- Miniconda: Download here.
- Python 3.11 (will be installed via Conda).
Correctly installing the C++ compiler and CUDA Toolkit is the most critical step to avoid compilation errors.
llama-cpp-python must be built from source with a C++ compiler if a pre-built wheel isn't used.
- Download the Visual Studio 2022 Community installer from the official website.
- Run the installer. On the Workloads tab, select "Desktop development with C++".
- CRITICAL STEP: Go to the Individual components tab. Search for and select an older C++ toolset for maximum compatibility. A good choice is "MSVC v143 - VS 2022 C++ x64/x86 build tools (v14.38...)". This ensures that even if CUDA doesn't support the absolute latest compiler, a compatible one is available, preventing common build failures.
- Click Install and wait for the process to complete.
CRITICAL NOTE: Proceed to the next steps only after Visual Studio is fully installed.
- Check your NVIDIA driver's supported CUDA version by opening PowerShell and running `nvidia-smi`.
- Download the matching CUDA Toolkit from the NVIDIA Developer website.
- Run the installer. When prompted, choose the Custom (Advanced) installation.
- Ensure that "Visual Studio Integration" is checked. This is crucial for the toolkit to find your C++ compiler.
- Rebooting the system is required after the installation.
- Clone the Repository:

  ```bash
  git clone https://github.com/KareemSayed1232/Decoupled-Adaptive-Rag-Engine.git
  cd Decoupled-Adaptive-Rag-Engine
  ```

- Create and Activate Conda Environment:

  ```bash
  conda create -n rag_env python=3.11 -y
  conda activate rag_env
  ```
These packages need to be installed carefully to enable GPU acceleration.
- Go to the PyTorch website and find the installation command for your specific system (select Pip, your OS, and your CUDA version).
- Run the command with your CUDA version:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/[YOUR CUDA VERSION]
  # For example, if you have CUDA 12.4:
  # pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  ```

- Verify:

  ```bash
  python -c "import torch; print(f'PyTorch CUDA available: {torch.cuda.is_available()}')"
  ```

  You should see:

  ```
  PyTorch CUDA available: True
  ```
Always try Option 1 first. It's the fastest and easiest method.
Option 1: Install Pre-built Wheel
The library maintainers provide pre-built wheels for multiple CUDA versions.
It is recommended to follow the official instructions and use the recommended CUDA versions to ensure compatibility and avoid errors.
- Install the package, telling pip to look for the correct pre-built wheel. Replace `cu124` below with your CUDA version (e.g., `cu124`, `cu125`).

  ```bash
  # Example for CUDA 12.4
  pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
  ```
- If this command succeeds without errors, you are done with this step.
Option 2: Compile from Source (Fallback Method)
Use this method only if Option 1 fails or if a pre-built wheel is not available for your CUDA version. This uses the Visual Studio compiler we set up in Step 1.
- Set the necessary environment variables in a PowerShell terminal.

  ```powershell
  # These flags tell the installer to build with CUDA support
  $env:CMAKE_ARGS="-DGGML_CUDA=on"
  $env:FORCE_CMAKE="1"
  ```

- Run the installation. This will take several minutes to compile.

  ```powershell
  pip install --force-reinstall --no-cache-dir llama-cpp-python
  ```
After completing either option, verify the installation:

```bash
python -c "from llama_cpp import llama_supports_gpu_offload; print(f'llama.cpp GPU offload enabled: {llama_supports_gpu_offload()}')"
```

You should see:

```
llama.cpp GPU offload enabled: True
```

If the output is `llama.cpp GPU offload enabled: False`, fall back to Option 2.
- Create the `.env` file: Create a copy of the example template.

  ```powershell
  # In PowerShell
  Copy-Item .env.example .env
  ```

- Edit the `.env` file: Open the new `.env` file and update the paths to your local models and data.

  Required Model Paths: These paths must be updated to point to the location of your downloaded GGUF model files on your local machine.

  | Variable | Description | Example Value |
  |---|---|---|
  | `LLM_MODEL_PATH` | Path to the main Large Language Model file. | `data/models/guff/Qwen3-8B-Q5_K_M.gguf` |
  | `HYDE_MODEL_PATH` | Path to the smaller LLM used for HyDE. | `data/models/guff/Phi-3-mini-4k-instruct-Q4_K_M.gguf` |

  Required Data Paths:

  | Variable | Description | Example Value |
  |---|---|---|
  | `BASE_CONTEXT_FILE` | Path to your introductory context file. | `data/base_context.txt` |
  | `COMPLETE_CONTEXT_FILE` | Path to your full knowledge base file. | `data/complete_context.txt` |

- Install Project Python Packages:

  ```bash
  # Install the shared data models package first
  pip install ./packages/shared-models

  # Install dependencies for each service
  pip install -r services/inference_api/requirements.txt
  pip install -r services/rag_api/requirements.txt
  pip install -r clients/gradio-demo/requirements.txt
  ```

- Build Search Indexes: This script processes your documents and creates the search artifacts.

  ```powershell
  # This environment variable can prevent a common error on Windows
  $env:KMP_DUPLICATE_LIB_OK="TRUE"

  # Run the build script
  python scripts/build_index.py
  ```

  This will create `faiss.index` and `bm25.index` in `services/inference_api/artifacts/`.
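For reference, the index build boils down to embedding the document chunks for FAISS and tokenizing them for BM25. The sketch below is a simplified, hypothetical version of what `scripts/build_index.py` does; the chunking, embedding model, and persistence details may differ from the real script:

```python
# Simplified, hypothetical sketch of index building (FAISS + BM25);
# the real scripts/build_index.py may differ in chunking, models, and details.
import pickle

import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = ["chunk one ...", "chunk two ..."]  # normally produced from your data files

# Dense index: embed the chunks and store the vectors in FAISS.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model for illustration
embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "services/inference_api/artifacts/faiss.index")

# Sparse index: tokenize the chunks and persist a BM25 model.
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
with open("services/inference_api/artifacts/bm25.index", "wb") as f:
    pickle.dump(bm25, f)
```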
Run each service in its own terminal. The `rag_env` conda environment must be activated in all three.
| Terminal | Service | Commands |
|---|---|---|
| 1 | Inference API | `conda activate rag_env`<br>`$env:KMP_DUPLICATE_LIB_OK="TRUE"`<br>`cd services/inference_api`<br>`uvicorn src.main:app --port 8001` |
| 2 | RAG API | `conda activate rag_env`<br>`cd services/rag_api`<br>`uvicorn src.main:app --port 8000` |
| 3 | Gradio UI | `conda activate rag_env`<br>`cd clients/gradio-demo`<br>`python app.py` |
Once all services are running, open your browser and navigate to the local URL provided by Gradio (usually http://127.0.0.1:7860).
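If you want to smoke-test the backend without the UI, you can call the RAG API directly. The payload shape below is an assumption for illustration; adjust it to match the shared-models schema:

```python
# Quick smoke test of the RAG API without the Gradio UI; the /ask payload
# shape is an assumption -- adjust it to match the shared-models schema.
import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:8000/ask",
    json={"question": "What does this project do?"},
) as response:
    response.raise_for_status()
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)
```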
Detailed project tree:
```
.
├── clients/
│   └── gradio-demo/        # Frontend UI service
├── data/                   # Source documents and models (not in Git)
├── packages/
│   └── shared-models/      # Shared Pydantic models for type-safe APIs
├── scripts/
│   └── build_index.py      # Processes data and creates search indexes
├── services/
│   ├── inference_api/      # Handles all ML model inference
│   └── rag_api/            # Orchestrates the RAG business logic
├── .env.example            # Environment variable template
├── .gitignore
├── LICENSE
└── README.md
```
Contributions are welcome! Please feel free to fork the project, create a feature branch, and open a pull request.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/NewFeature`)
- Commit your Changes (`git commit -m 'Add some NewFeature'`)
- Push to the Branch (`git push origin feature/NewFeature`)
- Open a Pull Request
Distributed under the MIT License. See the LICENSE file for more information.
Kareem Sayed - LinkedIn - kareemsaid1232@gmail.com
Project Link: https://github.com/KareemSayed1232/Decoupled-Adaptive-Rag-Engine