sai-kumar-dev/ai-customer-support-chatbot

🏢 Company Internal Chatbot with Role-Based Access Control (RBAC)

A secure internal AI assistant built with Retrieval-Augmented Generation (RAG). Employees can query company documents while the system enforces strict role-based access control (RBAC).

This system is designed to simulate a real enterprise internal knowledge assistant, ensuring:

  • Users only access documents permitted by their role
  • AI responses are grounded strictly in authorized company data
  • Unauthorized data is never retrieved or generated
  • Every response is traceable to its source

📌 Core Capabilities

  • 🔐 JWT-based authentication
  • 🧭 Strict role-based access control
  • 📚 Semantic search over company documents
  • 🧠 RAG pipeline with source attribution
  • 🔄 Pluggable LLM architecture (OpenAI / Groq / Stub)
  • 🗂️ Vector database with metadata filtering
  • 🧪 RBAC validation and misuse testing
  • 🖥️ Streamlit-based user interface for interaction

🏗️ High-Level Architecture


User (Streamlit / Swagger)
↓
Authentication (JWT)
↓
Role Extraction
↓
RBAC Enforcement
↓
Semantic Retrieval (Vector DB)
↓
Context Assembly
↓
LLM Generation (Optional)
↓
Answer + Sources

🔐 Security and authorization are enforced before retrieval and generation.


🛠️ Technology Stack

| Layer | Technology |
| --- | --- |
| Backend API | FastAPI |
| Frontend | Streamlit |
| Vector Database | Chroma |
| Embeddings | Sentence Transformers |
| LLM | OpenAI / Groq (optional) |
| Authentication | OAuth2 + JWT |
| Database | SQLite |
| Language | Python 3.9+ |

📁 Project Structure


company-chatbot/
│
├── app/                    # Backend application
│   ├── main.py             # FastAPI entry point
│   ├── auth.py             # Authentication & JWT logic
│   ├── rbac.py             # Role hierarchy & permissions
│   ├── search.py           # Semantic search with RBAC filtering
│   ├── rag.py              # RAG pipeline
│   ├── llm_client.py       # LLM abstraction layer
│   ├── vectorstore.py      # Vector DB operations
│
├── frontend/
│   └── app.py              # Streamlit frontend (main UI entry)
│
├── scripts/
│   ├── explore_data.py     # Dataset inspection
│   ├── preprocess_docs.py  # Chunking & metadata tagging
│   ├── build_vector_db.py  # Embedding generation & indexing
│   ├── test_search.py      # RBAC & retrieval validation
│
├── data/
│   ├── raw/                # Original documents (MD, CSV)
│   ├── processed/          # Chunked & enriched documents
│
├── requirements.txt
├── README.md
└── .env.example


👥 User Roles & Permissions

| Role | Accessible Data |
| --- | --- |
| Employee | General company handbook |
| Finance | Finance + General |
| HR | HR + General |
| Marketing | Marketing + General |
| Engineering | Engineering + General |
| C-Level | Full access (all departments) |

RBAC rules are enforced at retrieval time, not post-generation.
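
The role table above could be expressed as a simple lookup. This is a minimal sketch, not the contents of app/rbac.py: the names `ROLE_DEPARTMENTS`, `allowed_departments`, and `can_access`, and the lowercase department labels, are assumptions made for illustration.

```python
from typing import Dict, FrozenSet

GENERAL = "general"

# Hypothetical role-to-department map mirroring the roles table.
ROLE_DEPARTMENTS: Dict[str, FrozenSet[str]] = {
    "employee": frozenset({GENERAL}),
    "finance": frozenset({"finance", GENERAL}),
    "hr": frozenset({"hr", GENERAL}),
    "marketing": frozenset({"marketing", GENERAL}),
    "engineering": frozenset({"engineering", GENERAL}),
    "c-level": frozenset({"finance", "hr", "marketing", "engineering", GENERAL}),
}


def allowed_departments(role: str) -> FrozenSet[str]:
    """Return the departments a role may read; unknown roles get nothing."""
    return ROLE_DEPARTMENTS.get(role.strip().lower(), frozenset())


def can_access(role: str, department: str) -> bool:
    """True only if the department is explicitly granted to the role."""
    return department.strip().lower() in allowed_departments(role)
```

Unknown roles fall through to an empty set, so the default is to deny access rather than to grant it.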


🚀 Getting Started (Local Setup)

Step 1: Clone the Repository

git clone https://github.com/sai-kumar-dev/company-chatbot.git
cd company-chatbot

Step 2: Create and Activate Virtual Environment

python -m venv venv

Activate it:

Windows

venv\Scripts\activate

Mac / Linux

source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Make sure your Python version is 3.9 or later.


📊 Data Preparation & Indexing (Mandatory)

This phase prepares company documents for semantic search.


Step 4: Explore the Dataset

python -m scripts.explore_data

This script:

  • Lists all departments
  • Shows document types (Markdown, CSV)
  • Previews content
  • Confirms data structure

📌 Purpose: understand document scope and role mapping.


Step 5: Preprocess and Chunk Documents

python -m scripts.preprocess_docs

This performs:

  • Text cleaning

  • Section extraction

  • Chunking into ~300-token segments

  • Metadata enrichment:

    • department
    • source file
    • allowed roles

Output:

data/processed/document_chunks.jsonl

Each chunk is RBAC-aware.
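
As a rough illustration of the chunking and metadata enrichment described above, the sketch below splits text into segments of about 300 tokens and serializes them as JSONL. It is not the code in scripts/preprocess_docs.py: tokens are approximated by whitespace-separated words, and the record field names (`id`, `text`, `department`, `source`, `allowed_roles`) are assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, List


def chunk_text(text: str, max_tokens: int = 300) -> Iterator[str]:
    """Yield consecutive segments of at most max_tokens whitespace tokens."""
    words = text.split()
    for start in range(0, len(words), max_tokens):
        yield " ".join(words[start:start + max_tokens])


def build_records(
    text: str,
    department: str,
    source_file: str,
    allowed_roles: Iterable[str],
    max_tokens: int = 300,
) -> List[Dict]:
    """Attach RBAC metadata to every chunk of one document."""
    roles = sorted(allowed_roles)
    return [
        {
            "id": f"{source_file}#{i}",
            "text": chunk,
            "department": department,
            "source": source_file,
            "allowed_roles": roles,
        }
        for i, chunk in enumerate(chunk_text(text, max_tokens))
    ]


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Because the allowed roles travel with each chunk, the retrieval step can filter on them without looking up the source document again.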


Step 6: Build Vector Database

python -m scripts.build_vector_db

This step:

  • Generates embeddings using Sentence Transformers
  • Indexes chunks into Chroma
  • Stores metadata for secure filtering

This step is required only once, unless documents change.
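
One way to store role metadata for filtering is to flatten the allowed roles into one boolean flag per role. The sketch below assumes the vector store accepts only scalar metadata values (strings, numbers, booleans), as Chroma has historically required. The `role_` key prefix and the helper names are hypothetical, and scripts/build_vector_db.py may use a different layout.

```python
from typing import Dict, Iterable, Union

Scalar = Union[str, int, float, bool]

# Roles taken from the permissions table; the flag naming is an assumption.
ROLES = ("employee", "finance", "hr", "marketing", "engineering", "c-level")


def flag_key(role: str) -> str:
    """Map a role name to a metadata key, e.g. 'C-Level' -> 'role_c_level'."""
    return "role_" + role.strip().lower().replace("-", "_")


def to_scalar_metadata(
    department: str, source: str, allowed_roles: Iterable[str]
) -> Dict[str, Scalar]:
    """Build chunk metadata that contains scalar values only."""
    allowed = {role.strip().lower() for role in allowed_roles}
    metadata: Dict[str, Scalar] = {"department": department, "source": source}
    for role in ROLES:
        metadata[flag_key(role)] = role in allowed
    return metadata


def role_filter(role: str) -> Dict[str, Dict[str, bool]]:
    """Build an equality filter in the operator style Chroma uses for `where`."""
    return {flag_key(role): {"$eq": True}}
```

A filter built by `role_filter` would be passed with the query so that chunks a role cannot read are excluded before ranking.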


Step 7: Validate Search & RBAC Enforcement

python -m scripts.test_search

This script verifies:

  • Same query returns different results for different roles
  • Unauthorized documents are never retrieved
  • Role hierarchy behaves correctly

The output of this script is the primary evidence that RBAC enforcement works as intended.
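
The checks listed above can be pictured with a toy in-memory index. This sketch is not scripts/test_search.py: the two-dimensional vectors, chunk ids, and the `search` helper are invented for illustration, and a real run would query Chroma with Sentence Transformers embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[Dict], query_vec: Sequence[float], role: str, top_k: int = 3) -> List[Dict]:
    """Filter by role first, then rank only the permitted chunks."""
    permitted = [chunk for chunk in index if role in chunk["allowed_roles"]]
    ranked = sorted(permitted, key=lambda c: cosine(c["vector"], query_vec), reverse=True)
    return ranked[:top_k]


# Toy index: made-up vectors and role sets.
INDEX = [
    {"id": "handbook#0", "vector": [0.9, 0.1],
     "allowed_roles": {"employee", "finance", "hr", "c-level"}},
    {"id": "budget#0", "vector": [0.1, 0.9],
     "allowed_roles": {"finance", "c-level"}},
    {"id": "payroll#0", "vector": [0.2, 0.8],
     "allowed_roles": {"hr", "c-level"}},
]

QUERY = [0.0, 1.0]
finance_ids = [c["id"] for c in search(INDEX, QUERY, "finance")]
hr_ids = [c["id"] for c in search(INDEX, QUERY, "hr")]
employee_ids = [c["id"] for c in search(INDEX, QUERY, "employee")]
```

The same query yields different results per role because the role filter runs before similarity ranking, so unauthorized chunks never enter the candidate set.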


🔐 Backend API (FastAPI)

Step 8: Start Backend Server

uvicorn app.main:app --reload

Server URL:

http://127.0.0.1:8000

API Documentation (Swagger UI)

Open:

http://127.0.0.1:8000/docs

Swagger UI provides:

  • OAuth2 password-based login
  • Automatic Bearer token handling
  • Interactive testing of secured endpoints
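
To show what the Bearer token carries, here is a standard-library sketch of HS256 token signing and role extraction. It is an illustration only, not the code in app/auth.py, which would more likely rely on a maintained JWT library. The claim names `sub`, `role`, and `exp` and all function names are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def encode_token(claims: Dict[str, Any], secret: str) -> str:
    """Build header.payload.signature using HMAC-SHA256."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    payload = _b64url(json.dumps(claims, separators=(",", ":")).encode())
    signature = _b64url(_sign(f"{header}.{payload}", secret))
    return f"{header}.{payload}.{signature}"


def decode_token(token: str, secret: str, now: Optional[float] = None) -> Dict[str, Any]:
    """Verify signature, algorithm, and expiry; raise ValueError on any failure."""
    header, payload, signature = token.split(".")  # ValueError unless exactly three parts
    expected = _sign(f"{header}.{payload}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise ValueError("token expired or missing exp")
    return claims


def role_from_token(token: str, secret: str, now: Optional[float] = None) -> Optional[str]:
    """Return the role claim, or None if the token cannot be trusted."""
    try:
        return decode_token(token, secret, now).get("role")
    except ValueError:
        return None
```

The role is read from the verified token, never from the request body, so a client cannot claim a role it was not issued.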

🧠 LLM Configuration (Optional)

By default, the system runs in stub mode (no external LLM calls).

Enable Groq (Recommended Free Tier)

Create a .env file:

LLM_PROVIDER=groq
GROQ_API_KEY=your_api_key_here
GROQ_MODEL=llama3-8b-8192

Enable OpenAI

LLM_PROVIDER=openai
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini

Restart the backend after changing environment variables.
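
A provider switch driven by these variables might look like the sketch below. The variable names come from the configuration above, but the `select_provider` helper and its fallback to stub mode when a key is missing are assumptions, not the documented behavior of app/llm_client.py.

```python
import os
from typing import Mapping, Optional

# API key variable required by each external provider.
KEY_VARS = {"groq": "GROQ_API_KEY", "openai": "OPENAI_API_KEY"}


def select_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'groq', 'openai', or 'stub' based on environment variables.

    Falls back to 'stub' when LLM_PROVIDER is unset, unknown, or has no API key.
    """
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "stub").strip().lower()
    key_var = KEY_VARS.get(provider)
    if key_var is None or not env.get(key_var):
        return "stub"
    return provider


def stub_generate(prompt: str) -> str:
    """Stand-in generator used when no external LLM is configured."""
    return "[stub] No LLM configured; returning retrieved context only."
```

Passing the environment as an argument keeps the selection logic easy to test without touching real credentials.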


🖥️ Frontend (Streamlit Application)

Step 9: Run Streamlit App

streamlit run frontend/app.py

Application runs at:

http://localhost:8501

This is the main user-facing application.


🔒 Security & Abuse Protection

  • RBAC enforced before retrieval
  • JWT required for all protected endpoints
  • Prompt injection cannot widen access, because permissions are enforced at retrieval, before any prompt reaches the LLM
  • LLM never receives unauthorized context
  • Source attribution ensures auditability
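
The guarantees above follow from the order of operations: filter first, then build the prompt, then generate. The following sketch shows that order under assumed names (`build_prompt`, `answer_with_sources`, and the chunk fields). It is not the code in app/rag.py.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, chunks: List[Dict]) -> str:
    """Assemble a grounded prompt from already-authorized chunks."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}" for i, chunk in enumerate(chunks, 1)
    )
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer_with_sources(
    question: str, role: str, chunks: List[Dict], generate: Callable[[str], str]
) -> Dict:
    """Drop unauthorized chunks before the prompt exists, then attach sources."""
    authorized = [chunk for chunk in chunks if role in chunk["allowed_roles"]]
    if not authorized:
        return {"answer": "No authorized documents match this query.", "sources": []}
    prompt = build_prompt(question, authorized)
    sources = sorted({chunk["source"] for chunk in authorized})
    return {"answer": generate(prompt), "sources": sources}


# Made-up sample data for illustration.
CHUNKS = [
    {"text": "Employees receive 20 days of leave.", "source": "handbook.md",
     "allowed_roles": {"employee", "finance"}},
    {"text": "Quarterly budget details.", "source": "finance_q1.md",
     "allowed_roles": {"finance"}},
]
```

Whatever the question says, the model only sees text that passed the role filter, and the returned `sources` list names every document that contributed context.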

📦 Milestones Overview

| Milestone | Description |
| --- | --- |
| Milestone 1 | Data preparation & metadata tagging |
| Milestone 2 | Vector DB & RBAC search |
| Milestone 3 | Authentication & RAG pipeline |
| Milestone 4 | Frontend, testing & documentation |

🧠 Design Principles

  • Security-first architecture
  • Authorization before generation
  • Explicit access control
  • Provider-agnostic LLM integration
  • Enterprise-readiness over demos

📌 Notes for Reviewers

To understand the system quickly:

  1. Start with scripts/test_search.py
  2. Then review app/search.py
  3. Then app/rag.py

These files represent the core logic.


📈 Future Improvements

  • Conversation memory
  • Usage analytics & audit logs
  • Admin dashboard
  • Fine-grained permissions
  • Multi-tenant support

✅ Summary

This project demonstrates:

  • Secure AI system design
  • Production-style RBAC enforcement
  • Reliable RAG implementation
  • Clear separation of concerns
  • Strong emphasis on correctness and safety

About

AI-powered company chatbot for answering FAQs, support queries, and business information.
