🏢 Company Internal Chatbot with Role-Based Access Control (RBAC)

A secure internal AI assistant built with Retrieval-Augmented Generation (RAG) that lets employees query company documents while enforcing strict role-based access control (RBAC).

This system is designed to simulate a real enterprise internal knowledge assistant, ensuring:

Users only access documents permitted by their role
AI responses are grounded strictly in authorized company data
Unauthorized data is never retrieved or generated
Every response is traceable to its source

📌 Core Capabilities

🔐 JWT-based authentication
🧭 Strict role-based access control
📚 Semantic search over company documents
🧠 RAG pipeline with source attribution
🔄 Pluggable LLM architecture (OpenAI / Groq / Stub)
🗂️ Vector database with metadata filtering
🧪 RBAC validation and misuse testing
🖥️ Streamlit-based user interface for interaction

🏗️ High-Level Architecture


User (Streamlit / Swagger)
↓
Authentication (JWT)
↓
Role Extraction
↓
RBAC Enforcement
↓
Semantic Retrieval (Vector DB)
↓
Context Assembly
↓
LLM Generation (Optional)
↓
Answer + Sources

🔐 Security and authorization are enforced before retrieval and generation.

🛠️ Technology Stack

Layer	Technology
Backend API	FastAPI
Frontend	Streamlit
Vector Database	Chroma
Embeddings	Sentence Transformers
LLM	OpenAI / Groq (optional)
Authentication	OAuth2 + JWT
Database	SQLite
Language	Python 3.9+

📁 Project Structure


company-chatbot/
│
├── app/                    # Backend application
│   ├── main.py             # FastAPI entry point
│   ├── auth.py             # Authentication & JWT logic
│   ├── rbac.py             # Role hierarchy & permissions
│   ├── search.py           # Semantic search with RBAC filtering
│   ├── rag.py              # RAG pipeline
│   ├── llm_client.py       # LLM abstraction layer
│   └── vectorstore.py      # Vector DB operations
│
├── frontend/
│   └── app.py              # Streamlit frontend (main UI entry)
│
├── scripts/
│   ├── explore_data.py     # Dataset inspection
│   ├── preprocess_docs.py  # Chunking & metadata tagging
│   ├── build_vector_db.py  # Embedding generation & indexing
│   └── test_search.py      # RBAC & retrieval validation
│
├── data/
│   ├── raw/                # Original documents (MD, CSV)
│   └── processed/          # Chunked & enriched documents
│
├── requirements.txt
├── README.md
└── .env.example

👥 User Roles & Permissions

Role	Accessible Data
Employee	General company handbook
Finance	Finance + General
HR	HR + General
Marketing	Marketing + General
Engineering	Engineering + General
C-Level	Full access (all departments)

RBAC rules are enforced at retrieval time, not after generation.
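
The table above can be read as a lookup that is consulted before any retrieval happens. The following is a minimal sketch of that idea, not the contents of app/rbac.py; the names ROLE_DEPARTMENTS, allowed_departments and can_access, and the lowercase department labels, are assumptions made for illustration.

```python
# Hypothetical role-to-department mapping mirroring the table above.
ROLE_DEPARTMENTS = {
    "employee": {"general"},
    "finance": {"finance", "general"},
    "hr": {"hr", "general"},
    "marketing": {"marketing", "general"},
    "engineering": {"engineering", "general"},
    "c-level": {"finance", "hr", "marketing", "engineering", "general"},
}


def allowed_departments(role: str) -> set[str]:
    """Return the departments a role may read; unknown roles get nothing."""
    return set(ROLE_DEPARTMENTS.get(role.strip().lower(), set()))


def can_access(role: str, department: str) -> bool:
    """True if the role is permitted to read documents of the department."""
    return department.lower() in allowed_departments(role)
```

Returning an empty set for unknown roles keeps the default closed: a role that is missing from the mapping can read nothing rather than everything.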

🚀 Getting Started (Local Setup)

Step 1: Clone the Repository

git clone https://github.com/sai-kumar-dev/company-chatbot.git
cd company-chatbot

Step 2: Create and Activate Virtual Environment

python -m venv venv

Activate it:

Windows

venv\Scripts\activate

Mac / Linux

source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Make sure your Python version is 3.9 or later.

📊 Data Preparation & Indexing (Mandatory)

This phase prepares company documents for semantic search.

Step 4: Explore the Dataset

python -m scripts.explore_data

This script:

Lists all departments
Shows document types (Markdown, CSV)
Previews content
Confirms data structure

📌 Purpose: understand document scope and role mapping.

Step 5: Preprocess and Chunk Documents

python -m scripts.preprocess_docs

This performs:

Text cleaning
Section extraction
Chunking into ~300-token segments
Metadata enrichment:
- department
- source file
- allowed roles

Output:

data/processed/document_chunks.jsonl

Each chunk is RBAC-aware.
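
To illustrate what an RBAC-aware chunk might look like, here is a minimal sketch of chunking plus metadata tagging. It is not the code in scripts/preprocess_docs.py: whitespace-separated words stand in for tokens, and the function names and record fields (chunk_id, text, department, source_file, allowed_roles) are assumptions.

```python
import json


def chunk_text(text: str, max_tokens: int = 300) -> list[str]:
    """Split text into segments of at most max_tokens words.

    Whitespace-separated words are a rough stand-in for tokens (an assumption).
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def build_records(text, department, source_file, allowed_roles):
    """Attach RBAC metadata to every chunk of one document."""
    return [
        {
            "chunk_id": f"{source_file}#{index}",
            "text": chunk,
            "department": department,
            "source_file": source_file,
            "allowed_roles": sorted(allowed_roles),
        }
        for index, chunk in enumerate(chunk_text(text))
    ]


def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Because every record carries its own department and allowed roles, later stages can filter on metadata alone without reopening the source file.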

Step 6: Build Vector Database

python -m scripts.build_vector_db

This step:

Generates embeddings using Sentence Transformers
Indexes chunks into Chroma
Stores metadata for secure filtering

Run this step once, then rerun it whenever the source documents change.
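
The stored metadata is what makes filtered retrieval possible. As a sketch, assuming each chunk was indexed with a scalar "department" metadata value, a role's permitted departments could be turned into a Chroma `where` filter like this. The helper name build_where_filter is hypothetical, and the project's actual filter keys may differ.

```python
def build_where_filter(departments):
    """Build a metadata filter restricting results to the given departments.

    The dict follows Chroma's `where` filter syntax; the "department"
    metadata key is an assumption about how chunks were indexed.
    """
    allowed = sorted(set(departments))
    if not allowed:
        # No permitted departments: refuse to build a filter rather than
        # fall back to an unfiltered query.
        raise PermissionError("role has no accessible departments")
    if len(allowed) == 1:
        return {"department": allowed[0]}
    return {"department": {"$in": allowed}}
```

With a Chroma collection object, the result would typically be passed as the `where` argument of `collection.query(...)`, alongside `query_texts` and `n_results`.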

Step 7: Validate Search & RBAC Enforcement

python -m scripts.test_search

This script verifies:

Same query returns different results for different roles
Unauthorized documents are never retrieved
Role hierarchy behaves correctly

Its output is the main evidence that RBAC filtering behaves as intended.
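
Checks of this kind could look like the sketch below. It is not the contents of scripts/test_search.py: a toy keyword match stands in for semantic search, and the corpus, the PERMISSIONS mapping and the function names are all hypothetical.

```python
# Toy corpus: each chunk carries the department it belongs to.
CHUNKS = [
    {"text": "Quarterly budget forecast and revenue targets", "department": "finance"},
    {"text": "Leave policy and budget for team events", "department": "general"},
    {"text": "Salary bands and budget for hiring", "department": "hr"},
]

# Hypothetical permissions for three of the roles.
PERMISSIONS = {
    "employee": {"general"},
    "finance": {"finance", "general"},
    "c-level": {"finance", "hr", "general"},
}


def search(query: str, role: str) -> list[dict]:
    """Return chunks matching the query that the role may read.

    Filtering on department happens before matching, so unauthorized
    chunks are never candidates. Keyword overlap stands in for
    embedding similarity.
    """
    allowed = PERMISSIONS.get(role, set())
    candidates = [c for c in CHUNKS if c["department"] in allowed]
    terms = set(query.lower().split())
    return [c for c in candidates if terms & set(c["text"].lower().split())]


def departments_returned(query: str, role: str) -> set[str]:
    """Collect the departments present in a role's results."""
    return {c["department"] for c in search(query, role)}
```

Asserting on the set of departments returned per role checks two properties together: the same query yields different results for different roles, and no result falls outside the role's permissions.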

🔐 Backend API (FastAPI)

Step 8: Start Backend Server

uvicorn app.main:app --reload

Server URL:

http://127.0.0.1:8000

API Documentation (Swagger UI)

Open:

http://127.0.0.1:8000/docs

Swagger UI provides:

OAuth2 password-based login
Automatic Bearer token handling
Interactive testing of secured endpoints
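
To make the Bearer token concrete, here is a minimal sketch of issuing and verifying an HS256-signed JWT using only the standard library. It is illustrative only: the project most likely relies on a JWT library in app/auth.py, and the function names, the claim names (sub, role, exp) and the 30-minute lifetime are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    """Compute the HMAC-SHA256 signature over header.payload."""
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(username: str, role: str, secret: str, ttl_seconds: int = 1800) -> str:
    """Issue an HS256-signed token carrying the user's role."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "role": role, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: str) -> dict:
    """Return the payload if signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload
```

The role is read from the verified payload, never from the request body, so a client cannot claim a role the server did not sign.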

🧠 LLM Configuration (Optional)

By default, the system runs in stub mode (no external LLM calls).

Enable Groq (Recommended Free Tier)

Create a .env file (see .env.example):

LLM_PROVIDER=groq
GROQ_API_KEY=your_api_key_here
GROQ_MODEL=llama3-8b-8192

Enable OpenAI

LLM_PROVIDER=openai
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini

Restart the backend after changing environment variables.
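
Provider selection from these variables could be sketched as follows. This is not the code in app/llm_client.py: the class and function names are hypothetical, the remote clients are placeholders that make no network calls, and falling back to the stub when an API key is missing is an assumption.

```python
import os


class StubLLM:
    """Offline fallback: echoes the retrieved context without calling any API."""

    name = "stub"

    def generate(self, question: str, context: str) -> str:
        return f"[stub] Context relevant to '{question}':\n{context}"


class RemoteLLMPlaceholder:
    """Stands in for a real Groq or OpenAI client in this sketch."""

    def __init__(self, name: str, model: str):
        self.name = name
        self.model = model

    def generate(self, question: str, context: str) -> str:
        raise NotImplementedError(f"{self.name} client is not implemented in this sketch")


def get_llm_client(env=None):
    """Pick a client from LLM_PROVIDER, falling back to the stub."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "stub").strip().lower()
    if provider == "groq" and env.get("GROQ_API_KEY"):
        return RemoteLLMPlaceholder("groq", env.get("GROQ_MODEL", "llama3-8b-8192"))
    if provider == "openai" and env.get("OPENAI_API_KEY"):
        return RemoteLLMPlaceholder("openai", env.get("OPENAI_MODEL", "gpt-4o-mini"))
    return StubLLM()
```

Because every client exposes the same generate(question, context) method, the rest of the pipeline does not need to know which provider is active.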

🖥️ Frontend (Streamlit Application)

Step 9: Run Streamlit App

streamlit run frontend/app.py

Application runs at:

http://localhost:8501

This is the main user-facing application.

🔒 Security & Abuse Protection

RBAC enforced before retrieval
JWT required for all protected endpoints
Prompt injection cannot widen access, because permissions are applied at retrieval, before any prompt is built
LLM never receives unauthorized context
Source attribution ensures auditability
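
The last two points can be illustrated with a sketch of context assembly. It is not the code in app/rag.py: the function names and chunk fields are hypothetical, and the second permission check at assembly time is an assumed defense-in-depth measure on top of the retrieval-time filtering described above.

```python
def assemble_context(chunks, allowed_departments):
    """Build the prompt context and the source list from retrieved chunks.

    Chunks are re-checked against the caller's departments as a second
    line of defense, so nothing outside them can reach the prompt.
    """
    authorized = [c for c in chunks if c["department"] in allowed_departments]
    context = "\n\n".join(
        f"[{i}] ({c['source_file']}) {c['text']}"
        for i, c in enumerate(authorized, start=1)
    )
    # Deduplicate sources while keeping first-seen order.
    sources = list(dict.fromkeys(c["source_file"] for c in authorized))
    return context, sources


def build_prompt(question, context):
    """Instruct the model to answer only from the supplied context."""
    return (
        "Answer using only the numbered context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Returning the source list alongside the context lets the API attach the originating files to every answer, which is what makes a response auditable.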

📦 Milestones Overview

Milestone	Description
Milestone 1	Data preparation & metadata tagging
Milestone 2	Vector DB & RBAC search
Milestone 3	Authentication & RAG pipeline
Milestone 4	Frontend, testing & documentation

🧠 Design Principles

Security-first architecture
Authorization before generation
Explicit access control
Provider-agnostic LLM integration
Enterprise readiness over demo shortcuts

📌 Notes for Reviewers

To understand the system quickly:

Start with scripts/test_search.py
Then review app/search.py
Then app/rag.py

These files represent the core logic.

📈 Future Improvements

Conversation memory
Usage analytics & audit logs
Admin dashboard
Fine-grained permissions
Multi-tenant support

✅ Summary

This project demonstrates:

Secure AI system design
Production-style RBAC enforcement
Reliable RAG implementation
Clear separation of concerns
Strong emphasis on correctness and safety
