
πŸ€– Customer Support Chatbot

Python 3.11+ License: MIT Docker FastAPI

A production-grade customer support chatbot that combines advanced prompt engineering, PEFT/LoRA fine-tuning, LLM-based intent classification, and Redis-backed conversation management. Built to demonstrate how real-world AI support systems are designed -- from the system prompt all the way to the training pipeline.

Chatbot Screenshot


πŸ“š What You'll Learn

| Concept | Description |
|---|---|
| Prompt Engineering | System prompts, few-shot examples, chain-of-thought reasoning, template versioning |
| Fine-tuning with PEFT | Parameter-Efficient Fine-Tuning -- train an LLM on your data without updating the full model weights |
| LoRA Adapters | Low-Rank Adaptation -- how and why it works, with a hands-on training pipeline |
| Conversation Management | Finite state machines, sliding-window context, LLM-based summarization |
| Intent Classification | LLM-powered multi-label classification with structured JSON output |
| Production Patterns | Redis session storage, streaming responses, health checks, error handling |

πŸ—οΈ Architecture

                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚   Client     β”‚
                         β”‚  (REST/SSE)  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚     FastAPI Application   β”‚
                    β”‚         (api.py)          β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚                    β”‚                     β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   Intent    β”‚    β”‚    Prompt      β”‚    β”‚  Conversation   β”‚
    β”‚ Classifier  β”‚    β”‚   Registry     β”‚    β”‚    Manager      β”‚
    β”‚(classifier  β”‚    β”‚  (prompts.py)  β”‚    β”‚(conversation.py)β”‚
    β”‚    .py)     β”‚    β”‚                β”‚    β”‚                 β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                   β”‚                     β”‚
           β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚  β”‚ billing   β”‚    β”‚  β”‚ YAML         β”‚   β”‚  β”‚  Redis  β”‚
           β”œβ”€β”€β”‚ technical β”‚    β”œβ”€β”€β”‚ Templates    β”‚   └──│ Sessionsβ”‚
           β”‚  β”‚ account   β”‚    β”‚  β”‚ + Versions   β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚  β”‚ general   β”‚    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚  β”‚ escalationβ”‚    β”‚
           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚                   β”œβ”€β”€β”‚ Few-Shot Mgr β”‚
           β”‚                   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                   β”‚
           β”‚                   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚                   └──│ CoT Template β”‚
           β”‚                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚                   LLM Provider                   β”‚
     β”‚              (Anthropic Claude API)              β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚               Fine-tuning Pipeline               β”‚
     β”‚                 (finetuning.py)                  β”‚
     β”‚                                                  β”‚
     β”‚  DatasetPreparator β†’ LoRA Config β†’ SFTTrainer    β”‚
     β”‚       ↓                                          β”‚
     β”‚  Base Model β†’ QLoRA (4-bit) β†’ Train β†’ Export     β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Option 1: Docker (Recommended)

# Build the image
docker build -f Dockerfile -t customer-support-chatbot .

# Run with your API key
docker run -p 8000:8000 \
  -e CHATBOT_ANTHROPIC_API_KEY=your-key \
  customer-support-chatbot

Option 2: Local Development

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

# Set environment variables
export CHATBOT_ANTHROPIC_API_KEY=your-key

# (Optional) Start Redis for conversation persistence
docker run -d -p 6379:6379 redis:7-alpine

# Run the server (from the project root)
python -m customer_support.main

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.


πŸ“‘ API Reference

Health Check

curl http://localhost:8000/health

Chat (Synchronous)

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "I was charged twice for my subscription this month",
    "customer_name": "Alice",
    "use_chain_of_thought": true
  }'

Response includes the reply, detected intent, confidence, conversation state, and sentiment trend.

Chat (Streaming via SSE)

curl -N -X POST http://localhost:8000/api/v1/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "How do I reset my password?", "session_id": "abc123"}'

Intent Classification

curl -X POST http://localhost:8000/api/v1/classify \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Your app keeps crashing whenever I try to upload photos",
    "context": [
      {"role": "user", "content": "I need help with the mobile app"},
      {"role": "assistant", "content": "Of course! What issue are you experiencing?"}
    ]
  }'

Conversation History

curl http://localhost:8000/api/v1/conversations/abc123

Prompt Templates

# List all registered templates
curl http://localhost:8000/api/v1/prompts

# Test-render a prompt template
curl -X POST http://localhost:8000/api/v1/prompts/test \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "billing_support",
    "knowledge_base": "Refund policy: full refunds within 30 days."
  }'

πŸ”¬ Implementation Deep Dive

1. Prompt Engineering

The chatbot uses a composable, version-controlled prompt system built on four layers:

Layer 1 -- SystemPromptBuilder (Fluent API):

prompt = (
    SystemPromptBuilder()
    .with_role("Acme Corp Customer Support Agent")
    .with_knowledge_base(kb_text)
    .with_tone("professional", "empathetic", "concise")
    .with_escalation_rules(rules)
    .with_response_format(fmt)
    .with_guardrails(safety_policy)
    .with_chain_of_thought(visible=False)
    .with_few_shot_examples(manager, "billing")
    .build()
)

Each section is wrapped in XML tags (<role>, <knowledge_base>, <tone_guidelines>, etc.) for clear prompt structure. Sections are assembled in a deterministic priority order.
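
For illustration, an assembled prompt takes roughly this shape (tag names beyond role, knowledge_base, tone_guidelines, and examples are illustrative; contents elided):

<role>Acme Corp Customer Support Agent ...</role>
<knowledge_base>...</knowledge_base>
<tone_guidelines>professional, empathetic, concise</tone_guidelines>
<escalation_rules>...</escalation_rules>
<response_format>...</response_format>
<guardrails>...</guardrails>
<examples>...</examples>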

Layer 2 -- YAML Templates with Versioning:

# data/templates/billing_support.yaml
name: billing_support
description: Handles billing, refunds, and subscription queries
version: "1.2.0"
chain_of_thought_enabled: true
system_prompt: |
  You are a billing support specialist for Acme Corp...
few_shot_examples:
  - user: "I was charged twice this month."
    assistant: "I'm sorry about the duplicate charge. Let me look into..."
    tags: [billing, refund, duplicate-charge]

Every template is snapshotted with a SHA-256 hash, so prompt changes are trackable and reversible.
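
A minimal sketch of how such a fingerprint can be computed (illustrative helper, not the project's exact code):

import hashlib
import yaml

def template_fingerprint(path: str) -> str:
    """Hash the canonical YAML so any prompt change yields a new fingerprint."""
    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    # Serialize with sorted keys so formatting-only edits don't change the hash
    canonical = yaml.safe_dump(data, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(template_fingerprint("data/templates/billing_support.yaml"))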

Layer 3 -- Few-Shot Examples:

The FewShotManager organizes examples by category and renders them as XML blocks inside the system prompt:

<examples>
  <example>
    <user>I was charged twice this month.</user>
    <assistant>I'm sorry about the duplicate charge...</assistant>
  </example>
</examples>

Layer 4 -- Chain-of-Thought:

For complex issues, the ChainOfThoughtTemplate wraps the user message in a structured reasoning framework:

Step 1 -- Problem identification: What is the core issue?
Step 2 -- Context gathering: What additional info is relevant?
Step 3 -- Solution exploration: List 2-3 possible approaches
Step 4 -- Escalation check: Does this require a human?
Step 5 -- Response composition: Draft the final response

The model reasons internally and outputs only the final customer-facing message.
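
A simplified sketch of that wrapping (the real ChainOfThoughtTemplate may differ in detail):

REASONING_FRAMEWORK = """\
Work through these steps privately, then output ONLY the final
customer-facing reply:
1. Problem identification: What is the core issue?
2. Context gathering: What additional info is relevant?
3. Solution exploration: List 2-3 possible approaches
4. Escalation check: Does this require a human?
5. Response composition: Draft the final response"""

def wrap_with_cot(user_message: str) -> str:
    # Embed the reasoning framework around the raw customer message
    return f"{REASONING_FRAMEWORK}\n\n<customer_message>{user_message}</customer_message>"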

2. LoRA Fine-tuning

What is LoRA? Low-Rank Adaptation freezes the pre-trained model weights and injects small trainable matrices into specific layers:

Original weight matrix W (d x d):        4096 x 4096 β‰ˆ 16.8M params

LoRA decomposition:
  W' = W + (A x B)
  where A is (d x r) and B is (r x d)

With rank r=16:
  A: 4096 x 16 =   65K params  ┐
  B: 16 x 4096 =   65K params  β”œβ”€β”€ 0.78% of original!
                    130K params β”˜
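
The arithmetic is easy to sanity-check:

d, r = 4096, 16
full = d * d            # 16,777,216 frozen params in the original matrix
lora = d * r + r * d    # 131,072 trainable params across A and B
print(f"{lora / full:.2%}")  # -> 0.78%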

The Training Pipeline:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Load JSONL  │───▢│ Apply Chat   │───▢│  Tokenize   │───▢│  Train   β”‚
β”‚ Training    β”‚    β”‚ Template     β”‚    β”‚  (truncate  β”‚    β”‚  (SFT +  β”‚
β”‚ Data        β”‚    β”‚ (format for  β”‚    β”‚   to 2048)  β”‚    β”‚  QLoRA)  β”‚
β”‚             β”‚    β”‚  the model)  β”‚    β”‚             β”‚    β”‚          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                                                               β”‚
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”
                   β”‚ Export       │◀───│  Evaluate   │◀───│ Validate β”‚
                   β”‚ Adapter      β”‚    β”‚  (loss,     β”‚    β”‚ (held-   β”‚
                   β”‚ (~50MB)      β”‚    β”‚  perplexity)β”‚    β”‚  out set)β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key configuration from LoRAConfig:

| Parameter | Default | What it controls |
|---|---|---|
| rank | 16 | Size of the low-rank matrices (higher = more capacity) |
| alpha | 32 | Scaling factor (rule of thumb: 2x rank) |
| dropout | 0.05 | Regularization to prevent overfitting |
| target_modules | q/k/v/o/gate/up/down_proj | Which layers get adapters |
| quantization_bits | 4 | QLoRA: 4-bit NormalFloat quantization |
| learning_rate | 2e-4 | Peak LR with a cosine schedule |
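
As a rough sketch, these defaults map onto the HuggingFace PEFT and transformers APIs as follows (standard library interfaces, not the project's exact code):

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

lora_config = LoraConfig(
    r=16,              # rank of the low-rank A/B matrices
    lora_alpha=32,     # scaling factor (2x rank)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # QLoRA: quantize the frozen base model
    bnb_4bit_quant_type="nf4",         # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,
)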

Run the pipeline:

python -m customer_support.finetuning \
  --base-model mistralai/Mistral-7B-Instruct-v0.3 \
  --dataset data/training/support_conversations.jsonl \
  --output-dir ./training_output \
  --epochs 3 --lora-rank 16

3. Conversation Management

The conversation module implements a finite state machine with Redis-backed persistence:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    1st user msg    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ GREETING │───────────────────▢│ UNDERSTANDING β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚ 2+ turns
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
                                β”‚   RESOLVING   │◄─────┐
                                β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
                                        β”‚              β”‚ new issue
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”      β”‚
                  satisfaction  β”‚    CLOSING    β”‚β”€β”€β”€β”€β”€β”€β”˜
                   signal       β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
                                β”‚    CLOSED     β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  Any state ───(anger/legal/security)───▢ ESCALATED ───▢ CLOSED
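
A minimal sketch of the state machine (states from the diagram; the trigger flags here are illustrative):

from enum import Enum

class ConversationState(Enum):
    GREETING = "greeting"
    UNDERSTANDING = "understanding"
    RESOLVING = "resolving"
    CLOSING = "closing"
    CLOSED = "closed"
    ESCALATED = "escalated"

def next_state(state: ConversationState, *, turns: int, escalate: bool,
               satisfied: bool, new_issue: bool) -> ConversationState:
    if escalate:                                  # anger/legal/security, any state
        return ConversationState.ESCALATED
    if state is ConversationState.GREETING:       # 1st user msg
        return ConversationState.UNDERSTANDING
    if state is ConversationState.UNDERSTANDING and turns >= 2:
        return ConversationState.RESOLVING
    if state is ConversationState.RESOLVING and satisfied:
        return ConversationState.CLOSING          # satisfaction signal
    if state is ConversationState.CLOSING:
        return ConversationState.RESOLVING if new_issue else ConversationState.CLOSED
    return state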

Sliding-Window Context: When conversations exceed max_history messages (default: 20), older messages are summarized and prepended to maintain context without exceeding token limits:

[Summary of turns 1-15] + [Full messages 16-20] β†’ LLM
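
Sketched in code (summarize_with_llm is a hypothetical stand-in for the project's summarization call):

MAX_HISTORY = 20   # matches the max_history default
KEEP_RECENT = 5

def summarize_with_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for the LLM-based summarization call."""
    ...

def compact_history(messages: list[dict]) -> list[dict]:
    if len(messages) <= MAX_HISTORY:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize_with_llm(older)
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent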

Sentiment Tracking: Each user message records a sentiment level (very_negative to very_positive). The system computes a trend (improving / stable / deteriorating) to trigger proactive escalation.
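
One simple way to compute that trend (illustrative; the real heuristic may differ):

LEVELS = {"very_negative": -2, "negative": -1, "neutral": 0,
          "positive": 1, "very_positive": 2}

def sentiment_trend(history: list[str], window: int = 3) -> str:
    scores = [LEVELS[s] for s in history]
    if len(scores) < 2 * window:
        return "stable"
    # Compare the most recent window against the one before it
    delta = sum(scores[-window:]) - sum(scores[-2 * window:-window])
    if delta > 0:
        return "improving"
    if delta < 0:
        return "deteriorating"
    return "stable"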

4. Intent Classification

The classifier uses an LLM-based approach with structured JSON output rather than a traditional ML model:

Input:  "Your app keeps crashing whenever I try to upload photos"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LLM Classification (temperature=0.0 for determinism)β”‚
β”‚                                                      β”‚
β”‚  System Prompt: taxonomy definition + JSON schema    β”‚
β”‚  User Message:  the customer's text                  β”‚
β”‚  Context:       last 4 messages (optional)           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                   β”‚
Output JSON:                       β–Ό
{
  "primary_intent": "technical",
  "primary_confidence": 0.92,
  "secondary_intents": [
    {"intent": "escalation", "confidence": 0.35}
  ],
  "reasoning": "User reports app crash during photo upload - technical issue"
}
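
A condensed sketch of that call with the Anthropic Python SDK (model name, taxonomy prompt, and parsing are illustrative):

import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(message: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder; use the configured model
        max_tokens=512,
        temperature=0.0,             # deterministic classification
        system=("Classify the message as billing/technical/account/general/"
                "escalation. Reply with JSON only: primary_intent, "
                "primary_confidence, secondary_intents, reasoning."),
        messages=[{"role": "user", "content": message}],
    )
    return json.loads(response.content[0].text)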

Why LLM-based instead of traditional ML?

| Factor | LLM Classification | Traditional ML (e.g. BERT) |
|---|---|---|
| Training data needed | Zero (zero-shot) | Hundreds to thousands of labeled examples |
| New intent support | Update the prompt | Retrain the model |
| Explainability | Built-in reasoning field | Requires a separate explanation method |
| Latency | ~200-500 ms | ~10-50 ms |
| Cost | API call per message | One-time training cost |

The classification result drives routing to the appropriate prompt template (billing_support, technical_support, or general_support).


πŸ› οΈ Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Framework | FastAPI | Async REST API with OpenAPI docs |
| LLM Provider | Anthropic Claude | Chat completions and classification |
| Prompt Management | YAML + Jinja2 | Version-controlled prompt templates |
| Session Storage | Redis | Conversation persistence with TTL |
| Fine-tuning | PEFT, LoRA, bitsandbytes | Parameter-efficient model adaptation |
| Training | HuggingFace Transformers, TRL | SFTTrainer with QLoRA support |
| Experiment Tracking | Weights & Biases | Training metrics and model comparison |
| Streaming | SSE-Starlette | Real-time token streaming |
| Database | PostgreSQL + SQLAlchemy | (Optional) structured data storage |
| Config | Pydantic Settings | Type-safe environment configuration |
| Logging | structlog | Structured JSON logging |
| Containerization | Docker (multi-stage) | Secure, slim production image |

πŸ“ Project Structure

02-customer-support-chatbot/
β”œβ”€β”€ src/customer_support/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py              # Uvicorn entry point
β”‚   β”œβ”€β”€ api.py               # FastAPI app: chat, classify, prompts, conversations
β”‚   β”œβ”€β”€ config.py            # Settings (env vars, model config, Redis URL)
β”‚   β”œβ”€β”€ prompts.py           # SystemPromptBuilder, PromptRegistry, FewShotManager, CoT
β”‚   β”œβ”€β”€ classifier.py        # IntentClassifier with LLM-based multi-label classification
β”‚   β”œβ”€β”€ conversation.py      # ConversationManager, state machine, sentiment tracking
β”‚   └── finetuning.py        # LoRA training pipeline (DatasetPreparator, FineTuningPipeline)
β”œβ”€β”€ data/templates/
β”‚   β”œβ”€β”€ billing_support.yaml
β”‚   β”œβ”€β”€ technical_support.yaml
β”‚   └── general_support.yaml
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ conftest.py
β”‚   β”œβ”€β”€ test_api.py
β”‚   └── test_prompts.py
β”œβ”€β”€ k8s/
β”‚   └── deployment.yaml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ pyproject.toml
└── README.md

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Install dev dependencies: pip install -e ".[dev]"
  4. Run tests: pytest tests/ -v
  5. For fine-tuning work, install training extras: pip install -e ".[training]"
  6. Submit a pull request

πŸ“„ License

This project is part of the AI Engineer Portfolio and is licensed under the MIT License.
