A production-grade customer support chatbot that combines advanced prompt engineering, PEFT/LoRA fine-tuning, LLM-based intent classification, and Redis-backed conversation management. Built to demonstrate how real-world AI support systems are designed -- from the system prompt all the way to the training pipeline.
| Concept | Description |
|---|---|
| Prompt Engineering | System prompts, few-shot examples, chain-of-thought reasoning, template versioning |
| Fine-tuning with PEFT | Parameter-Efficient Fine-Tuning -- train an LLM on your data without updating the full model weights |
| LoRA Adapters | Low-Rank Adaptation -- how and why it works, hands-on training pipeline |
| Conversation Management | Finite state machines, sliding-window context, LLM-based summarization |
| Intent Classification | LLM-powered multi-label classification with structured JSON output |
| Production Patterns | Redis session storage, streaming responses, health checks, error handling |
```
                    ┌──────────────┐
                    │    Client    │
                    │  (REST/SSE)  │
                    └──────┬───────┘
                           │
              ┌────────────▼─────────────┐
              │   FastAPI Application    │
              │         (api.py)         │
              └────────────┬─────────────┘
                           │
      ┌────────────────────┼─────────────────────┐
      │                    │                     │
┌─────▼─────────┐  ┌───────▼────────┐  ┌─────────▼─────────┐
│    Intent     │  │     Prompt     │  │   Conversation    │
│  Classifier   │  │    Registry    │  │      Manager      │
│(classifier.py)│  │  (prompts.py)  │  │ (conversation.py) │
└─────┬─────────┘  └───────┬────────┘  └─────────┬─────────┘
      │                    │                     │
      │  ┌────────────┐    │  ┌──────────────┐   │  ┌──────────┐
      │  │ billing    │    │  │    YAML      │   │  │  Redis   │
      ├──┤ technical  │    ├──┤  Templates   │   └──┤ Sessions │
      │  │ account    │    │  │  + Versions  │      └──────────┘
      │  │ general    │    │  └──────────────┘
      │  │ escalation │    │
      │  └────────────┘    │  ┌──────────────┐
      │                    ├──┤ Few-Shot Mgr │
      │                    │  └──────────────┘
      │                    │
      │                    │  ┌──────────────┐
      │                    └──┤ CoT Template │
      │                       └──────────────┘
      │
┌─────▼─────────────────────────────────────────────┐
│                   LLM Provider                    │
│             (Anthropic Claude API)                │
└───────────────────────────────────────────────────┘
      │
┌─────▼─────────────────────────────────────────────┐
│               Fine-tuning Pipeline                │
│                 (finetuning.py)                   │
│                                                   │
│  DatasetPreparator ─▶ LoRA Config ─▶ SFTTrainer   │
│        │                                          │
│  Base Model ─▶ QLoRA (4-bit) ─▶ Train ─▶ Export   │
└───────────────────────────────────────────────────┘
```
```bash
docker build -f Dockerfile \
  -t customer-support-chatbot .

# Run with your API key
docker run -p 8000:8000 \
  -e CHATBOT_ANTHROPIC_API_KEY=your-key \
  customer-support-chatbot
```

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"
# Set environment variables
export CHATBOT_ANTHROPIC_API_KEY=your-key
# (Optional) Start Redis for conversation persistence
docker run -d -p 6379:6379 redis:7-alpine
# Run the server
# Already in project root
python -m customer_support.main
```

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.
```bash
curl http://localhost:8000/health
```

```bash
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "I was charged twice for my subscription this month",
    "customer_name": "Alice",
    "use_chain_of_thought": true
  }'
```

The response includes the reply, detected intent, confidence, conversation state, and sentiment trend.
```bash
curl -N -X POST http://localhost:8000/api/v1/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "How do I reset my password?", "session_id": "abc123"}'
```

```bash
curl -X POST http://localhost:8000/api/v1/classify \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Your app keeps crashing whenever I try to upload photos",
    "context": [
      {"role": "user", "content": "I need help with the mobile app"},
      {"role": "assistant", "content": "Of course! What issue are you experiencing?"}
    ]
  }'
```

```bash
curl http://localhost:8000/api/v1/conversations/abc123
```

```bash
# List all registered templates
curl http://localhost:8000/api/v1/prompts
# Test-render a prompt template
curl -X POST http://localhost:8000/api/v1/prompts/test \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "billing_support",
    "knowledge_base": "Refund policy: full refunds within 30 days."
  }'
```

The chatbot uses a composable, version-controlled prompt system built on four layers:
Layer 1 -- SystemPromptBuilder (Fluent API):
```python
prompt = (
    SystemPromptBuilder()
    .with_role("Acme Corp Customer Support Agent")
    .with_knowledge_base(kb_text)
    .with_tone("professional", "empathetic", "concise")
    .with_escalation_rules(rules)
    .with_response_format(fmt)
    .with_guardrails(safety_policy)
    .with_chain_of_thought(visible=False)
    .with_few_shot_examples(manager, "billing")
    .build()
)
```

Each section is wrapped in XML tags (`<role>`, `<knowledge_base>`, `<tone_guidelines>`, etc.) for clear prompt structure. Sections are assembled in a deterministic priority order.
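To make the "deterministic priority order" concrete, here is a deliberately simplified sketch of the pattern (hypothetical names; not the actual `SystemPromptBuilder` API):

```python
# Simplified sketch of priority-ordered, XML-tagged prompt assembly.
SECTION_ORDER = ["role", "knowledge_base", "tone_guidelines", "guardrails", "examples"]

class PromptSketch:
    def __init__(self):
        self.sections: dict[str, str] = {}

    def with_section(self, name: str, content: str) -> "PromptSketch":
        self.sections[name] = content
        return self  # fluent: every call returns the builder

    def build(self) -> str:
        # Emit sections in the fixed priority order, each wrapped in XML tags,
        # regardless of the order the with_* calls were made in.
        return "\n\n".join(
            f"<{name}>\n{self.sections[name]}\n</{name}>"
            for name in SECTION_ORDER
            if name in self.sections
        )

prompt = (
    PromptSketch()
    .with_section("tone_guidelines", "professional, empathetic, concise")
    .with_section("role", "Acme Corp Customer Support Agent")
    .build()
)
print(prompt)  # <role> section comes first despite being added second
```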
Layer 2 -- YAML Templates with Versioning:
```yaml
# data/templates/billing_support.yaml
name: billing_support
description: Handles billing, refunds, and subscription queries
version: "1.2.0"
chain_of_thought_enabled: true
system_prompt: |
  You are a billing support specialist for Acme Corp...
few_shot_examples:
  - user: "I was charged twice this month."
    assistant: "I'm sorry about the duplicate charge. Let me look into..."
tags: [billing, refund, duplicate-charge]
```

Every template is snapshotted with a SHA-256 hash, so prompt changes are trackable and reversible.
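The snapshotting idea can be illustrated in a few lines (a sketch with assumed helper names; the real `PromptRegistry` may differ):

```python
import hashlib
from pathlib import Path

import yaml  # PyYAML

def load_template(path: str) -> dict:
    """Load a YAML prompt template and attach a content hash for versioning."""
    raw = Path(path).read_bytes()
    template = yaml.safe_load(raw)
    # Hash the raw bytes so any edit -- even whitespace -- changes the snapshot ID
    template["content_hash"] = hashlib.sha256(raw).hexdigest()
    return template

tpl = load_template("data/templates/billing_support.yaml")
print(tpl["name"], tpl["version"], tpl["content_hash"][:12])
```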
Layer 3 -- Few-Shot Examples:
The FewShotManager organizes examples by category and renders them as XML blocks inside the system prompt:
```xml
<examples>
  <example>
    <user>I was charged twice this month.</user>
    <assistant>I'm sorry about the duplicate charge...</assistant>
  </example>
</examples>
```
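A minimal renderer in that spirit (hypothetical; the real `FewShotManager` also selects examples by category and caps their number):

```python
from dataclasses import dataclass

@dataclass
class Example:
    user: str
    assistant: str

def render_examples(examples: list[Example]) -> str:
    """Render few-shot pairs as the XML block embedded in the system prompt."""
    blocks = [
        "  <example>\n"
        f"    <user>{ex.user}</user>\n"
        f"    <assistant>{ex.assistant}</assistant>\n"
        "  </example>"
        for ex in examples
    ]
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"

print(render_examples([Example("I was charged twice this month.",
                               "I'm sorry about the duplicate charge...")]))
```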
Layer 4 -- Chain-of-Thought:

For complex issues, the ChainOfThoughtTemplate wraps the user message in a structured reasoning framework:
- Step 1 -- Problem identification: What is the core issue?
- Step 2 -- Context gathering: What additional info is relevant?
- Step 3 -- Solution exploration: List 2-3 possible approaches
- Step 4 -- Escalation check: Does this require a human?
- Step 5 -- Response composition: Draft the final response
The model reasons internally and outputs only the final customer-facing message.
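A sketch of such a wrapper (hypothetical template text; the actual `ChainOfThoughtTemplate` lives in `prompts.py`):

```python
COT_FRAME = """Before answering, reason through these steps internally:
1. Problem identification: What is the core issue?
2. Context gathering: What additional info is relevant?
3. Solution exploration: List 2-3 possible approaches
4. Escalation check: Does this require a human?
5. Response composition: Draft the final response

Do NOT show your reasoning. Output only the final customer-facing message.

Customer message:
{message}"""

def wrap_with_cot(message: str) -> str:
    """Wrap a user message in the structured reasoning frame."""
    return COT_FRAME.format(message=message)
```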
What is LoRA? Low-Rank Adaptation freezes the pre-trained model weights and injects small trainable matrices into specific layers:
```
Original weight matrix W (d x d): 4096 x 4096 = 16.7M params

LoRA decomposition:
    W' = W + (A x B)
    where A is (d x r) and B is (r x d)

With rank r = 16:
    A: 4096 x 16 =  65K params
    B: 16 x 4096 =  65K params
    ──────────────────────────
    Total:         130K params -- just 0.78% of the original!
```
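To sanity-check the arithmetic, here is a tiny helper (plain Python; `d` and `r` are just the layer size and LoRA rank from above):

```python
def lora_params(d: int, r: int) -> tuple[int, float]:
    """Trainable params for one LoRA-adapted d x d layer, and % of the original."""
    original = d * d          # frozen base weight
    adapter = d * r + r * d   # A (d x r) plus B (r x d)
    return adapter, 100 * adapter / original

params, pct = lora_params(d=4096, r=16)
print(f"{params:,} trainable params ({pct:.2f}% of the frozen layer)")
# -> 131,072 trainable params (0.78% of the frozen layer)
```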
The Training Pipeline:
```
┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────┐
│ Load JSONL  │───▶│ Apply Chat   │───▶│  Tokenize   │───▶│  Train   │
│ Training    │    │ Template     │    │ (truncate   │    │ (SFT +   │
│ Data        │    │ (format for  │    │  to 2048)   │    │  QLoRA)  │
│             │    │  the model)  │    │             │    │          │
└─────────────┘    └──────────────┘    └─────────────┘    └────┬─────┘
                                                               │
┌──────────────┐    ┌─────────────┐    ┌──────────┐            │
│ Export       │◀───│ Evaluate    │◀───│ Validate │◀───────────┘
│ Adapter      │    │ (loss,      │    │ (held-   │
│ (~50MB)      │    │  perplexity)│    │  out set)│
└──────────────┘    └─────────────┘    └──────────┘
```
Key configuration from LoRAConfig:
| Parameter | Default | What it controls |
|---|---|---|
| `rank` | 16 | Size of low-rank matrices (higher = more capacity) |
| `alpha` | 32 | Scaling factor (rule of thumb: 2x rank) |
| `dropout` | 0.05 | Regularization to prevent overfitting |
| `target_modules` | q/k/v/o/gate/up/down_proj | Which layers get adapters |
| `quantization_bits` | 4 | QLoRA: 4-bit NormalFloat quantization |
| `learning_rate` | 2e-4 | Peak LR with cosine schedule |
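In Hugging Face PEFT terms, these defaults correspond roughly to the following (a sketch; the project's own `LoRAConfig` wrapper may expose different field names):

```python
from peft import LoraConfig

# PEFT equivalent of the defaults in the table above
peft_config = LoraConfig(
    r=16,               # rank of the low-rank matrices
    lora_alpha=32,      # scaling factor (2x rank)
    lora_dropout=0.05,  # regularization
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```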
Run the pipeline:
```bash
python -m customer_support.finetuning \
  --base-model mistralai/Mistral-7B-Instruct-v0.3 \
  --dataset data/training/support_conversations.jsonl \
  --output-dir ./training_output \
  --epochs 3 --lora-rank 16
```

The conversation module implements a finite state machine with Redis-backed persistence:
```
┌──────────┐    1st user msg    ┌───────────────┐
│ GREETING │───────────────────▶│ UNDERSTANDING │
└──────────┘                    └───────┬───────┘
                                        │ 2+ turns
                                ┌───────▼───────┐
                                │   RESOLVING   │◀──────┐
                                └───────┬───────┘       │
                                        │ satisfaction  │ new
                                        │ signal        │ issue
                                ┌───────▼───────┐       │
                                │    CLOSING    │───────┘
                                └───────┬───────┘
                                        │
                                ┌───────▼───────┐
                                │    CLOSED     │
                                └───────────────┘

Any state ──(anger/legal/security)──▶ ESCALATED ──▶ CLOSED
```
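For intuition, a compact sketch of the transition logic (simplified, with assumed signal names; the real `ConversationManager` in `conversation.py` also persists state to Redis and applies the escalation triggers shown above):

```python
from enum import Enum

class State(Enum):
    GREETING = "greeting"
    UNDERSTANDING = "understanding"
    RESOLVING = "resolving"
    CLOSING = "closing"
    CLOSED = "closed"
    ESCALATED = "escalated"

def next_state(current: State, *, turns: int, satisfied: bool,
               new_issue: bool, escalate: bool) -> State:
    """Advance the conversation FSM based on simple per-turn signals."""
    if escalate:  # anger / legal / security -- reachable from any state
        return State.ESCALATED
    if current is State.GREETING:
        return State.UNDERSTANDING
    if current is State.UNDERSTANDING and turns >= 2:
        return State.RESOLVING
    if current is State.RESOLVING and satisfied:
        return State.CLOSING
    if current is State.CLOSING:
        return State.RESOLVING if new_issue else State.CLOSED
    return current
```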
Sliding-Window Context: When conversations exceed max_history messages (default: 20), older messages are summarized and prepended to maintain context without exceeding token limits:
`[Summary of turns 1-15] + [Full messages 16-20] → LLM`
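A minimal sketch of that windowing step (assumed names such as `summarize`; the actual manager delegates summarization to the LLM):

```python
MAX_HISTORY = 20
KEEP_RECENT = 5

def build_context(messages: list[dict], summarize) -> list[dict]:
    """Summarize older turns, keep the most recent ones verbatim."""
    if len(messages) <= MAX_HISTORY:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(old)  # e.g. an LLM call that condenses the older turns
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}, *recent]
```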
Sentiment Tracking: Each user message records a sentiment level (very_negative to very_positive). The system computes a trend (improving / stable / deteriorating) to trigger proactive escalation.
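One straightforward way to compute such a trend (a sketch assuming five sentiment levels; the thresholds are illustrative):

```python
LEVELS = {"very_negative": -2, "negative": -1, "neutral": 0,
          "positive": 1, "very_positive": 2}

def sentiment_trend(history: list[str], window: int = 3) -> str:
    """Compare the recent average sentiment against the earlier average."""
    scores = [LEVELS[s] for s in history]
    if len(scores) < 2 * window:
        return "stable"
    earlier = sum(scores[-2 * window:-window]) / window
    recent = sum(scores[-window:]) / window
    if recent - earlier > 0.5:
        return "improving"
    if earlier - recent > 0.5:
        return "deteriorating"
    return "stable"
```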
The classifier uses an LLM-based approach with structured JSON output rather than a traditional ML model:
```
Input: "Your app keeps crashing whenever I try to upload photos"
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│ LLM Classification (temperature=0.0 for determinism) │
│                                                      │
│  System Prompt: taxonomy definition + JSON schema    │
│  User Message:  the customer's text                  │
│  Context:       last 4 messages (optional)           │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
Output JSON:
{
  "primary_intent": "technical",
  "primary_confidence": 0.92,
  "secondary_intents": [
    {"intent": "escalation", "confidence": 0.35}
  ],
  "reasoning": "User reports app crash during photo upload - technical issue"
}
```
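A stripped-down version of that call, assuming the official `anthropic` Python SDK (the model name and schema prompt here are illustrative; the real `IntentClassifier` adds context handling and output validation):

```python
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TAXONOMY_PROMPT = """Classify the customer message into one of:
billing, technical, account, general, escalation.
Respond with JSON only:
{"primary_intent": str, "primary_confidence": float,
 "secondary_intents": [{"intent": str, "confidence": float}],
 "reasoning": str}"""

def classify(message: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=512,
        temperature=0.0,  # deterministic classification
        system=TAXONOMY_PROMPT,
        messages=[{"role": "user", "content": message}],
    )
    return json.loads(response.content[0].text)
```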
Why LLM-based instead of traditional ML?
| Factor | LLM Classification | Traditional ML (e.g. BERT) |
|---|---|---|
| Training data needed | Zero (zero-shot) | Hundreds to thousands of labeled examples |
| New intent support | Update the prompt | Retrain the model |
| Explainability | Built-in reasoning field | Requires separate explanation model |
| Latency | ~200-500ms | ~10-50ms |
| Cost | API call per message | One-time training cost |
The classification result drives routing to the appropriate prompt template (billing_support, technical_support, or general_support).
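The routing itself can then be a simple lookup (a hypothetical sketch; the mapping for intents without a dedicated template is an assumption):

```python
ROUTES = {
    "billing": "billing_support",
    "technical": "technical_support",
    "general": "general_support",
}

def pick_template(primary_intent: str) -> str:
    """Map a classified intent to a prompt template name."""
    return ROUTES.get(primary_intent, "general_support")  # assumed fallback
```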
| Layer | Technology | Purpose |
|---|---|---|
| Framework | FastAPI | Async REST API with OpenAPI docs |
| LLM Provider | Anthropic Claude | Chat completions and classification |
| Prompt Management | YAML + Jinja2 | Version-controlled prompt templates |
| Session Storage | Redis | Conversation persistence with TTL |
| Fine-tuning | PEFT, LoRA, bitsandbytes | Parameter-efficient model adaptation |
| Training | HuggingFace Transformers, TRL | SFTTrainer with QLoRA support |
| Experiment Tracking | Weights & Biases | Training metrics and model comparison |
| Streaming | SSE-Starlette | Real-time token streaming |
| Database | PostgreSQL + SQLAlchemy | (Optional) structured data storage |
| Config | Pydantic Settings | Type-safe environment configuration |
| Logging | structlog | Structured JSON logging |
| Containerization | Docker (multi-stage) | Secure, slim production image |
```
02-customer-support-chatbot/
├── src/customer_support/
│   ├── __init__.py
│   ├── main.py            # Uvicorn entry point
│   ├── api.py             # FastAPI app: chat, classify, prompts, conversations
│   ├── config.py          # Settings (env vars, model config, Redis URL)
│   ├── prompts.py         # SystemPromptBuilder, PromptRegistry, FewShotManager, CoT
│   ├── classifier.py      # IntentClassifier with LLM-based multi-label classification
│   ├── conversation.py    # ConversationManager, state machine, sentiment tracking
│   └── finetuning.py      # LoRA training pipeline (DatasetPreparator, FineTuningPipeline)
├── data/templates/
│   ├── billing_support.yaml
│   ├── technical_support.yaml
│   └── general_support.yaml
├── tests/
│   ├── conftest.py
│   ├── test_api.py
│   └── test_prompts.py
├── k8s/
│   └── deployment.yaml
├── Dockerfile
├── pyproject.toml
└── README.md
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Install dev dependencies: `pip install -e ".[dev]"`
- Run tests: `pytest tests/ -v`
- For fine-tuning work, install training extras: `pip install -e ".[training]"`
- Submit a pull request
This project is part of the AI Engineer Portfolio and is licensed under the MIT License.