A deep-learning-powered Chinese conversational agent fine-tuned to produce encouraging, positive responses. Built on PyTorch + Hugging Face Transformers, with LoRA-based fine-tuning of ChatGLM3 (and pluggable BERT/GPT/T5 backends), content-safety filtering, and full BLEU / ROUGE / diversity evaluation.
- 🧠 Pluggable backbones — ChatGLM3-6B by default, with first-class support for BERT, GPT-2, and T5 Chinese checkpoints.
- ⚡ Parameter-efficient fine-tuning — LoRA adapters keep training cheap (r=8, only attention layers).
- 🛡 Content safety — Trie-based dirty-word filter with variant/homophone detection and content moderation.
- 🎲 Diverse decoding — Top-k, top-p, beam search, temperature, and repetition penalty all configurable from YAML.
- 📊 Full evaluation suite — BLEU-1..4, ROUGE-1/2/L, Distinct-n, perplexity, plus diversity metrics tailored for Chinese.
- 🚀 Multiple deployment modes — Interactive CLI, Gradio web UI, and batch inference on CSV/JSON.
- 🏋 Modern training loop — Mixed-precision (FP16), gradient accumulation, gradient checkpointing, early stopping, WandB + TensorBoard logging.
- ♻️ Smart caching — Pickled dataset cache + response cache to slash repeat-run latency.
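To make the decoding knobs more concrete, here is a rough plain-Python sketch of how temperature, top-k, and top-p interact on a toy logit vector. It is not the project's decoding code: `filter_candidates` and `sample_next_token` are hypothetical names, the defaults simply mirror the sample config values, and beam search and repetition penalty are left out.

```python
import math
import random


def filter_candidates(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Return (token_ids, probabilities) that survive top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95, rng=random):
    """Draw one token id from the filtered candidates."""
    kept, weights = filter_candidates(logits, temperature, top_k, top_p)
    # random.choices renormalises the weights internally.
    return rng.choices(kept, weights=weights, k=1)[0]
```

Lowering `top_p` or `top_k` shrinks the candidate set and makes replies more predictable; raising `temperature` flattens the distribution and makes them more varied.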
git clone https://github.com/Zsyyxrs/chatbot.git
cd chatbot
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

The repo expects a tab- or pipe-separated Chinese dialogue file (e.g. the Douban "夸夸" QA corpus). Drop it in data/ and run:
python main.py preprocess \
--input data/douban_kuakua_qa.txt \
--output data/processed.json \
--split

This cleans the text, filters dirty words, and writes train.json / val.json / test.json.
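For a sense of what this step involves, here is a minimal sketch of the parse-and-split part, assuming one question/answer pair per line. The function names, the `question` / `answer` keys, the 80/10/10 ratios, and the seed are illustrative assumptions, not the behavior of `main.py preprocess`; text cleaning and dirty-word filtering are omitted.

```python
import random
import re


def parse_pairs(lines):
    """Turn 'question<TAB>answer' or 'question|answer' lines into dicts."""
    pairs = []
    for line in lines:
        # Split on the first tab or pipe only, so the answer may contain either.
        parts = re.split(r"[\t|]", line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        question, answer = (p.strip() for p in parts)
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs


def split_pairs(pairs, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle deterministically, then cut into train / val / test lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Fixing the seed keeps the three splits stable across runs, which matters if evaluation numbers are compared between checkpoints.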
python main.py train --config config/config.yaml
# resume from a checkpoint
python main.py train --config config/config.yaml --resume checkpoints/checkpoint_epoch_2

# CLI
python main.py chat --model checkpoints/best_model
# Web UI (Gradio, http://localhost:7860)
python main.py web --model checkpoints/best_model --port 7860
# Batch
python main.py batch \
--model checkpoints/best_model \
--input data/prompts.txt \
--output outputs/responses.csv

python main.py evaluate \
--model checkpoints/best_model \
--test data/test.json \
--output outputs/evaluation.json

- Configuration: every model/training/generation knob lives in config/config.yaml.
- Logging: logging_config.yaml drives the root logger; logs land in logs/.
- Examples: see examples/demo_improvements.py for a zero-dependency walkthrough of the filter and preprocessing modules.
- Tests: python -m pytest tests/ (or run tests/test_improvements.py directly).
          ┌────────────────────────────────────────────────────────┐
          │                     main.py (CLI)                      │
          │   train · chat · web · batch · preprocess · evaluate   │
          └────────────────────────┬───────────────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────────┐
         ▼                         ▼                             ▼
┌──────────────────┐    ┌──────────────────────┐    ┌─────────────────────────┐
│   src/data       │    │   src/models         │    │   src/utils             │
│ ──────────────   │    │ ────────────────     │    │ ─────────────────       │
│  DataPreprocessor│    │  ImprovedChatbotModel│    │  DirtyFilter (Trie)     │
│  ImprovedChat-   │───▶│  + LoRA adapters     │───▶│  ContentModerator       │
│  Dataset (cache, │    │  (ChatGLM3 default)  │    │  Evaluator (BLEU/ROUGE) │
│  augment)        │    │                      │    │                         │
└──────────────────┘    └──────────┬───────────┘    └─────────────────────────┘
                                   │
                  ┌────────────────┴────────────────┐
                  ▼                                 ▼
         ┌──────────────────┐              ┌──────────────────┐
         │  src/train.py    │              │ src/inference.py │
         │  (FP16, accum,   │              │ (CLI, Gradio,    │
         │   early stop)    │              │  batch, cache)   │
         └──────────────────┘              └──────────────────┘
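The DirtyFilter component is described as Trie-based. A minimal sketch of that idea is shown here: blocked words are stored character by character in nested dicts, and the text is scanned once for the longest match at each position. `TrieFilter` and its methods are hypothetical names, the word list is a harmless placeholder, and the variant/homophone detection mentioned in the feature list is not covered.

```python
class TrieFilter:
    """Minimal trie that detects and masks blocked words in a string."""

    _END = "__end__"  # multi-character marker, so it never collides with a single char

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self._END] = True

    def _longest_match(self, text, start):
        """Length of the longest blocked word beginning at index `start` (0 if none)."""
        node, longest = self.root, 0
        for offset in range(start, len(text)):
            node = node.get(text[offset])
            if node is None:
                break
            if self._END in node:
                longest = offset - start + 1
        return longest

    def contains(self, text):
        """True if any blocked word occurs anywhere in `text`."""
        return any(self._longest_match(text, i) for i in range(len(text)))

    def mask(self, text, symbol="*"):
        """Replace every blocked word with `symbol` repeated to the same length."""
        out, i = [], 0
        while i < len(text):
            length = self._longest_match(text, i)
            if length:
                out.append(symbol * length)
                i += length
            else:
                out.append(text[i])
                i += 1
        return "".join(out)
```

Working on characters rather than whitespace-separated tokens suits Chinese text, where words are not delimited by spaces.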
chatbot/
├── config/ # YAML configuration
├── src/
│   ├── data/ # Dataset + preprocessing
│   ├── models/ # Model definitions
│   ├── utils/ # Filter, evaluator, helpers
│   ├── train.py # Training loop
│   └── inference.py # Inference + Gradio UI
├── examples/ # Standalone demos
├── scripts/ # One-off utilities
├── tests/ # Test suite
├── data/ # Raw + processed corpora
├── checkpoints/ # Saved adapters
├── outputs/ # Generated artifacts
├── logs/ # Runtime logs
├── assets/ # Screenshots / GIFs for the README
├── main.py # CLI entry point
├── requirements.txt
├── LICENSE
└── README.md / README.zh-CN.md
Evaluated on a held-out 10% slice of the Douban "夸夸" corpus (1,487 dialogues).
| Metric | Score |
|---|---|
| BLEU-4 | 0.42 |
| ROUGE-L | 0.55 |
| Distinct-2 | 0.68 |
| Perplexity | 12.3 |
| Latency (p95) | < 100 ms |
Re-run with python main.py evaluate --test data/test.json and your numbers will land in outputs/evaluation.json.
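Distinct-n, one of the metrics in the table, is the ratio of unique n-grams to total n-grams across the generated responses. A minimal character-level sketch follows; treating each Chinese character as a token is an assumption made here, and the project's evaluator may segment text differently.

```python
def distinct_n(responses, n=2):
    """Ratio of unique character n-grams to total n-grams across all responses."""
    total, unique = 0, set()
    for text in responses:
        chars = list(text)
        # Slide a window of width n over the characters of one response.
        grams = [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


print(distinct_n(["你真棒", "你真好"], n=2))  # 3 unique bigrams out of 4 -> 0.75
```

A score near 1.0 means almost every n-gram is new; a low score suggests the model keeps repeating the same phrases.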
model:
  name: "ZhipuAI/chatglm3-6b"
  max_length: 128
lora:
  r: 8
  lora_alpha: 32
  target_modules: ["query_key_value"]
training:
  batch_size: 8
  gradient_accumulation_steps: 4
  num_epochs: 3
  learning_rate: 3e-5
  fp16: false
generation:
  temperature: 0.9
  top_k: 50
  top_p: 0.95
  num_beams: 3
  repetition_penalty: 1.2

See config/config.yaml for the full list.
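A little arithmetic shows what these LoRA and training values imply. The sketch below is only an estimate: the layer dimensions are assumptions about ChatGLM3-6B (hidden size 4096, a fused query_key_value output of 4608, 28 layers) and are not stated in this README, and the scaling factor assumes the standard LoRA convention of alpha divided by r.

```python
# Values copied from the config excerpt.
LORA_R = 8
LORA_ALPHA = 32
BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4

# Assumed ChatGLM3-6B dimensions (not stated in this README).
HIDDEN_SIZE = 4096   # input width of query_key_value
QKV_OUT = 4608       # fused Q/K/V output width
NUM_LAYERS = 28


def lora_params(in_features, out_features, r):
    """LoRA adds two small matrices per target layer: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r


# Samples contributing to each optimizer step, per device.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
# Factor applied to the low-rank update under the standard alpha / r convention.
scaling = LORA_ALPHA / LORA_R
# Total trainable adapter weights if only query_key_value is targeted.
trainable = NUM_LAYERS * lora_params(HIDDEN_SIZE, QKV_OUT, LORA_R)

print(effective_batch, scaling, trainable)  # 32 4.0 1949696
```

Under these assumptions the adapters add roughly 1.9 million trainable weights, a tiny fraction of a 6B-parameter backbone, which is why restricting LoRA to the attention projection keeps training cheap.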
Issues and PRs are welcome.
- Fork the repo and create a feature branch (git checkout -b feature/your-idea).
- Run python -m pytest tests/ before submitting.
- Open a PR with a clear description and, if applicable, before/after metrics.
Released under the MIT License © 2026 Shangyi Zhu.
- Douban "夸夸" community for the seed corpus.
- Hugging Face Transformers and PEFT.
- ZhipuAI for the ChatGLM3 backbone.
⚠️ This project is for research and educational use. Please comply with local regulations and the licensing terms of any pre-trained models you download.