A production-grade fine-tuning pipeline for medical reasoning using SFT, QAT, and LLM-as-a-Judge evaluation
This project fine-tunes a 20B parameter language model on curated medical reasoning datasets to create a clinical assistant capable of chain-of-thought (CoT) reasoning. The pipeline includes:
- Dataset Preparation — Merging and formatting multiple medical QA datasets
- Supervised Fine-Tuning (SFT) — LoRA-based fine-tuning with Unsloth
- Quantization-Aware Training (QAT) — INT4 weight quantization for efficient deployment
- GGUF Export — Quantized model export for inference engines (llama.cpp, Ollama)
- LLM-as-a-Judge Evaluation — Blind comparative evaluation against baseline and proprietary models
┌─────────────────────────────────────────────────────────────────────────────┐
│ Training Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Medical │ │ SFT │ │ QAT │ │ GGUF │ │
│ │ Datasets │───▶│ Fine-Tune │───▶│ Refinement │───▶│ Export │ │
│ │ (3 sources) │ │ (LoRA) │ │ (INT4) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Evaluation Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Test Set │ │ Generate │ │ LLM Judge │ │ Win │ │
│ │ │───▶│ Responses │───▶│ (GPT-5.2) │───▶│ Rates │ │
│ │ │ │ (4 models) │ │ Blind Eval │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Dataset | Source | Description |
|---|---|---|
| medical-o1-reasoning-SFT | FreedomIntelligence | Complex chain-of-thought medical reasoning |
| Medical-R1-Distill-Data | FreedomIntelligence | Distilled medical reasoning data |
| MedReason | UCSC-VLAA | Medical QA with structured reasoning |
All datasets are normalized to a unified schema:
instruction → reasoning → output
The reasoning component is wrapped in `<think>...</think>` tags, following the model's native reasoning format.
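For illustration, here is a minimal formatting sketch of that schema. The chat markers and helper name are placeholders for this README, not the exact template the prepare script applies:

```python
# Sketch: render one normalized record into a single training string.
# The real chat template comes from the tokenizer; this only mirrors the
# instruction -> reasoning -> output schema with <think> tags.
def format_example(record: dict) -> str:
    """record has keys: instruction, reasoning, output (unified schema)."""
    return (
        f"<|user|>\n{record['instruction']}\n"          # placeholder role marker
        f"<|assistant|>\n<think>{record['reasoning']}</think>\n"
        f"{record['output']}"
    )

example = {
    "instruction": "A 54-year-old presents with crushing chest pain. What is the next step?",
    "reasoning": "Crushing chest pain suggests acute coronary syndrome; an ECG should come first.",
    "output": "Obtain a 12-lead ECG immediately and give aspirin while awaiting results.",
}
print(format_example(example))
```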
medical_assistant/
├── configs/
│ ├── gpt_oss_20b_sft.yaml # SFT configuration
│ └── gpt_oss_20b_qat.yaml # QAT configuration
├── data/
│ └── prepare_gpt_oss_sft_dataset.py
├── train/
│ └── sft_gpt_oss_20b_unsloth.py
├── scripts/
│ └── quantize.py
├── eval/
│ └── llm_judge_eval.py
├── utils/
│ └── seed.py
├── outputs/ # Model checkpoints & artifacts
├── .env # API keys (gitignored)
└── README.md
- GPU: AMD MI300X (ROCm 6.4.0) on DigitalOcean
- RAM: 128GB+ recommended
- Storage: 200GB+ for model checkpoints
- Ubuntu 24.04 LTS
- Python 3.10+
- ROCm 6.4.0
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git wget curl vim tmux htop nvtop
sudo apt-get install -y python3-venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade torch==2.8.0 pytorch-triton-rocm torchvision torchaudio torchao==0.13.0 xformers \
--index-url https://download.pytorch.org/whl/rocm6.4
pip install --no-deps unsloth unsloth-zoo
pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"pip install -U transformers accelerate datasets trl peft wandb evaluate omegaconf python-dotenv rich safetensors sentencepieceNote: If dependency conflicts occur:
pip install trl==0.24.0 datasets==4.3.11 msgspec cut_cross_entropy
Create a .env file in the project root:
WANDB_PROJECT=clinical-cot
WANDB_ENTITY=your-wandb-username
WANDB_API_KEY=your-wandb-api-key
HF_TOKEN=your-huggingface-token
OPENAI_API_KEY=your-openai-api-key

The GGUF quantization step requires llama.cpp to be built locally. Unsloth uses it internally for model conversion.
# Install CMake if not present
sudo apt-get install -y cmake
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CMake (disable CURL if no internet access)
mkdir -p build
cd build
cmake .. -DLLAMA_CURL=OFF
cmake --build . --config Release -j$(nproc)
# Copy the quantizer binary to where Unsloth expects it
cp ~/workspace/llama.cpp/build/bin/llama-quantize ~/workspace/llama.cpp/
# Convert the Hugging Face model to unquantized GGUF
python3 convert_hf_to_gguf.py ../outputs/gpt_oss_20b/merged/gpt-oss-20b_clinical-cot_qat_refined --outfile ../outputs/gpt_oss_20b/gguf/model-f16.gguf --outtype f16
# Quantize to the desired formats:
build/bin/llama-quantize ../outputs/gpt_oss_20b/gguf/model-f16.gguf ../outputs/gpt_oss_20b/gguf/model-q4_k_m.gguf q4_k_m
build/bin/llama-quantize ../outputs/gpt_oss_20b/gguf/model-f16.gguf ../outputs/gpt_oss_20b/gguf/model-q5_k_m.gguf q5_k_m
build/bin/llama-quantize ../outputs/gpt_oss_20b/gguf/model-f16.gguf ../outputs/gpt_oss_20b/gguf/model-q8_0.gguf q8_0

Note: The llama.cpp project no longer supports the legacy `make` build system. You must use CMake.
Troubleshooting: If you encounter "No working quantizer found" errors, ensure `llama-quantize` exists in the `llama.cpp/` directory (not in `llama.cpp/build/bin/`).
| Quantization Method | Original Size (MiB) | Quantized Size (MiB) | Quantized/Original (%) | Space Saved (MiB) |
|---|---|---|---|---|
| q4_k_m | 39909.25 | 15060.55 | 37.7% | 24848.70 |
| q5_k_m | 39909.25 | 16098.07 | 40.4% | 23811.18 |
| q8_0 | 39909.25 | 21218.21 | 53.2% | 18691.04 |
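The ratio column is simply quantized size divided by the f16 original; a quick arithmetic check using the figures above:

```python
# Verify the size ratios and space savings reported in the table.
original = 39909.25  # MiB, f16 GGUF
for name, quantized in [("q4_k_m", 15060.55), ("q5_k_m", 16098.07), ("q8_0", 21218.21)]:
    ratio = quantized / original                    # fraction of the original size
    print(f"{name}: {ratio:.1%} of original, {original - quantized:.2f} MiB saved")
```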
python data/prepare_gpt_oss_sft_dataset.py

This script:
- Downloads and merges the three medical datasets (a minimal merging sketch follows this list)
- Normalizes the schema to `instruction`, `reasoning`, `output`
- Applies chat template formatting with `<think>` tags
- Creates train/eval splits (95%/5%)
- Saves the processed dataset to `data/gpt_oss/sft_dataset/`
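A rough sketch of the merge-and-normalize step. The Hugging Face repo IDs are inferred from the dataset table above, and the per-source column names are assumptions; the real script resolves each source's actual layout and applies the chat-template formatting shown earlier:

```python
from datasets import load_dataset, concatenate_datasets

# (repo IDs and column names are illustrative; configs/splits may differ per source)
SOURCES = [
    ("FreedomIntelligence/medical-o1-reasoning-SFT", "Question", "Complex_CoT", "Response"),
    ("FreedomIntelligence/Medical-R1-Distill-Data", "question", "reasoning", "answer"),
    ("UCSC-VLAA/MedReason", "question", "reasoning", "answer"),
]

def normalize(ds, q_col, r_col, a_col):
    # Map every source onto the unified instruction / reasoning / output schema.
    return ds.map(
        lambda ex: {
            "instruction": ex[q_col],
            "reasoning": ex[r_col],
            "output": ex[a_col],
        },
        remove_columns=ds.column_names,
    )

parts = [normalize(load_dataset(repo, split="train"), q, r, a) for repo, q, r, a in SOURCES]
merged = concatenate_datasets(parts)
splits = merged.train_test_split(test_size=0.05, seed=42)   # 95% train / 5% eval
splits.save_to_disk("data/gpt_oss/sft_dataset")
```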
python train/sft_gpt_oss_20b_unsloth.py --config configs/gpt_oss_20b_sft.yaml

Key Configuration (configs/gpt_oss_20b_sft.yaml):

| Parameter | Value | Description |
|---|---|---|
| `model.base_model_name` | `unsloth/gpt-oss-20b-BF16` | Base model |
| `lora.r` | 32 | LoRA rank |
| `lora.lora_alpha` | 64 | LoRA scaling factor |
| `train.learning_rate` | 1e-4 | Peak learning rate |
| `train.num_train_epochs` | 2 | Training epochs |
| `train.packing` | true | Sequence packing for efficiency |

Outputs:
- LoRA adapter: `outputs/gpt_oss_20b/runs/<run_name>/lora_adapter/`
- Merged model: `outputs/gpt_oss_20b/merged/<run_name>/`
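For orientation, the core Unsloth + TRL calls behind this step look roughly like the sketch below. Values mirror the table above; the `target_modules` list is illustrative, and the real script reads everything from the YAML:

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_from_disk

# Load the base model and attach LoRA adapters (values from gpt_oss_20b_sft.yaml).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-BF16",
    max_seq_length=4096,
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # Illustrative projection names; the actual list comes from lora.target_modules.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_from_disk("data/gpt_oss/sft_dataset")
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=SFTConfig(
        dataset_text_field="text",   # assumes the prepared split carries a formatted text column
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        num_train_epochs=2,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        packing=True,
        report_to="wandb",
        output_dir="outputs/gpt_oss_20b",
    ),
)
trainer.train()
```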
python train/sft_gpt_oss_20b_unsloth.py --config configs/gpt_oss_20b_qat.yaml

Key Configuration (configs/gpt_oss_20b_qat.yaml):

| Parameter | Value | Description |
|---|---|---|
| `model.base_model_name` | `outputs/.../sft_base` | SFT checkpoint |
| `qat.enabled` | true | Enable QAT mode |
| `qat.qat_scheme` | `int4_weight_only` | Quantization scheme |
| `qat.learning_rate` | 5e-5 | Lower LR for refinement |
| `train.num_train_epochs` | 1 | QAT converges quickly |
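Conceptually, QAT inserts INT4 fake-quantization into the linear layers during the refinement epoch so the weights learn to tolerate the rounding error before export. A very rough torchao-based sketch on a toy module follows; the class name and module path have moved between torchao releases, so treat this as illustrative rather than the project's implementation:

```python
import torch
from torch import nn
# Module path has shifted across torchao versions (formerly torchao.quantization.prototype.qat).
from torchao.quantization.qat import Int4WeightOnlyQATQuantizer

# Toy stand-in for the SFT checkpoint; QAT operates on the model's nn.Linear layers.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).bfloat16()

quantizer = Int4WeightOnlyQATQuantizer()   # int4 weight-only fake quantization
model = quantizer.prepare(model)           # swap Linear -> fake-quantized Linear

# ... the QAT config then runs one short refinement epoch at lr=5e-5 on this model ...

model = quantizer.convert(model)           # materialize the int4 weights after training
```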
python scripts/quantize.py

Exports the QAT model to multiple GGUF quantization formats:
- `q4_k_m`: 4-bit (recommended for deployment)
- `q5_k_m`: 5-bit (balanced quality/size)
- `q8_0`: 8-bit (highest quality)

Output: `outputs/gpt_oss_20b/gguf/`
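Internally the export can go through Unsloth's GGUF helper, which shells out to the local llama.cpp build; a minimal sketch, assuming the merged QAT checkpoint path used in the conversion step above:

```python
from unsloth import FastLanguageModel

# Load the merged QAT-refined checkpoint, then export each GGUF variant.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/gpt_oss_20b/merged/gpt-oss-20b_clinical-cot_qat_refined",
    max_seq_length=4096,
)

for method in ("q4_k_m", "q5_k_m", "q8_0"):
    # save_pretrained_gguf converts to GGUF and invokes llama-quantize for each method.
    model.save_pretrained_gguf(
        "outputs/gpt_oss_20b/gguf",
        tokenizer,
        quantization_method=method,
    )
```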
python eval/llm_judge_eval.py

Evaluation Protocol:
- Sample 50 questions from the held-out eval set
- Generate responses from 4 models:
- Fine-tuned BF16 model
- QAT model
- Base model (unsloth/gpt-oss-20b-BF16)
- GPT-4.1 (proprietary baseline)
- Anonymize and shuffle responses
- GPT-5.2 ranks all 4 responses per question (with respect to the ground truth)
- Aggregate win rates
Output: outputs/evaluation_results.json
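The judging step itself is a blind-ranking loop over the OpenAI API. A simplified sketch; the prompt wording, judge model ID, and parsing are assumptions, not the exact contents of `eval/llm_judge_eval.py`:

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_question(question: str, reference: str, responses: dict[str, str]) -> str:
    """Return the model name whose anonymized response the judge ranks first."""
    # Shuffle so the judge never sees which model produced which answer.
    items = list(responses.items())
    random.shuffle(items)
    labels = [chr(ord("A") + i) for i in range(len(items))]
    blinded = "\n\n".join(f"Response {l}:\n{r}" for l, (_, r) in zip(labels, items))

    prompt = (
        f"Question:\n{question}\n\nGround-truth answer:\n{reference}\n\n"
        f"{blinded}\n\nRank the responses from best to worst against the ground truth. "
        "Reply with the single letter of the best response."
    )
    reply = client.chat.completions.create(
        model="gpt-5.2",  # placeholder judge ID; set to whatever judge model is configured
        messages=[{"role": "user", "content": prompt}],
    )
    best = reply.choices[0].message.content.strip()[0]
    return dict(zip(labels, (name for name, _ in items)))[best]
```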
Win Rates (out of 50 prompts)
| Model | Wins | Win Rate |
|---|---|---|
| Base GPT-OSS (Unfinetuned) | 1 | 2.0% |
| GPT-4.1 (OpenAI API) | 0 | 0.0% |
| QAT Fine-tuned GPT-OSS | 14 | 28.0% |
| Fine-tuned GPT-OSS (BF16) | 35 | 70.0% |
Average Overall Scores (1–10, GPT-5.2 judge)
| Model | Score |
|---|---|
| Base GPT-OSS (Unfinetuned) | 7.12 |
| GPT-4.1 (OpenAI API) | 7.37 |
| QAT Fine-tuned GPT-OSS | 8.32 |
| Fine-tuned GPT-OSS (BF16) | 9.16 |
Latency (per question, approx.)
| Model | Time (s) |
|---|---|
| Base GPT-OSS (Unfinetuned) | 14.3 |
| Fine-tuned GPT-OSS (BF16) | 14.7 |
| QAT Fine-tuned GPT-OSS | 13.9 |
| GPT-4.1 (OpenAI API) | 5.8 |
These results are from a 50-question sample and a single GPT-5.2 judge; they should be treated as preliminary.
run:
  seed: 42                      # Reproducibility seed
  output_dir: outputs/...       # Checkpoint directory
  run_name: ...                 # W&B run name
data:
  input_disk_path: ...          # Pre-merged dataset (optional)
  output_disk_path: ...         # Processed dataset output
  eval_size: 0.05               # Eval split ratio
  max_seq_length: 4096          # Maximum sequence length
  reasoning_effort: medium      # Reasoning verbosity hint
model:
  base_model_name: ...          # HuggingFace model ID or local path
  load_in_4bit: false           # 4-bit loading (inference only)
  dtype: bf16                   # Model dtype
lora:
  r: 32                         # LoRA rank
  lora_alpha: 64                # LoRA alpha
  lora_dropout: 0.05            # Dropout rate
  target_modules: [...]         # Modules to adapt
qat:
  enabled: false                # Enable QAT mode
  qat_scheme: int4_weight_only
  learning_rate: 5.0e-5
train:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 8
  learning_rate: 1.0e-4
  num_train_epochs: 2
  warmup_ratio: 0.03
  lr_scheduler_type: cosine
  packing: true
  report_to: wandb

All training runs are logged to Weights & Biases:
- Loss curves (train/eval)
- Learning rate schedule
- Gradient norms
- Hyperparameters
- System metrics (GPU utilization, memory)
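The training script reads this file with OmegaConf (a listed dependency) and passes `train.report_to` through to the Hugging Face trainer, which is what turns on the logging above. A minimal loading sketch, with a few assumed key-to-argument mappings:

```python
import os
from omegaconf import OmegaConf
from trl import SFTConfig

cfg = OmegaConf.load("configs/gpt_oss_20b_sft.yaml")

# Nested keys from the reference above become plain attribute accesses.
args = SFTConfig(
    output_dir=cfg.run.output_dir,
    run_name=cfg.run.run_name,
    seed=cfg.run.seed,
    learning_rate=cfg.train.learning_rate,
    num_train_epochs=cfg.train.num_train_epochs,
    packing=cfg.train.packing,
    report_to=cfg.train.report_to,   # "wandb" -> metrics stream to the W&B project from .env
)
print(args.report_to, os.environ.get("WANDB_PROJECT"))
```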
| Artifact | Path | Description |
|---|---|---|
| SFT Merged Model | `outputs/gpt_oss_20b/merged/..._sft_base/` | Full merged BF16 model |
| QAT Merged Model | `outputs/gpt_oss_20b/merged/..._qat/` | QAT-refined model |
| GGUF Quantized | `outputs/gpt_oss_20b/gguf/` | Deployment-ready quantized models |
| Eval Results | `outputs/evaluation_results.json` | LLM judge rankings |
# Register the quantized GGUF with Ollama via a Modelfile, then run it
echo "FROM ./outputs/gpt_oss_20b/gguf/model-q4_k_m.gguf" > Modelfile
ollama create clinical-cot -f Modelfile
ollama run clinical-cot

# llama.cpp's CMake build ships the CLI as build/bin/llama-cli (the legacy ./main was renamed)
llama.cpp/build/bin/llama-cli -m outputs/gpt_oss_20b/gguf/model-q4_k_m.gguf \
  -p "What are the differential diagnoses for chest pain?" \
  -n 512

This project includes a production-ready inference server and a Streamlit-based web interface for interacting with the fine-tuned model.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Inference Stack │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Streamlit │ │ FastAPI │ │ llama.cpp │ │
│ │ Web UI │───▶│ Server │───▶│ Backend │ │
│ │ (app.py) │ │ (server.py) │ │ (GGUF Model)│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ :8501 :8000 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
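As a sketch of how the FastAPI layer can bridge the UI and the GGUF model, here is a bare-bones `server.py` using the llama-cpp-python binding; the actual server may use a different backend and endpoint shape:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI(title="clinical-cot inference server")

# Load the quantized GGUF once at startup (path from the export step above).
llm = Llama(model_path="outputs/gpt_oss_20b/gguf/model-q4_k_m.gguf", n_ctx=4096)

class Query(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.post("/generate")
def generate(query: Query) -> dict:
    # The Streamlit UI (port 8501) POSTs here (port 8000) and renders the completion.
    result = llm(query.prompt, max_tokens=query.max_tokens)
    return {"response": result["choices"][0]["text"]}
```

Run it with `uvicorn server:app --port 8000` and point the Streamlit app (port 8501) at the `/generate` endpoint, matching the ports in the diagram above.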
If you use this pipeline, please cite the source datasets:
@misc{medical-o1-reasoning,
title={Medical-O1-Reasoning-SFT},
author={FreedomIntelligence},
year={2024},
publisher={HuggingFace}
}
@misc{medical-r1-distill,
title={Medical-R1-Distill-Data},
author={FreedomIntelligence},
year={2024},
publisher={HuggingFace}
}
@misc{medreason,
title={MedReason},
author={UCSC-VLAA},
year={2024},
publisher={HuggingFace}
}

This project is for research and educational purposes. Please ensure compliance with:
- Base model license terms
- Dataset licenses
- OpenAI API terms of service (for evaluation)
Built with Unsloth 🦥 + Hugging Face 🤗
