⭐ If you find this project helpful, please consider giving it a star on GitHub!
Khai Le-Duc* 1,2✉, Duy M. H. Nguyen* 3,4,24✉, Phuong T. H. Trinh* 5, Tien-Phat Nguyen* 6, Nghiem T. Diep** 3, An Ngo** 7, Tung Vu** 8, Trinh Vuong9, Anh-Tien Nguyen10,11, Mau Nguyen12, Van Trung Hoang13, Khai-Nguyen Nguyen14, Hy Nguyen15, Chris Ngo2, Anji Liu16, Nhat Ho17, Anne-Christin Hauschild11, Khanh Xuan Nguyen18, Thanh Nguyen-Tang19, Pengtao Xie20,21, Daniel Sonntag3,22, James Zou23, Mathias Niepert4,24, Anh Totti Nguyen25✉
*Co-first authors; order randomized | **Co-second authors
✉ Corresponding Authors
🎓 Affiliations (click to expand)
- University of Toronto, Canada
- Knovel Engineering Lab, Singapore
- German Research Centre for Artificial Intelligence
- University of Stuttgart, Germany
- Chonnam National University, South Korea
- Singapore University of Technology and Design
- Bucknell University, USA
- Concordia University, Canada
- Korea University
- Justus Liebig University Giessen, Germany
- University Medical Center Göttingen, Germany
- Japan Advanced Institute of Science and Technology
- Hue University, Vietnam
- College of William & Mary, USA
- Deakin University, Australia
- National University of Singapore
- University of Texas at Austin, USA
- University of California, Berkeley, USA
- New Jersey Institute of Technology, USA
- University of California San Diego, USA
- MBZUAI, UAE
- Oldenburg University, Germany
- Stanford University, USA
- Max Planck Research School for Intelligent Systems (IMPRS-IS), Germany
- Auburn University, USA
✨ In honor of Hải Thượng Lãn Ông (海上懶翁) – Lê Hữu Trác (黎友晫), the father of Vietnamese traditional medicine ✨
S-Chain is the first large-scale dataset of Structured Visual Chain-of-Thought (SV-CoT): each reasoning step is explicitly linked to visual evidence via bounding boxes. This enables training and evaluating grounded medical VLM reasoning instead of hallucinated justifications.
- 12,000 medical images with expert bounding boxes.
- 700k+ VQA / rationale pairs across 16 languages.
- Each sample: image, question, answer, stepwise SV-CoT, and per-step visual regions.
We show that supervising VLMs with SV-CoT:
- Improves interpretability
- Improves grounding fidelity (reasoning actually points to the right region)
- Improves robustness across models and languages
- [Oct 2025] Released experiment scripts and checkpoints for two state-of-the-art medical MLLMs, ExGra-Med and LLaVA-Med.
- [Oct 2025] Dataset and project site released.
- `architectures/`: adapters for each backbone (ExGra-Med, LLaVA-Med, InternVL, MedGemma, ...). Each model has its own installation and usage instructions.
- `medrag_integration/`: Retrieval-Augmented Generation (RAG) setup for medical evidence.
- `data/`: dataset download scripts and directory conventions.
Example Usage (Python) from Hugging Face
👉 https://huggingface.co/datasets/leduckhai/S-Chain
from datasets import load_dataset
dataset = load_dataset("leduckhai/S-Chain")
print(dataset)
Or using Bash:
cd data
bash download_english.sh # English-only SV-CoT split
bash download_multilingual.sh # All 16 languages
This will populate:
data/
s_chain_en/
train.jsonl
val.jsonl
test.jsonl
images/
annotations/
s_chain_multilingual/
...
Each *.jsonl record contains:
{
"image_path": "images/img_000123.png",
"question": "...",
"answer": "...",
"sv_cot": [
{
"step_text": "First, identify the left costophrenic angle...",
"evidence_bbox": [x, y, w, h]
},
{
"step_text": "Blunting indicates pleural effusion...",
"evidence_bbox": [x, y, w, h]
}
],
"language": "en"
}
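When loading from Hugging Face, the same fields should be available on each sample. Below is a minimal sketch for walking through the grounded steps, assuming a `train` split and that the Hub columns mirror the JSONL schema above:

```python
from datasets import load_dataset

# Assumes the Hub columns mirror the JSONL schema above; inspect `print(dataset)`
# to confirm the actual splits and field names.
dataset = load_dataset("leduckhai/S-Chain")
sample = dataset["train"][0]

print("Q:", sample["question"])
print("A:", sample["answer"])

# Each SV-CoT step pairs a textual rationale with the bounding box that grounds it.
for i, step in enumerate(sample["sv_cot"], start=1):
    x, y, w, h = step["evidence_bbox"]
    print(f"Step {i}: {step['step_text']}  (bbox: x={x}, y={y}, w={w}, h={h})")
```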
| Model | Description | 🤗 Download Link |
|---|---|---|
| `llava-med-base` | LLaVA-Med trained with base settings (Q4 only) | Link |
| `llava-med-gpt-cot` | LLaVA-Med trained with GPT-synthetic visual CoT | Link |
| `llava-med-gpt-schain` | LLaVA-Med trained with our S-Chain dataset | Link |
| `llava-med-gpt-medrag-only` | LLaVA-Med with medical retrieval-augmented generation and Q4 only | Link |
| `llava-med-gpt-medrag-schain` | LLaVA-Med with medical retrieval-augmented generation and S-Chain | Link |
| `exgra-med-base` | ExGra-Med trained with base settings (Q4 only) | Link |
| `exgra-med-gpt-cot` | ExGra-Med trained with GPT-synthetic visual CoT | Link |
| `exgra-med-gpt-schain` | ExGra-Med trained with our S-Chain dataset | Link |
| `exgra-med-gpt-medrag-only` | ExGra-Med with medical retrieval-augmented generation and Q4 only | Link |
| `exgra-med-gpt-medrag-schain` | ExGra-Med with medical retrieval-augmented generation and S-Chain | Link |
Before starting fine-tuning, inference, or evaluation, download our fine-tuned checkpoints. For example, download the `exgra-med-gpt-schain` folder at this link and put it inside `architectures/Exgra-Med/checkpoints`.
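The checkpoint folder can also be fetched programmatically with `huggingface_hub`. A sketch follows; the repo ID is a placeholder and should be replaced with the repository linked in the table above:

```python
from huggingface_hub import snapshot_download

# Placeholder: substitute the repository linked for `exgra-med-gpt-schain`
# in the checkpoint table above.
REPO_ID = "<org>/exgra-med-gpt-schain"

# Download the whole checkpoint folder to the path the scripts expect.
snapshot_download(
    repo_id=REPO_ID,
    local_dir="architectures/Exgra-Med/checkpoints/exgra-med-gpt-schain",
)
```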
Below, we load ExGra-Med fine-tuned on SV-CoT from Hugging Face and generate an answer together with its grounded rationale.
cd architectures/Exgra-Med-CoT
# Then choose one of the two ways below:
bash bashscript/run_infer_demo.py
# or
python llava/eval/run_med_datasets_eval_batch_CoT.py \
--num-chunks 2 \
--conv-mode ${prompt_mode} \
--use_rag ${use_rag} \
--model-name ${output_dir} \
--mm_dense_connector_type none \
--num_l 6 \
--question-file ${test_file_json} \
--image-folder ${image_folder} \
--answers-file ${answers_file}
python llava/eval/run_eval_CoT.py \
--gt ${test_file_json} \
--pred ${answers_file}
Outputs include (a) the predicted answer, (b) the stepwise visual chain-of-thought, and (c) the bounding boxes for each step (overlays saved in outputs/viz/).
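To overlay the per-step boxes yourself (rather than relying on the saved visualizations in outputs/viz/), here is a minimal Pillow sketch, assuming the [x, y, w, h] pixel-coordinate convention from the JSONL schema:

```python
from PIL import Image, ImageDraw

def draw_sv_cot(image_path, sv_cot, out_path):
    """Draw one rectangle (and step index) per reasoning step on the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, step in enumerate(sv_cot, start=1):
        x, y, w, h = step["evidence_bbox"]  # assumed [x, y, w, h] in pixels
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        draw.text((x, max(y - 12, 0)), f"step {i}", fill="red")
    img.save(out_path)

# Example call (paths are illustrative):
# draw_sv_cot("images/img_000123.png", record["sv_cot"], "outputs/viz/overlay_000123.png")
```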
We evaluate the following training regimes for each backbone:
- Baseline (Q4 only): Supervise the model on the input image, question, and final answer, without any chain-of-thought.
- GPT-Synthetic CoT: Supervise on GPT-generated synthetic visual chains-of-thought.
- SV-CoT (Ours): Supervise on our Structured Visual CoT, where each reasoning step is linked to image regions.
- Medical RAG-only: Fine-tune with medical Retrieval-Augmented Generation context. We follow MIRIAD to generate additional context in the input prompts and train the models without our SV-CoT supervision.
- SV-CoT + RAG (Joint): Fine-tune with both visual step grounding from S-Chain and retrieved evidence from MIRIAD (see the illustrative sketch after this list).
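To make the regimes concrete, the sketch below shows one way a (prompt, target) pair could be assembled per setting. It is purely illustrative and not the exact prompt format used by our training scripts; the field names follow the JSONL schema above, and the step serialization is an assumption.

```python
def build_example(sample, regime, retrieved_context=None):
    """Illustrative only: assemble a (prompt, target) pair for one sample."""
    prompt = sample["question"]
    if regime in ("rag_only", "sv_cot_rag") and retrieved_context:
        # RAG settings prepend retrieved medical evidence to the input prompt.
        prompt = f"Context: {retrieved_context}\n{prompt}"

    if regime in ("sv_cot", "sv_cot_rag"):
        # SV-CoT settings supervise on the grounded steps plus the final answer;
        # each step is serialized here as "<step text> [x, y, w, h]" (an assumption).
        steps = "\n".join(
            f"{s['step_text']} {s['evidence_bbox']}" for s in sample["sv_cot"]
        )
        target = f"{steps}\nAnswer: {sample['answer']}"
    else:
        # Baseline and RAG-only settings supervise on the final answer alone (Q4 only).
        target = sample["answer"]
    return prompt, target
```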
To train a provided model under any of these settings, first move into the corresponding folder in ./architectures and follow its README carefully.
1. ExGra-Med & LLaVA-Med
To train:
cd architectures/Exgra-Med-CoT
bash bashscript/llava1-5_stage2_noval_CoT.sh
To evaluate:
cd architectures/Exgra-Med-CoT
python llava/eval/run_med_datasets_eval_batch_CoT.py \
--num-chunks 2 \
--conv-mode ${prompt_mode} \
--use_rag ${use_rag} \
--model-name ${output_dir} \
--mm_dense_connector_type none \
--num_l 6 \
--question-file ${test_file_json} \
--image-folder ${image_folder} \
--answers-file ${answers_file}
python llava/eval/run_eval_CoT.py \
--gt ${test_file_json} \
--pred ${answers_file}
Please find more details in Exgra-Med & LLaVA-Med.
More models coming soon...
If you find this work useful, please cite our paper: https://arxiv.org/abs/2510.22728
@article{leduc2025schain,
title={S-Chain: Structured Visual Chain-of-Thought For Medicine},
author={Le-Duc, Khai and Trinh, Phuong T. H. and Nguyen, Duy M. H. and Nguyen, Tien-Phat and Diep, Nghiem T. and Ngo, An and Vu, Tung and Vuong, Trinh and Nguyen, Anh-Tien and Nguyen, Mau and Hoang, Van Trung and Nguyen, Khai-Nguyen and Nguyen, Hy and Ngo, Chris and Liu, Anji and Ho, Nhat and Hauschild, Anne-Christin and Nguyen, Khanh Xuan and Nguyen-Tang, Thanh and Xie, Pengtao and Sonntag, Daniel and Zou, James and Niepert, Mathias and Nguyen, Anh Totti},
journal={arXiv preprint},
eprint={2510.22728},
url={https://arxiv.org/abs/2510.22728},
year={2025}
}
The S-Chain dataset is provided solely for research and educational purposes. It may contain human or machine annotation errors, as well as potential biases or inconsistencies inherent to medical data. Users are expected to exercise appropriate caution in interpretation and ensure ethical and non-commercial use.
