Figure 1 | MedVAL test-time workflow. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. It evaluates the accuracy and safety of AI-generated medical text across multiple medical tasks, and supports both model fine-tuning and evaluation.
Create and activate the conda environment:

```bash
conda env create -f env.yml
conda activate medval
```

Run evaluation:

```bash
python run.py --config=test
```
For evaluating API-based models (OpenAI, Anthropic, Gemini, etc.):

Configuration (`configs/test.yaml`):

```yaml
tasks: [dialogue2note, medication2answer, query2question, report2impression]
data: test
method: zero-shot # [zero-shot, finetune]
n_samples: null
debug: False
input_csv: null # Optional: Path to custom CSV file
model: openai/gpt-4o-mini
api_base: null
api_key: ${API_KEY}
local_model_path: null
```
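Since the config above reads the key from `${API_KEY}`, that variable must be set in the environment before launching. A minimal sketch of one way to supply it (exporting `API_KEY` in your shell is equivalent; the key value is a placeholder):

```python
import os
import subprocess

# Make the key referenced by api_key: ${API_KEY} visible to run.py.
os.environ["API_KEY"] = "sk-..."  # placeholder; substitute your real key

# Launch the evaluation; the child process inherits the environment.
subprocess.run(["python", "run.py", "--config=test"], check=True)
```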
For evaluating local or HuggingFace models:

Configuration (`configs/test.yaml`):

```yaml
tasks: [dialogue2note, medication2answer, query2question, report2impression]
data: test
method: zero-shot # [zero-shot, finetune]
n_samples: null
debug: False
input_csv: null # Optional: Path to custom CSV file
model: local/MODEL_NAME
api_base: null
api_key: null
local_model_path: /path/to/local/model
```

Run fine-tuning:

```bash
python run.py --config=train
```
For fine-tuning a local student model using an API-based teacher model:

Configuration (`configs/train.yaml`):

```yaml
tasks: [medication2answer, query2question, report2impression, report2simplified]
data: train
method: finetune
n_samples: null
debug: False
num_threads: 16
num_epochs: 5
threshold: 0.95
model: openai/gpt-4o-mini
api_base: null
api_key: ${API_KEY}
student_model: local/STUDENT_MODEL_NAME
local_model_path: /path/to/student/model
```
For fine-tuning a local student model using a local teacher model:

Configuration (`configs/train.yaml`):

```yaml
tasks: [medication2answer, query2question, report2impression, report2simplified]
data: train
method: finetune
n_samples: null
debug: False
num_threads: 16
num_epochs: 5
threshold: 0.95
model: local/MODEL_NAME
api_base: null
api_key: null
student_model: local/MODEL_NAME
local_model_path: /path/to/local/model
```
Example model configurations for different providers:

OpenAI:

```yaml
model: openai/MODEL_NAME
api_base: null
api_key: ${OPENAI_API_KEY}
```

Gemini:

```yaml
model: gemini/MODEL_NAME
api_base: null
api_key: ${GEMINI_API_KEY}
```

Anthropic:

```yaml
model: anthropic/MODEL_NAME
api_base: null
api_key: ${ANTHROPIC_API_KEY}
```

HuggingFace models served through an OpenAI-compatible endpoint:

```yaml
model: openai/HUGGINGFACE_MODEL_NAME
api_base: http://SERVER_IP:PORT/v1
api_key: local
```

Ollama:

```yaml
model: ollama_chat/MODEL_NAME
api_base: http://SERVER_IP:PORT
api_key: null
```
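Before pointing MedVAL at a self-hosted HuggingFace model, it can help to verify that the server answers OpenAI-compatible requests. A minimal connectivity check using the official `openai` Python client (this assumes the server, e.g. vLLM, exposes the `/v1` API at the placeholder address above):

```python
from openai import OpenAI

# Point the client at the self-hosted, OpenAI-compatible endpoint.
client = OpenAI(base_url="http://SERVER_IP:PORT/v1", api_key="local")

# List the models the server exposes; a successful call confirms the
# api_base/api_key pair in the config will work.
for model in client.models.list():
    print(model.id)
```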
Dataset Loading:
- By default, the MedVAL-Bench dataset is automatically loaded from HuggingFace: `load_dataset("stanfordmimi/MedVAL-Bench")`.
- To use a custom CSV file, set `input_csv: /path/to/csv` in `configs/test.yaml` (ensure the custom CSV has a column structure similar to the HuggingFace dataset; see the sketch below).
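For reference, a short sketch of inspecting MedVAL-Bench and exporting a split as a column-structure template for a custom CSV (the `test` split name is an assumption; use whichever splits the printout shows):

```python
from datasets import load_dataset

# Download MedVAL-Bench from the HuggingFace Hub (cached after first run).
bench = load_dataset("stanfordmimi/MedVAL-Bench")
print(bench)  # shows the available splits and their columns

# Export one split to CSV as a template for input_csv.
bench["test"].to_pandas().to_csv("medval_bench_template.csv", index=False)
```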
MedVAL-4B Model:
- MedVAL-4B can be downloaded from HuggingFace (`stanfordmimi/MedVAL-4B`). Once downloaded, run evaluation with MedVAL-4B by setting `local_model_path: /path/to/medval-4b` in the config; a download sketch follows below.
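A minimal sketch of fetching the weights with `huggingface_hub` (the local directory is an arbitrary choice; point `local_model_path` at wherever you download them):

```python
from huggingface_hub import snapshot_download

# Download the MedVAL-4B weights from the HuggingFace Hub.
path = snapshot_download(repo_id="stanfordmimi/MedVAL-4B",
                         local_dir="models/medval-4b")  # arbitrary location

# Use this path as local_model_path in configs/test.yaml.
print(path)
```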
Configuration parameters:
- `tasks`: List of tasks for fine-tuning/evaluation
- `data`: Dataset split (`train` or `test`)
- `method`: Evaluation method (`zero-shot` or `finetune`)
- `n_samples`: Number of samples to process (`null` for all)
- `debug`: Enable debug mode for detailed output
- `model`: Model identifier (API or local)
- `api_base`: API endpoint URL
- `api_key`: API key (use `${ENV_VAR}` for environment variables; see the sketch below)
- `local_model_path`: Path to local model files
- `student_model`: Student model for fine-tuning
- `num_threads`: Number of threads for training
- `num_epochs`: Training epochs
- `threshold`: Filtering threshold
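The `${ENV_VAR}` syntax means the value is resolved from the environment at run time. MedVAL's own loader isn't shown here; the following is only a minimal sketch of how such substitution is commonly implemented:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} references in a config value with os.environ["VAR"]."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  value)

# Example: with API_KEY set in the environment, the config value resolves.
os.environ["API_KEY"] = "sk-placeholder"
print(expand_env_vars("${API_KEY}"))  # -> sk-placeholder
```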
Results are automatically saved to the `results/` directory with the following structure:

```
results/
├── zero-shot/
│   └── model_name/
│       └── dataset_name.csv
└── finetune/
    └── model_name/
        └── dataset_name.csv
```
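Each run writes one CSV per dataset under the method/model directories above. A minimal sketch of loading a result file for inspection (`model_name` and `dataset_name` are placeholders matching the tree above; the columns depend on the task):

```python
import pandas as pd

# Load an evaluation result produced by run.py; the path mirrors the
# results/<method>/<model_name>/<dataset_name>.csv layout shown above.
df = pd.read_csv("results/zero-shot/model_name/dataset_name.csv")

print(df.shape)  # number of evaluated samples and columns
print(df.head()) # first few validation records
```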
```
MedVAL/
├── configs/           # Configuration files
├── medval/            # Core package
│   ├── pipeline.py    # Main MedVAL pipeline
│   ├── generator.py   # Text generation module
│   └── validator.py   # Validation module
├── utils/             # Utility functions and prompts
├── agents/            # Fine-tuned model storage
├── results/           # Evaluation results
└── run.py             # Main execution script
```
We welcome contributions to improve MedVAL! Please feel free to submit issues, feature requests, or pull requests.
This repository is built using DSPy for language model fine-tuning/evaluation.
If you find this repository useful for your work, please cite the following paper:
```bibtex
@article{aali2025medval,
  title={MedVAL: Toward Expert-Level Medical Text Validation with Language Models},
  author={Asad Aali and Vasiliki Bikia and Maya Varma and Nicole Chiou and Sophie Ostmeier and Arnav Singhvi and Magdalini Paschali and Ashwin Kumar and Andrew Johnston and Karimar Amador-Martinez and Eduardo Juan Perez Guerrero and Paola Naovi Cruz Rivera and Sergios Gatidis and Christian Bluethgen and Eduardo Pontes Reis and Eddy D. Zandee van Rilland and Poonam Laxmappa Hosamani and Kevin R Keet and Minjoung Go and Evelyn Ling and David B. Larson and Curtis Langlotz and Roxana Daneshjou and Jason Hom and Sanmi Koyejo and Emily Alsentzer and Akshay S. Chaudhari},
  journal={arXiv preprint arXiv:2507.03152},
  year={2025}
}
```