Figure 1 | MedVAL test-time workflow. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. It evaluates the accuracy and safety of AI-generated medical text across multiple medical tasks, and supports both model fine-tuning and evaluation.
Create and activate the conda environment:

```bash
conda env create -f env.yml
conda activate medval
```

Run evaluation:

```bash
python run.py --config=test
```
For evaluating API-based models (OpenAI, Anthropic, Gemini, etc.):

Configuration (`configs/test.yaml`):

```yaml
tasks: [dialogue2note, medication2answer, query2question, report2impression]
data: test
method: zero-shot # [zero-shot, finetune]
n_samples: null
debug: False
input_csv: null # Optional: Path to custom CSV file
model: openai/gpt-4o-mini
api_base: null
api_key: ${API_KEY}
local_model_path: null
```
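Since the config above reads the key from `${API_KEY}`, that variable must be set in the environment before launching. A minimal sketch of one way to supply it (exporting `API_KEY` in your shell is equivalent; the key value is a placeholder):

```python
import os
import subprocess

# Make the key referenced by api_key: ${API_KEY} visible to run.py.
os.environ["API_KEY"] = "sk-..."  # placeholder; substitute your real key

# Launch the evaluation; the child process inherits the environment.
subprocess.run(["python", "run.py", "--config=test"], check=True)
```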
For evaluating local or HuggingFace models:

Configuration (`configs/test.yaml`):

```yaml
tasks: [dialogue2note, medication2answer, query2question, report2impression]
data: test
method: zero-shot # [zero-shot, finetune]
n_samples: null
debug: False
input_csv: null # Optional: Path to custom CSV file
model: local/MODEL_NAME
api_base: null
api_key: null
local_model_path: /path/to/local/model
```

Run fine-tuning:

```bash
python run.py --config=train
```
For fine-tuning a local student model using an API-based teacher model:

Configuration (`configs/train.yaml`):

```yaml
tasks: [medication2answer, query2question, report2impression, report2simplified]
data: train
method: finetune
n_samples: null
debug: False
num_threads: 16
num_epochs: 5
threshold: 0.95
model: openai/gpt-4o-mini
api_base: null
api_key: ${API_KEY}
student_model: local/STUDENT_MODEL_NAME
local_model_path: /path/to/student/model
```
For fine-tuning a local student model using a local teacher model:

Configuration (`configs/train.yaml`):

```yaml
tasks: [medication2answer, query2question, report2impression, report2simplified]
data: train
method: finetune
n_samples: null
debug: False
num_threads: 16
num_epochs: 5
threshold: 0.95
model: local/MODEL_NAME
api_base: null
api_key: null
student_model: local/MODEL_NAME
local_model_path: /path/to/local/model
```
Example model configurations for different providers:

OpenAI:

```yaml
model: openai/MODEL_NAME
api_base: null
api_key: ${OPENAI_API_KEY}
```

Gemini:

```yaml
model: gemini/MODEL_NAME
api_base: null
api_key: ${GEMINI_API_KEY}
```

Anthropic:

```yaml
model: anthropic/MODEL_NAME
api_base: null
api_key: ${ANTHROPIC_API_KEY}
```

HuggingFace models served through an OpenAI-compatible endpoint:

```yaml
model: openai/HUGGINGFACE_MODEL_NAME
api_base: http://SERVER_IP:PORT/v1
api_key: local
```

Ollama:

```yaml
model: ollama_chat/MODEL_NAME
api_base: http://SERVER_IP:PORT
api_key: null
```
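Before pointing MedVAL at a self-hosted HuggingFace model, it can help to verify that the server answers OpenAI-compatible requests. A minimal connectivity check using the official `openai` Python client (this assumes the server, e.g. vLLM, exposes the `/v1` API at the placeholder address above):

```python
from openai import OpenAI

# Point the client at the self-hosted, OpenAI-compatible endpoint.
client = OpenAI(base_url="http://SERVER_IP:PORT/v1", api_key="local")

# List the models the server exposes; a successful call confirms the
# api_base/api_key pair in the config will work.
for model in client.models.list():
    print(model.id)
```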
Dataset Loading:
- By default, the MedVAL-Bench dataset is automatically loaded from HuggingFace: `load_dataset("stanfordmimi/MedVAL-Bench")`.
- To use a custom CSV file, set `input_csv: /path/to/csv` in `configs/test.yaml` (ensure the custom CSV has a column structure similar to the HuggingFace dataset; see the sketch below).
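For reference, a short sketch of inspecting MedVAL-Bench and exporting a split as a column-structure template for a custom CSV (the `test` split name is an assumption; use whichever splits the printout shows):

```python
from datasets import load_dataset

# Download MedVAL-Bench from the HuggingFace Hub (cached after first run).
bench = load_dataset("stanfordmimi/MedVAL-Bench")
print(bench)  # shows the available splits and their columns

# Export one split to CSV as a template for input_csv.
bench["test"].to_pandas().to_csv("medval_bench_template.csv", index=False)
```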
MedVAL-4B Model:
- MedVAL-4B can be downloaded from HuggingFace (`stanfordmimi/MedVAL-4B`). Once downloaded, run evaluation with MedVAL-4B by setting `local_model_path: /path/to/medval-4b` in the config; a download sketch follows below.
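A minimal sketch of fetching the weights with `huggingface_hub` (the local directory is an arbitrary choice; point `local_model_path` at wherever you download them):

```python
from huggingface_hub import snapshot_download

# Download the MedVAL-4B weights from the HuggingFace Hub.
path = snapshot_download(repo_id="stanfordmimi/MedVAL-4B",
                         local_dir="models/medval-4b")  # arbitrary location

# Use this path as local_model_path in configs/test.yaml.
print(path)
```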
Configuration parameters:
- `tasks`: List of tasks for fine-tuning/evaluation
- `data`: Dataset split (`train` or `test`)
- `method`: Evaluation method (`zero-shot` or `finetune`)
- `n_samples`: Number of samples to process (`null` for all)
- `debug`: Enable debug mode for detailed output
- `model`: Model identifier (API or local)
- `api_base`: API endpoint URL
- `api_key`: API key (use `${ENV_VAR}` for environment variables; see the sketch below)
- `local_model_path`: Path to local model files
- `student_model`: Student model for fine-tuning
- `num_threads`: Number of threads for training
- `num_epochs`: Training epochs
- `threshold`: Filtering threshold
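The `${ENV_VAR}` syntax means the value is resolved from the environment at run time. MedVAL's own loader isn't shown here; the following is only a minimal sketch of how such substitution is commonly implemented:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} references in a config value with os.environ["VAR"]."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  value)

# Example: with API_KEY set in the environment, the config value resolves.
os.environ["API_KEY"] = "sk-placeholder"
print(expand_env_vars("${API_KEY}"))  # -> sk-placeholder
```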
Results are automatically saved to the `results/` directory with the following structure:

```
results/
├── zero-shot/
│   └── model_name/
│       └── dataset_name.csv
└── finetune/
    └── model_name/
        └── dataset_name.csv
```
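Each run writes one CSV per dataset under the method/model directories above. A minimal sketch of loading a result file for inspection (`model_name` and `dataset_name` are placeholders matching the tree above; the columns depend on the task):

```python
import pandas as pd

# Load an evaluation result produced by run.py; the path mirrors the
# results/<method>/<model_name>/<dataset_name>.csv layout shown above.
df = pd.read_csv("results/zero-shot/model_name/dataset_name.csv")

print(df.shape)  # number of evaluated samples and columns
print(df.head()) # first few validation records
```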
```
MedVAL/
├── configs/           # Configuration files
├── medval/            # Core package
│   ├── pipeline.py    # Main MedVAL pipeline
│   ├── generator.py   # Text generation module
│   └── validator.py   # Validation module
├── utils/             # Utility functions and prompts
├── agents/            # Fine-tuned model storage
├── results/           # Evaluation results
└── run.py             # Main execution script
```
We welcome contributions to improve MedVAL! Please feel free to submit issues, feature requests, or pull requests.
This repository is built using DSPy for language model fine-tuning/evaluation.
If you find this repository useful for your work, please cite the following paper:
```bibtex
@article{aali2025medval,
  title={MedVAL: Toward Expert-Level Medical Text Validation with Language Models},
  author={Asad Aali and Vasiliki Bikia and Maya Varma and Nicole Chiou and Sophie Ostmeier and Arnav Singhvi and Magdalini Paschali and Ashwin Kumar and Andrew Johnston and Karimar Amador-Martinez and Eduardo Juan Perez Guerrero and Paola Naovi Cruz Rivera and Sergios Gatidis and Christian Bluethgen and Eduardo Pontes Reis and Eddy D. Zandee van Rilland and Poonam Laxmappa Hosamani and Kevin R Keet and Minjoung Go and Evelyn Ling and David B. Larson and Curtis Langlotz and Roxana Daneshjou and Jason Hom and Sanmi Koyejo and Emily Alsentzer and Akshay S. Chaudhari},
  journal={arXiv preprint arXiv:2507.03152},
  year={2025}
}
```