TrialPanorama is a large-scale structured resource that aggregates 1.6M clinical trial records from fifteen global registries and links them with biomedical ontologies and associated literature. This repository provides a comprehensive benchmark framework for evaluating language models on clinical trial-related tasks, designed to assess model capabilities in understanding and reasoning about medical research.
- Systematic Review Tasks:
  - Study Search: Finding relevant clinical trials given a systematic review setup
  - Study Screening: Determining whether clinical trials should be included in a systematic review based on eligibility criteria
  - Evidence Summary: Generating evidence summaries from clinical trial data
- Clinical Trial Design Tasks:
  - Trial Completion Assessment: Predicting whether a clinical trial will complete successfully or terminate prematurely, including the reason for termination
  - Arm Design: Designing appropriate trial arms
  - Eligibility Criteria Design: Creating inclusion/exclusion criteria
  - Endpoint Design: Defining primary and secondary outcomes
  - Sample Size Estimation: Calculating appropriate sample sizes
Paper: Developing Large Language Models for Clinical Research Using One Million Clinical Trials
We introduce TrialPanorama, a comprehensive resource for developing and evaluating AI systems for clinical research. The dataset includes:
- 1.6M clinical trial records from fifteen global registries
- Links to biomedical ontologies and associated literature
- 152K training and testing samples across eight clinical research tasks
Dataset: TrialPanorama/Dataset
The dataset includes supervised fine-tuning data for:
- Systematic review workflows (study search, screening, evidence summarization)
- Trial design and optimization (arm design, eligibility criteria, endpoints, sample size estimation, completion assessment)
The benchmark framework requires Python 3.8+ and uses pipenv for dependency management.
# Install pipenv if you don't have it
pip install pipenv
# Install dependencies using pipenv
pipenv install
# Activate the virtual environment
pipenv shell

All benchmark scripts should be run within the pipenv virtual environment.
Create a .env file in the root directory with your API keys:
# Azure OpenAI
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
# Azure OpenAI Deployment Configuration
AZURE_DEPLOYMENT_GPT4O=deployment_name_for_gpt4o
AZURE_DEPLOYMENT_GPT4O_MINI=deployment_name_for_gpt4o_mini
AZURE_DEPLOYMENT_O3_MINI=deployment_name_for_o3_mini
# PubMed API (optional, for study search)
PUBMED_API_KEY=your_pubmed_api_key
# Path Configuration
DATABASE_PATH=/path/to/benchmark_data
BENCHMARK_DATA_PATH=/path/to/benchmark_results
The .env file will be automatically loaded by pipenv when you activate the virtual environment.
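A missing key usually surfaces only when a benchmark makes its first API call, so it can help to check the environment up front. The sketch below is not part of the repository. It assumes the variable names from the example above, and it treats the deployment variables as model-dependent, so they are left out of the required list.

```python
import os

# Variable names come from the .env example above. The PubMed key is marked
# optional there; deployment variables depend on which models you run, so
# this sketch (an assumption, not the framework's own check) skips them.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "DATABASE_PATH",
    "BENCHMARK_DATA_PATH",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` inside `pipenv shell` returns an empty list when every required variable has been loaded.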
Benchmark datasets should be structured as follows:
/path/to/benchmark_data/
├── study_search/
│   ├── train.jsonl
│   └── test.jsonl
├── study_screening/
│   ├── train.jsonl
│   └── test.jsonl
├── trial_completion/
│   ├── train.jsonl
│   └── test.jsonl
├── design_arms_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── design_criteria_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── design_outcome_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── evidence_summary_qa/
│   ├── train.jsonl
│   └── test.jsonl
└── sample_size_estimation/
    ├── train.jsonl
    └── test.jsonl
Each dataset follows a specific format detailed in the task documentation.
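The layout above can be checked with a few lines of Python before launching anything. This is a minimal sketch rather than a script shipped with the repository. The directory and file names are copied from the tree, and the helper name `missing_dataset_files` is hypothetical.

```python
from pathlib import Path

# Task directories and split files as listed in the layout above.
TASK_DIRS = [
    "study_search",
    "study_screening",
    "trial_completion",
    "design_arms_qa",
    "design_criteria_qa",
    "design_outcome_qa",
    "evidence_summary_qa",
    "sample_size_estimation",
]
SPLITS = ["train.jsonl", "test.jsonl"]


def missing_dataset_files(root):
    """Return the expected split files that do not exist under root."""
    root = Path(root)
    return [
        root / task / split
        for task in TASK_DIRS
        for split in SPLITS
        if not (root / task / split).is_file()
    ]
```

An empty return value means all sixteen files are in place. The function checks only for presence, not for the per-task record format.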
Before running full benchmarks, you can verify your setup using the sanity check script:
# Make sure you're in the pipenv shell
pipenv shell
# Run with default settings (gpt-4o-mini and 2 samples per task)
./sanity_check.sh

This script:
- Runs each benchmark with a minimal number of samples
- Reports success/failure for each task
- Logs detailed output for troubleshooting
- Creates a logs/ directory with results from each test
If all tests pass, your environment is correctly configured for running benchmarks.
Ensure you're in the pipenv environment before running any benchmark script:
pipenv shell

# Run a single model
python benchmark_scripts/run_study_search.py --model-name gpt-4o
# Customize the system prompt
python benchmark_scripts/run_study_search.py --model-name gpt-4o --system-prompt "Your custom system prompt"
# Limit the number of samples
python benchmark_scripts/run_study_search.py --model-name gpt-4o --num-samples 10

# Run a single model
python benchmark_scripts/run_study_screening.py --model-name gpt-4o
# Customize options
python benchmark_scripts/run_study_screening.py --model-name gpt-4o --system-prompt "Custom prompt" --num-samples 10

# Run a single model
python benchmark_scripts/run_trial_completion.py --model-name gpt-4o
# Customize options
python benchmark_scripts/run_trial_completion.py --model-name gpt-4o --system-prompt "Custom prompt" --num-samples 10

# Arm Design
python benchmark_scripts/run_arm_design.py --model-name gpt-4o
# Eligibility Criteria Design
python benchmark_scripts/run_eligibility_criteria_design.py --model-name gpt-4o
# Endpoint Design
python benchmark_scripts/run_endpoint_design.py --model-name gpt-4o
# Evidence Summary
python benchmark_scripts/run_evidence_summary.py --model-name gpt-4o
# Sample Size Estimation
python benchmark_scripts/run_sample_size_estimation.py --model-name gpt-4o

For each task, a shell script is provided to run benchmarks for multiple models in parallel. Make sure you're in the pipenv environment before running these scripts:
pipenv shell

Then run the desired benchmark script:
# Study-related Tasks
./benchmark_study_search.sh
./benchmark_study_screening.sh
./benchmark_evidence_summary.sh
# Design-related Tasks
./benchmark_arm_design.sh
./benchmark_eligibility_criteria_design.sh
./benchmark_endpoint_design.sh
./benchmark_sample_size_estimation.sh
./benchmark_trial_completion.sh

These scripts:
- Run benchmarks for three models (gpt-4o-mini, gpt-4o, and o3-mini)
- Use nohup to ensure processes continue even if your terminal session closes
- Log output to task-specific log files
- Return process IDs for monitoring
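For readers who prefer to drive runs from Python, the behavior listed above can be approximated as follows. This is a sketch, not the contents of the shell scripts. The command line mirrors the single-model examples, the `logs/<task>_<model>.log` naming is inferred from the one log file name shown in this README, and `start_new_session=True` (POSIX only) stands in for `nohup` as a rough equivalent.

```python
import subprocess
from pathlib import Path

MODELS = ["gpt-4o-mini", "gpt-4o", "o3-mini"]


def build_command(task, model):
    """Command line for one run, mirroring the single-model examples."""
    return ["python", f"benchmark_scripts/run_{task}.py", "--model-name", model]


def log_path(task, model, log_dir="logs"):
    """Log file path; the <task>_<model>.log pattern is an assumption."""
    return Path(log_dir) / f"{task}_{model}.log"


def launch(task, models=MODELS, log_dir="logs"):
    """Start one detached process per model and return their process IDs."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    pids = []
    for model in models:
        # The child inherits the file descriptor, so the parent can close it.
        with open(log_path(task, model, log_dir), "w") as log_file:
            proc = subprocess.Popen(
                build_command(task, model),
                stdout=log_file,
                stderr=subprocess.STDOUT,
                start_new_session=True,  # detach from the terminal's session
            )
        pids.append(proc.pid)
    return pids
```

For example, `launch("endpoint_design")` would start three processes and return their PIDs, which can then be watched with `ps`.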
# View running benchmark processes
ps aux | grep run_
# Check log files for a specific task
tail -f logs/endpoint_design_gpt-4o.log

Results are saved to the path specified in your .env file (BENCHMARK_DATA_PATH):
benchmark_results/
├── study_search/
│   └── [model_name]/
│       └── [timestamp]/
│           ├── results.json  # Detailed results including predictions
│           └── metrics.json  # Summary metrics only
├── study_screening/
├── trial_completion/
├── design_arms_qa/
├── design_criteria_qa/
├── design_outcome_qa/
├── evidence_summary_qa/
└── sample_size_estimation/
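Given this layout, the most recent summary for a task and model can be read with a short helper. The sketch below assumes that timestamp directory names sort chronologically as strings, which this README does not confirm. The function name `latest_metrics` is hypothetical.

```python
import json
from pathlib import Path


def latest_metrics(results_root, task, model_name):
    """Load metrics.json from the newest run directory, or None if absent."""
    model_dir = Path(results_root) / task / model_name
    if not model_dir.is_dir():
        return None
    # Assumption: timestamp directory names sort chronologically as strings.
    runs = sorted(p for p in model_dir.iterdir() if (p / "metrics.json").is_file())
    if not runs:
        return None
    with open(runs[-1] / "metrics.json") as f:
        return json.load(f)
```

If the timestamps turn out not to sort lexicographically, sorting on each directory's modification time would be a reasonable substitute.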
Each task reports specific metrics:
- precision, recall, f1: Standard classification metrics
- accuracy: Overall accuracy across all studies

- outcome_prediction.accuracy: Accuracy of predicting completion vs. termination
- termination_type.accuracy: Accuracy of predicting the specific termination reason

- accuracy: Proportion of questions answered correctly
- f1_score: For questions with potentially multiple correct answers
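As a reference for the classification metrics listed above, the sketch below computes them from binary labels using the standard definitions. It illustrates the formulas and is not the framework's evaluation code. Treating label 1 as the positive class and returning 0.0 on an empty denominator are choices made here.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    # Empty denominators fall back to 0.0 instead of raising.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```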
The framework currently supports Azure OpenAI models (gpt-4o, gpt-4o-mini, o3-mini, etc.).
You can modify benchmark/models/ to add support for additional models.
Each task is implemented as a class that inherits from the base Task class.
Task classes handle:
- Input preparation
- Prediction parsing
- Evaluation against ground truth
- Results calculation
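The four responsibilities above suggest a shape like the following. This is a hypothetical sketch. The method names (`prepare_input`, `parse_prediction`, `evaluate`, `calculate_results`), the sample fields (`question`, `options`, `answer`), and the `MultipleChoiceTask` subclass are all invented for illustration and may not match the actual base class in the repository.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Hypothetical base class; the real interface may differ."""

    @abstractmethod
    def prepare_input(self, sample):
        """Turn one dataset record into a prompt string."""

    @abstractmethod
    def parse_prediction(self, raw_output):
        """Extract a structured prediction from raw model output."""

    @abstractmethod
    def evaluate(self, prediction, ground_truth):
        """Return True if the prediction matches the ground truth."""

    def calculate_results(self, scores):
        """Aggregate per-sample scores into summary metrics."""
        return {"accuracy": sum(scores) / len(scores) if scores else 0.0}

    def run(self, samples, model_fn):
        """Apply model_fn (prompt -> text) to every sample and score it."""
        scores = []
        for sample in samples:
            raw = model_fn(self.prepare_input(sample))
            scores.append(self.evaluate(self.parse_prediction(raw), sample["answer"]))
        return self.calculate_results(scores)


class MultipleChoiceTask(Task):
    """Toy subclass for a single-answer multiple-choice question."""

    def prepare_input(self, sample):
        options = "\n".join(f"{k}. {v}" for k, v in sample["options"].items())
        return f"{sample['question']}\n{options}\nAnswer with a single letter."

    def parse_prediction(self, raw_output):
        # Keep only the first non-space character, upper-cased.
        return raw_output.strip()[:1].upper()

    def evaluate(self, prediction, ground_truth):
        return prediction == ground_truth
```

Keeping parsing separate from evaluation means a new task only needs to override the steps that differ, while `run` stays shared.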
If you find this project useful, please cite our paper:
@article{wang2025trialpanorama,
title = {Developing Large Language Models for Clinical Research Using One Million Clinical Trials},
author = {Wang, Zifeng and Lin, Jiacheng and Jin, Qiao and Gao, Junyi and Pradeepkumar, Jathurshan and Jiang, Pengcheng and Lu, Zhiyong and Sun, Jimeng},
journal = {arXiv preprint arXiv:2505.16097},
year = {2025},
url = {https://arxiv.org/abs/2505.16097}
}