TrialPanorama: Developing Large Language Models for Clinical Research

TrialPanorama is a large-scale structured resource that aggregates 1.6M clinical trial records from fifteen global registries and links them with biomedical ontologies and associated literature. This repository provides a comprehensive benchmark framework for evaluating language models on clinical trial-related tasks, designed to assess model capabilities in understanding and reasoning about medical research.

Systematic Review Tasks:
- Study Search: Finding relevant clinical trials given a systematic review setup
- Study Screening: Determining whether clinical trials should be included in a systematic review based on eligibility criteria
- Evidence Summary: Generating evidence summaries from clinical trial data
Clinical Trial Design Tasks:
- Trial Completion Assessment: Predicting whether a clinical trial will complete successfully or terminate prematurely, including the reason for termination
- Arm Design: Designing appropriate trial arms
- Eligibility Criteria Design: Creating inclusion/exclusion criteria
- Endpoint Design: Defining primary and secondary outcomes
- Sample Size Estimation: Calculating appropriate sample sizes

Paper and Dataset

📄 Paper: Developing Large Language Models for Clinical Research Using One Million Clinical Trials

We introduce TrialPanorama, a comprehensive resource for developing and evaluating AI systems for clinical research. The dataset includes:

1.6M clinical trial records from fifteen global registries
Links to biomedical ontologies and associated literature
152K training and testing samples across eight clinical research tasks

🤗 Dataset: TrialPanorama/Dataset

The dataset includes supervised fine-tuning data for:

Systematic review workflows (study search, screening, evidence summarization)
Trial design and optimization (arm design, eligibility criteria, endpoints, sample size estimation, completion assessment)

Installation

The benchmark framework requires Python 3.8+ and uses pipenv for dependency management.

# Install pipenv if you don't have it
pip install pipenv

# Install dependencies using pipenv
pipenv install

# Activate the virtual environment
pipenv shell

All benchmark scripts should be run within the pipenv virtual environment.

Setup Environment Variables

Create a .env file in the root directory with your API keys:

# Azure OpenAI
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint

# Azure OpenAI Deployment Configuration
AZURE_DEPLOYMENT_GPT4O=deployment_name_for_gpt4o
AZURE_DEPLOYMENT_GPT4O_MINI=deployment_name_for_gpt4o_mini
AZURE_DEPLOYMENT_O3_MINI=deployment_name_for_o3_mini

# PubMed API (optional, for study search)
PUBMED_API_KEY=your_pubmed_api_key

# Path Configuration
DATABASE_PATH=/path/to/benchmark_data
BENCHMARK_DATA_PATH=/path/to/benchmark_results

The .env file will be automatically loaded by pipenv when you activate the virtual environment.

Dataset Structure

Benchmark datasets should be structured as follows:

/path/to/benchmark_data/
├── study_search/
│   ├── train.jsonl
│   └── test.jsonl
├── study_screening/
│   ├── train.jsonl
│   └── test.jsonl
├── trial_completion/
│   ├── train.jsonl
│   └── test.jsonl
├── design_arms_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── design_criteria_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── design_outcome_qa/
│   ├── train.jsonl
│   └── test.jsonl
├── evidence_summary_qa/
│   ├── train.jsonl
│   └── test.jsonl
└── sample_size_estimation/
    ├── train.jsonl
    └── test.jsonl

Each dataset follows a specific format detailed in the task documentation.

Verifying System Setup

Before running full benchmarks, you can verify your setup using the sanity check script:

# Make sure you're in the pipenv shell
pipenv shell

# Run with default settings (gpt-4o-mini and 2 samples per task)
./sanity_check.sh

This script:

Runs each benchmark with a minimal number of samples
Reports success/failure for each task
Logs detailed output for troubleshooting
Creates a logs/ directory with results from each test

If all tests pass, your environment is correctly configured for running benchmarks.

Running Individual Benchmarks

Ensure you're in the pipenv environment before running any benchmark script:

pipenv shell

Study Search

# Run a single model
python benchmark_scripts/run_study_search.py --model-name gpt-4o

# Customize the system prompt
python benchmark_scripts/run_study_search.py --model-name gpt-4o --system-prompt "Your custom system prompt"

# Limit the number of samples
python benchmark_scripts/run_study_search.py --model-name gpt-4o --num-samples 10

Study Screening

# Run a single model
python benchmark_scripts/run_study_screening.py --model-name gpt-4o

# Customize options
python benchmark_scripts/run_study_screening.py --model-name gpt-4o --system-prompt "Custom prompt" --num-samples 10

Trial Completion Assessment

# Run a single model
python benchmark_scripts/run_trial_completion.py --model-name gpt-4o

# Customize options
python benchmark_scripts/run_trial_completion.py --model-name gpt-4o --system-prompt "Custom prompt" --num-samples 10

Clinical Trial Design Tasks

# Arm Design
python benchmark_scripts/run_arm_design.py --model-name gpt-4o

# Eligibility Criteria Design
python benchmark_scripts/run_eligibility_criteria_design.py --model-name gpt-4o

# Endpoint Design
python benchmark_scripts/run_endpoint_design.py --model-name gpt-4o

# Evidence Summary
python benchmark_scripts/run_evidence_summary.py --model-name gpt-4o

# Sample Size Estimation
python benchmark_scripts/run_sample_size_estimation.py --model-name gpt-4o

Running Batch Benchmarks

For each task, a shell script is provided to run benchmarks for multiple models in parallel. Make sure you're in the pipenv environment before running these scripts:

pipenv shell

Then run the desired benchmark script:

# Study-related Tasks
./benchmark_study_search.sh
./benchmark_study_screening.sh
./benchmark_evidence_summary.sh

# Design-related Tasks
./benchmark_arm_design.sh
./benchmark_eligibility_criteria_design.sh
./benchmark_endpoint_design.sh
./benchmark_evidence_summary.sh
./benchmark_sample_size_estimation.sh
./benchmark_trial_completion.sh

These scripts:

Run benchmarks for three models (gpt-4o-mini, gpt-4o, and o3-mini)
Use nohup to ensure processes continue even if your terminal session closes
Log output to task-specific log files
Return process IDs for monitoring

Checking Batch Progress

# View running benchmark processes
ps aux | grep run_

# Check log files for a specific task
tail -f logs/endpoint_design_gpt-4o.log

Benchmark Output Structure

Results are saved to the path specified in your .env file (BENCHMARK_DATA_PATH):

benchmark_results/
├── study_search/
│   └── [model_name]/
│       └── [timestamp]/
│           ├── results.json       # Detailed results including predictions
│           └── metrics.json       # Summary metrics only
├── study_screening/
├── trial_completion/
├── design_arms_qa/
├── design_criteria_qa/
├── design_outcome_qa/
├── evidence_summary_qa/
└── sample_size_estimation/

Performance Metrics

Each task reports specific metrics:

Study Search & Screening Metrics

precision, recall, f1: Standard classification metrics
accuracy: Overall accuracy across all studies

Trial Completion Metrics

outcome_prediction.accuracy: Accuracy of predicting completion vs termination
termination_type.accuracy: Accuracy of predicting the specific termination reason

QA Tasks Metrics

accuracy: Proportion of questions answered correctly
f1_score: For questions with potentially multiple correct answers

Customizing the Framework

Models

The framework currently supports Azure OpenAI models (gpt-4o, gpt-4o-mini, o3-mini, etc.).

You can modify benchmark/models/ to add support for additional models.

Tasks

Each task is implemented as a class that inherits from the base Task class.

Task classes handle:

Input preparation
Prediction parsing
Evaluation against ground truth
Results calculation

Citation

If you find this project useful, please cite our paper:

@article{wang2025trialpanorama,
  title     = {Developing Large Language Models for Clinical Research Using One Million Clinical Trials},
  author    = {Wang, Zifeng and Lin, Jiacheng and Jin, Qiao and Gao, Junyi and Pradeepkumar, Jathurshan and Jiang, Pengcheng and Lu, Zhiyong and Sun, Jimeng},
  journal   = {arXiv preprint arXiv:2505.16097},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.16097}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TrialPanorama: Developing Large Language Models for Clinical Research

Paper and Dataset

Installation

Setup Environment Variables

Dataset Structure

Verifying System Setup

Running Individual Benchmarks

Study Search

Study Screening

Trial Completion Assessment

Clinical Trial Design Tasks

Running Batch Benchmarks

Checking Batch Progress

Benchmark Output Structure

Performance Metrics

Study Search & Screening Metrics

Trial Completion Metrics

QA Tasks Metrics

Customizing the Framework

Models

Tasks

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
benchmark		benchmark
benchmark_scripts		benchmark_scripts
data_pipeline		data_pipeline
train		train
visualization		visualization
.gitignore		.gitignore
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
benchmark_arm_design.sh		benchmark_arm_design.sh
benchmark_eligibility_criteria_design.sh		benchmark_eligibility_criteria_design.sh
benchmark_endpoint_design.sh		benchmark_endpoint_design.sh
benchmark_evidence_summary.sh		benchmark_evidence_summary.sh
benchmark_sample_size_estimation.sh		benchmark_sample_size_estimation.sh
benchmark_study_screening.sh		benchmark_study_screening.sh
benchmark_study_search.sh		benchmark_study_search.sh
benchmark_trial_completion.sh		benchmark_trial_completion.sh
sanity_check.sh		sanity_check.sh

Folders and files

Latest commit

History

Repository files navigation

TrialPanorama: Developing Large Language Models for Clinical Research

Paper and Dataset

Installation

Setup Environment Variables

Dataset Structure

Verifying System Setup

Running Individual Benchmarks

Study Search

Study Screening

Trial Completion Assessment

Clinical Trial Design Tasks

Running Batch Benchmarks

Checking Batch Progress

Benchmark Output Structure

Performance Metrics

Study Search & Screening Metrics

Trial Completion Metrics

QA Tasks Metrics

Customizing the Framework

Models

Tasks

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages