This directory contains benchmark datasets for evaluating biomedical data science agents. The benchmarks cover a range of tasks from data analysis coding to literature research and evidence synthesis.
| Benchmark | Type | # Tasks | Description |
|---|---|---|---|
| BioDSA-1K | Hypothesis Validation | 1,029 | Real biomedical hypothesis validation from published studies |
| BioDSBench-Python | Code Generation | 128 | Python coding tasks for biomedical data analysis |
| BioDSBench-R | Code Generation | 165 | R coding tasks for biomedical data analysis |
| DeepEvidence | Deep Research | 7 task types | Deep knowledge graph research for biomedical discovery |
| HLE-Biomedicine | Reasoning | 40 | Hard biomedicine questions from Humanity's Last Exam |
| HLE-Medicine | Reasoning | 30 | Hard medicine questions from Humanity's Last Exam |
| LabBench | Literature QA | 75 | Literature and database question answering |
| SuperGPQA | Expert QA | 264 | Expert-level biology and medicine questions |
| TrialPanoramaBench | Evidence Synthesis | 50 | Clinical evidence synthesis |
| TRQA-lit | Literature QA | 172 | Translational research question answering |
Location: BioDSA-1K/
1,029 hypothesis validation tasks derived from real biomedical studies. Each task includes a hypothesis statement, supporting evidence, data tables, analysis plan, and ground truth labels.
📄 Paper: BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSA-1K
Structure:
BioDSA-1K/
├── dataset/
│ └── biodsa_1k_hypothesis.parquet
└── README.md
Location: BioDSBench-Python/
128 Python coding tasks for biomedical data analysis, including data preprocessing, statistical analysis, and visualization.
📄 Paper: Can Large Language Models Replace Data Scientists in Biomedical Research?
🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSBench
Structure:
BioDSBench-Python/
├── dataset/
│ ├── python_tasks_with_class.jsonl
│ └── python_task_table_schemas.jsonl
└── README.md
Location: BioDSBench-R/
165 R coding tasks for biomedical data analysis, with task types similar to those in BioDSBench-Python.
🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSBench
Structure:
BioDSBench-R/
├── dataset/
│ ├── R_tasks_with_class.jsonl
│ └── R_task_table_schemas.jsonl
└── README.md
Location: DeepEvidence/
Comprehensive benchmark for deep knowledge graph research tasks spanning the biomedical discovery pipeline. Each task requires agents to search and synthesize evidence from multiple biomedical knowledge bases.
📄 Paper: DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research (In submission)
🤗 Full Dataset: HuggingFace - zifeng-ai/DeepEvidence
Task Types:
| Task | File | Description |
|---|---|---|
| Target Identification | target_identification.parquet | Identify therapeutic targets for diseases |
| MoA Pathway Reasoning | moa_pathway_reasoning.parquet | Reason about drug mechanism of action pathways |
| In Vivo Metabolic Flux Response | in_vivo_metabolic_flux_response.parquet | Predict metabolic responses in preclinical models |
| Drug Regimen Design | drug_regimen_design.parquet | Design drug dosing regimens based on safety data |
| Surrogate Endpoint Discovery | surrogate_endpoint_discovery.parquet | Identify surrogate endpoints for clinical trials |
| Sample Size Estimation | sample_size_estimation.parquet | Estimate required sample sizes for trials |
| Evidence Gap Discovery | evidence_gap_discovery.parquet | Identify gaps in existing clinical evidence |
Dataset Structure (HuggingFace):
DeepEvidence/
├── target_identification.parquet
├── moa_pathway_reasoning.parquet
├── in_vivo_metabolic_flux_response.parquet
├── drug_regimen_design.parquet
├── surrogate_endpoint_discovery.parquet
├── sample_size_estimation.parquet
└── evidence_gap_discovery.parquet
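
The seven task files above can be loaded in one pass. The following is a minimal sketch, assuming the dataset has been downloaded to a local `DeepEvidence/` directory with the layout shown; the helper names `deepevidence_paths` and `load_deepevidence` are illustrative and not part of any released package.

```python
from pathlib import Path

# Task names are taken from the file names in the task table above.
DEEPEVIDENCE_TASKS = [
    "target_identification",
    "moa_pathway_reasoning",
    "in_vivo_metabolic_flux_response",
    "drug_regimen_design",
    "surrogate_endpoint_discovery",
    "sample_size_estimation",
    "evidence_gap_discovery",
]


def deepevidence_paths(root="DeepEvidence"):
    """Map each task name to its expected parquet path under `root`."""
    base = Path(root)
    return {name: base / f"{name}.parquet" for name in DEEPEVIDENCE_TASKS}


def load_deepevidence(root="DeepEvidence"):
    """Load every task file found locally into a dict of DataFrames."""
    # Imported lazily; pd.read_parquet also needs pyarrow or fastparquet installed.
    import pandas as pd

    return {
        name: pd.read_parquet(path)
        for name, path in deepevidence_paths(root).items()
        if path.exists()  # missing files are skipped rather than raising
    }
```

Calling `load_deepevidence()` returns only the tasks whose files are present, so comparing its keys against `DEEPEVIDENCE_TASKS` shows which files still need to be downloaded.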
Location: HLE-biomedicine/
40 hard biomedicine questions from Humanity's Last Exam, filtered for questions that don't require images.
Files:
- hle_biomedicine_40.csv - 40 selected biomedicine questions
Location: HLE-medicine/
30 hard medicine questions from Humanity's Last Exam.
Files:
- hle_medicine_30.csv - 30 selected medicine questions
Location: LabBench/
Literature and database question answering benchmark.
Files:
- LitQA2_25.csv - 25 literature QA questions
- DBQA_50.csv - 50 database QA questions
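
The two LabBench files together should account for the 75 tasks listed in the overview table. A minimal sketch for checking this locally, assuming both CSVs have a header row; the names `read_rows` and `check_labbench` are illustrative.

```python
import csv
from pathlib import Path

# Expected row counts, taken from the file descriptions above.
LABBENCH_FILES = {"LitQA2_25.csv": 25, "DBQA_50.csv": 50}


def read_rows(path):
    """Read a CSV file into a list of dicts, one per data row."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def check_labbench(root="LabBench"):
    """Return {file name: (rows found, rows expected)} for each LabBench file."""
    report = {}
    for name, expected in LABBENCH_FILES.items():
        rows = read_rows(Path(root) / name)
        report[name] = (len(rows), expected)
    return report
```

The sketch uses the standard `csv` module rather than counting lines, because question text may contain quoted line breaks that would inflate a raw line count.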
Location: SuperGPQA/
Expert-level (graduate and professional) questions in biology and medicine from SuperGPQA.
Files:
- SuperGPQA-hard-medicine-172.csv - 172 hard medicine questions
Location: TrialPanoramaBench/
Benchmark for clinical evidence synthesis tasks.
Files:
- evidence_synthesis_50.csv - 50 evidence synthesis tasks
Location: TRQA-lit/
Translational research question answering based on literature.
Files:
- TRQA-lit-choice-172.csv - 172 multiple-choice questions
- TRQA-lit-choice-coreset.csv - Core subset
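
The benchmark files in this directory come in three formats: parquet, JSONL, and CSV. A small dispatcher keyed on the file extension can hide that difference from evaluation scripts. This is a sketch under that assumption only; `load_benchmark` is a hypothetical helper, and its return types (list of dicts versus DataFrame) are a design choice rather than anything the benchmarks require.

```python
import csv
import json
from pathlib import Path


def load_benchmark(path):
    """Load a benchmark file based on its extension.

    Returns a list of dicts for .jsonl and .csv files,
    and a pandas DataFrame for .parquet files.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        with open(path, encoding="utf-8") as f:
            # Skip blank lines so a trailing newline does not raise a decode error.
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".parquet":
        # Imported lazily; pd.read_parquet also needs pyarrow or fastparquet.
        import pandas as pd

        return pd.read_parquet(path)
    raise ValueError(f"Unsupported benchmark file type: {path.name}")
```

For example, `load_benchmark("TRQA-lit/TRQA-lit-choice-172.csv")` would return one dict per question, keyed by whatever column names the CSV header defines.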
import pandas as pd

# BioDSA-1K
df = pd.read_parquet("BioDSA-1K/dataset/biodsa_1k_hypothesis.parquet")

import json

# BioDSBench-Python: one JSON object per line
tasks = []
with open("BioDSBench-Python/dataset/python_tasks_with_class.jsonl") as f:
    for line in f:
        tasks.append(json.loads(line))

import pandas as pd

# SuperGPQA
df = pd.read_csv("SuperGPQA/SuperGPQA-hard-medicine-172.csv")

If you use these benchmarks, please cite the relevant papers:
@article{wang2025deepevidence,
title={DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research},
author={Wang, Zifeng and others},
journal={In submission},
year={2025}
}
@article{wang2025biodsa1k,
title={BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research},
author={Wang, Zifeng and Danek, Benjamin and Sun, Jimeng},
journal={arXiv preprint arXiv:2505.16100},
year={2025}
}
@article{wang2024llm,
title={Can Large Language Models Replace Data Scientists in Biomedical Research?},
author={Wang, Zifeng and Danek, Benjamin and Yang, Ziwei and Chen, Zheng and Sun, Jimeng},
journal={arXiv preprint arXiv:2410.21591},
year={2024}
}