
BioDSA Benchmarks

This directory contains benchmark datasets for evaluating biomedical data science agents. The benchmarks cover a range of tasks from data analysis coding to literature research and evidence synthesis.

Overview

Benchmark	Type	# Tasks	Description
BioDSA-1K	Hypothesis Validation	1,029	Real biomedical hypothesis validation from published studies
BioDSBench-Python	Code Generation	128	Python coding tasks for biomedical data analysis
BioDSBench-R	Code Generation	165	R coding tasks for biomedical data analysis
DeepEvidence	Deep Research	7 task types	Deep knowledge graph research for biomedical discovery
HLE-Biomedicine	Reasoning	40	Hard biomedicine questions from Humanity's Last Exam
HLE-Medicine	Reasoning	30	Hard medicine questions from Humanity's Last Exam
LabBench	Literature QA	75	Literature and database question answering
SuperGPQA	Expert QA	264	Expert-level biology and medicine questions
TrialPanoramaBench	Evidence Synthesis	50	Clinical evidence synthesis
TRQA-lit	Literature QA	172	Translational research question answering

BioDSA-1K

Location: BioDSA-1K/

1,029 hypothesis validation tasks derived from real biomedical studies. Each task includes a hypothesis statement, supporting evidence, data tables, an analysis plan, and ground-truth labels.

📄 Paper: BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSA-1K

Structure:

BioDSA-1K/
├── dataset/
│   └── biodsa_1k_hypothesis.parquet
└── README.md

BioDSBench-Python

Location: BioDSBench-Python/

128 Python coding tasks for biomedical data analysis, including data preprocessing, statistical analysis, and visualization.

📄 Paper: Can Large Language Models Replace Data Scientists in Biomedical Research?

🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSBench

Structure:

BioDSBench-Python/
├── dataset/
│   ├── python_tasks_with_class.jsonl
│   └── python_task_table_schemas.jsonl
└── README.md

BioDSBench-R

Location: BioDSBench-R/

165 R coding tasks for biomedical data analysis, covering task types similar to those in BioDSBench-Python.

🤗 Full Dataset: HuggingFace - zifeng-ai/BioDSBench

Structure:

BioDSBench-R/
├── dataset/
│   ├── R_tasks_with_class.jsonl
│   └── R_task_table_schemas.jsonl
└── README.md

DeepEvidence

Location: DeepEvidence/

Comprehensive benchmark for deep knowledge graph research tasks spanning the biomedical discovery pipeline. Each task requires agents to search and synthesize evidence from multiple biomedical knowledge bases.

📄 Paper: DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research (In submission)

🤗 Full Dataset: HuggingFace - zifeng-ai/DeepEvidence

Task Types:

Task	File	Description
Target Identification	`target_identification.parquet`	Identify therapeutic targets for diseases
MoA Pathway Reasoning	`moa_pathway_reasoning.parquet`	Reason about drug mechanism of action pathways
In Vivo Metabolic Flux Response	`in_vivo_metabolic_flux_response.parquet`	Predict metabolic responses in preclinical models
Drug Regimen Design	`drug_regimen_design.parquet`	Design drug dosing regimens based on safety data
Surrogate Endpoint Discovery	`surrogate_endpoint_discovery.parquet`	Identify surrogate endpoints for clinical trials
Sample Size Estimation	`sample_size_estimation.parquet`	Estimate required sample sizes for trials
Evidence Gap Discovery	`evidence_gap_discovery.parquet`	Identify gaps in existing clinical evidence

Dataset Structure (HuggingFace):

DeepEvidence/
├── target_identification.parquet
├── moa_pathway_reasoning.parquet
├── in_vivo_metabolic_flux_response.parquet
├── drug_regimen_design.parquet
├── surrogate_endpoint_discovery.parquet
├── sample_size_estimation.parquet
└── evidence_gap_discovery.parquet
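The seven files listed above can be collected into one dictionary keyed by task name. The following is a minimal sketch, assuming the parquet files have been downloaded into a local `DeepEvidence/` directory with the layout shown. The helper names `deepevidence_paths` and `load_deepevidence` and the `root` argument are illustrative and not part of the benchmark.

```python
from pathlib import Path

# File stems taken from the Task Types table above.
DEEPEVIDENCE_TASKS = [
    "target_identification",
    "moa_pathway_reasoning",
    "in_vivo_metabolic_flux_response",
    "drug_regimen_design",
    "surrogate_endpoint_discovery",
    "sample_size_estimation",
    "evidence_gap_discovery",
]


def deepevidence_paths(root="DeepEvidence"):
    """Map each task name to its expected parquet path under `root` (hypothetical helper)."""
    return {task: Path(root) / f"{task}.parquet" for task in DEEPEVIDENCE_TASKS}


def load_deepevidence(root="DeepEvidence"):
    """Read every task file that exists locally into a dict of DataFrames."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return {
        task: pd.read_parquet(path)
        for task, path in deepevidence_paths(root).items()
        if path.exists()  # skip files that have not been downloaded
    }
```

Calling `load_deepevidence()` returns only the tasks whose files are present, so a partial download does not raise an error.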

HLE-Biomedicine

Location: HLE-biomedicine/

40 hard biomedicine questions from Humanity's Last Exam, restricted to questions that do not require images.

Files:

hle_biomedicine_40.csv - 40 selected biomedicine questions

HLE-Medicine

Location: HLE-medicine/

30 hard medicine questions from Humanity's Last Exam.

Files:

hle_medicine_30.csv - 30 selected medicine questions

LabBench

Location: LabBench/

Literature and database question answering benchmark.

Files:

LitQA2_25.csv - 25 literature QA questions
DBQA_50.csv - 50 database QA questions

SuperGPQA

Location: SuperGPQA/

Graduate- and professional-level questions in biology and medicine from SuperGPQA.

Files:

SuperGPQA-hard-medicine-172.csv - 172 hard medicine questions

TrialPanoramaBench

Location: TrialPanoramaBench/

Benchmark for clinical trial design tasks.

Files:

evidence_synthesis_50.csv - 50 evidence synthesis tasks

TRQA-lit

Location: TRQA-lit/

Translational research question answering based on literature.

Files:

TRQA-lit-choice-172.csv - 172 multiple-choice questions
TRQA-lit-choice-coreset.csv - Core subset

Usage

Loading Parquet Files

import pandas as pd

# BioDSA-1K
df = pd.read_parquet("BioDSA-1K/dataset/biodsa_1k_hypothesis.parquet")

Loading JSONL Files

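import json

# Each line of a JSONL file is one JSON object (one task per line)
tasks = []
with open("BioDSBench-Python/dataset/python_tasks_with_class.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines such as a trailing newline
            tasks.append(json.loads(line))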
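Because each BioDSBench directory ships two JSONL files (tasks and table schemas), the loop above can be wrapped in a small reusable helper. This is a sketch only: `load_jsonl` is a hypothetical name, and the error handling is an assumption about what is convenient, not something the benchmark prescribes.

```python
import json


def load_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line of `path`."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a corrupt download is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err.msg})") from err
    return records


# Example usage with the paths from the directory listings above:
# tasks = load_jsonl("BioDSBench-Python/dataset/python_tasks_with_class.jsonl")
# schemas = load_jsonl("BioDSBench-Python/dataset/python_task_table_schemas.jsonl")
```

The same helper applies to the R files by swapping in the `BioDSBench-R/dataset/` paths.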

Loading CSV Files

import pandas as pd

df = pd.read_csv("SuperGPQA/SuperGPQA-hard-medicine-172.csv")
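Several CSV file names in this README state how many questions they contain, so a quick sanity check after downloading is to compare row counts against those numbers. The sketch below uses only the standard library. The expected counts are copied from the file descriptions above, while `count_rows`, `check_counts`, and the assumption that every file has exactly one header row are illustrative.

```python
import csv
from pathlib import Path

# Expected row counts, copied from the file descriptions in this README.
EXPECTED_ROWS = {
    "HLE-biomedicine/hle_biomedicine_40.csv": 40,
    "HLE-medicine/hle_medicine_30.csv": 30,
    "LabBench/LitQA2_25.csv": 25,
    "LabBench/DBQA_50.csv": 50,
    "SuperGPQA/SuperGPQA-hard-medicine-172.csv": 172,
    "TrialPanoramaBench/evidence_synthesis_50.csv": 50,
    "TRQA-lit/TRQA-lit-choice-172.csv": 172,
}


def count_rows(path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)  # handles quoted fields that span several lines
        next(reader, None)  # skip the header (assumes each file has one)
        return sum(1 for _ in reader)


def check_counts(root="."):
    """Return {relative_path: (expected, actual)} for files whose row count differs."""
    mismatches = {}
    for rel_path, expected in EXPECTED_ROWS.items():
        path = Path(root) / rel_path
        if not path.exists():
            continue  # file not downloaded; nothing to check
        actual = count_rows(path)
        if actual != expected:
            mismatches[rel_path] = (expected, actual)
    return mismatches
```

An empty dictionary from `check_counts()` means every file that was found matches its documented size.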

Citation

If you use these benchmarks, please cite the relevant papers:

@article{wang2025deepevidence,
  title={DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research},
  author={Wang, Zifeng et al.},
  journal={In submission},
  year={2025}
}

@article{wang2025biodsa1k,
  title={BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research},
  author={Wang, Zifeng and Danek, Benjamin and Sun, Jimeng},
  journal={arXiv preprint arXiv:2505.16100},
  year={2025}
}

@article{wang2024llm,
  title={Can Large Language Models Replace Data Scientists in Biomedical Research?},
  author={Wang, Zifeng and Danek, Benjamin and Yang, Ziwei and Chen, Zheng and Sun, Jimeng},
  journal={arXiv preprint arXiv:2410.21591},
  year={2024}
}
