BiomixQA Dataset

Overview

BiomixQA is a curated biomedical question-answering dataset comprising two distinct components:

Multiple Choice Questions (MCQ)
True/False Questions

This dataset has been used to validate the Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework across different Large Language Models (LLMs). Its mix of multiple-choice and true/false questions, together with its coverage of various biomedical concepts, makes it particularly suitable for assessing the performance of the KG-RAG framework.

More broadly, the dataset is designed to support research and development in biomedical natural language processing, knowledge graph reasoning, and question-answering systems.

Dataset Description

Hugging Face Repository: https://huggingface.co/datasets/kg-rag/BiomixQA
Paper: Biomedical knowledge graph-optimized prompt generation for large language models
Point of Contact: Karthik Soman

Dataset Components

1. Multiple Choice Questions (MCQ)

File: mcq_biomix.csv
Size: 306 questions
Format: Each question has five choices with a single correct answer

2. True/False Questions

File: true_false_biomix.csv
Size: 311 questions
Format: Binary (True/False) questions

Access data using Hugging Face

The following snippets show how to load the data in Python:

(i) MCQ data

from datasets import load_dataset

mcq_data = load_dataset("kg-rag/BiomixQA", "mcq")

(ii) True/False data

from datasets import load_dataset

tf_data = load_dataset("kg-rag/BiomixQA", "true_false")
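
This README does not list the split or column names of the loaded data, so it can help to inspect them before writing any evaluation code. The following is a minimal sketch, assuming `load_dataset` returns a `DatasetDict` (a mapping from split name to `Dataset`); the helper names `summarize_splits` and `load_and_summarize` are hypothetical and not part of the dataset or the `datasets` library.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a loaded DatasetDict."""
    return {
        split_name: (split.num_rows, list(split.column_names))
        for split_name, split in dataset_dict.items()
    }


def load_and_summarize(config_name):
    """Load one BiomixQA configuration and summarize its splits (hypothetical helper)."""
    # Imported lazily so summarize_splits can be reused without the library installed.
    from datasets import load_dataset

    # "mcq" and "true_false" are the two configuration names used in this README.
    return summarize_splits(load_dataset("kg-rag/BiomixQA", config_name))


# Example (requires the `datasets` package and network access):
# for config_name in ("mcq", "true_false"):
#     print(config_name, load_and_summarize(config_name))
```

Discovering the columns this way avoids hard-coding names that may differ from what the repository actually ships.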

Potential Uses

Evaluating biomedical question-answering systems
Testing natural language processing models in the biomedical domain
Assessing retrieval capabilities of various RAG (Retrieval-Augmented Generation) frameworks
Supporting research in biomedical ontologies and knowledge graphs
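
For the first use case, scoring a question-answering system reduces to comparing its predictions against the reference answers. The sketch below is illustrative only: the toy `answers` and `predictions` lists are made up, the function names are hypothetical, and the bootstrap resampling is one common way to attach a spread to an accuracy estimate, not necessarily the procedure behind the numbers reported in this README.

```python
import random
import statistics


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def bootstrap_accuracy(predictions, answers, n_resamples=1000, seed=0):
    """Mean and standard deviation of accuracy over resamples drawn with replacement."""
    rng = random.Random(seed)
    indices = list(range(len(answers)))
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(indices, k=len(indices))
        scores.append(
            accuracy([predictions[i] for i in sample], [answers[i] for i in sample])
        )
    return statistics.mean(scores), statistics.stdev(scores)


# Toy data (made up for illustration; not taken from BiomixQA).
answers = ["True", "False", "True", "True", "False"]
predictions = ["True", "False", "False", "True", "False"]

point_estimate = accuracy(predictions, answers)
mean_acc, std_acc = bootstrap_accuracy(predictions, answers)
print(f"accuracy = {point_estimate:.2f}, bootstrap = {mean_acc:.2f} ± {std_acc:.2f}")
```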

Performance Analysis

We analyzed the performance of three LLMs (Llama-2-13b, GPT-3.5-Turbo (0613), and GPT-4) on the BiomixQA dataset, comparing a standard zero-shot prompt-based approach with our Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework.

Performance Summary

Table 1: Accuracy of LLMs on the BiomixQA datasets using the prompt-based (zero-shot) and KG-RAG approaches. For more details, refer to the paper.

Model	True/False (Prompt-based)	True/False (KG-RAG)	MCQ (Prompt-based)	MCQ (KG-RAG)
Llama-2-13b	0.89 ± 0.02	0.94 ± 0.01	0.31 ± 0.03	0.53 ± 0.03
GPT-3.5-Turbo (0613)	0.87 ± 0.02	0.95 ± 0.01	0.63 ± 0.03	0.79 ± 0.02
GPT-4	0.90 ± 0.02	0.95 ± 0.01	0.68 ± 0.03	0.74 ± 0.03
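
The accuracies in Table 1 can be converted into absolute and relative gains to compare how much each model benefits from KG-RAG. The sketch below copies the mean values from the table and ignores the ± spread; the `gains` helper is a hypothetical name used only for this illustration.

```python
# Mean accuracies copied from Table 1 as (prompt_based, kg_rag); the ± spread is ignored.
TABLE_1 = {
    "Llama-2-13b": {"true_false": (0.89, 0.94), "mcq": (0.31, 0.53)},
    "GPT-3.5-Turbo (0613)": {"true_false": (0.87, 0.95), "mcq": (0.63, 0.79)},
    "GPT-4": {"true_false": (0.90, 0.95), "mcq": (0.68, 0.74)},
}


def gains(prompt_based, kg_rag):
    """Return (absolute gain, relative gain) of KG-RAG over the prompt-based approach."""
    absolute = kg_rag - prompt_based
    relative = absolute / prompt_based
    return absolute, relative


for model, per_dataset in TABLE_1.items():
    for dataset_name, (prompt_based, kg_rag) in per_dataset.items():
        absolute, relative = gains(prompt_based, kg_rag)
        print(
            f"{model:22s} {dataset_name:10s} "
            f"+{absolute:.2f} absolute, +{relative:.0%} relative"
        )
```

For Llama-2-13b on the MCQ dataset this gives (0.53 - 0.31) / 0.31 ≈ 0.71, that is, a relative gain of about 71%.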

Key Observations

Consistent Performance Enhancement: The KG-RAG framework improved the accuracy of all three LLMs on both the True/False and MCQ datasets.
Significant Improvement for Llama-2: The KG-RAG framework substantially improved the performance of Llama-2-13b, particularly on the more challenging MCQ dataset, where accuracy rose from 0.31 ± 0.03 to 0.53 ± 0.03, a 71% relative increase.
GPT-4 vs GPT-3.5-Turbo on MCQ: With the KG-RAG framework, GPT-4 (0.74 ± 0.03) scored slightly but statistically significantly lower than GPT-3.5-Turbo (0.79 ± 0.02) on the MCQ dataset. This drop was not observed with the prompt-based approach, where GPT-4 (0.68 ± 0.03) scored higher than GPT-3.5-Turbo (0.63 ± 0.03).
- Statistical significance: t-test, p-value < 0.0001, t-statistic = -47.7, N = 1000
True/False Dataset Performance: All models performed well on the True/False dataset (accuracy of at least 0.87 with the prompt-based approach), and KG-RAG raised accuracy by a further 0.05 to 0.08 depending on the model.
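
As a rough sanity check on the reported significance, a t-statistic can be recomputed from the summary values for GPT-4 and GPT-3.5-Turbo on the MCQ dataset with KG-RAG. This sketch rests on assumptions not stated in this README: that the ± values are standard deviations, that N = 1000 is the number of samples per model, and that the two samples are independent. With equal sample sizes, Welch's statistic coincides with the pooled two-sample statistic, so the choice between them does not matter here.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t-statistic for two independent samples, from summary statistics."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


# GPT-4 vs GPT-3.5-Turbo on MCQ with KG-RAG, using the rounded values from the table.
# Assumption: ± values are standard deviations over N = 1000 samples per model.
t_stat = welch_t(0.74, 0.03, 1000, 0.79, 0.02, 1000)
print(f"t = {t_stat:.1f}")
```

With these rounded inputs the statistic comes out at roughly -44, the same sign and order of magnitude as the reported -47.7; an exact match is not expected because the table values are rounded to two decimals.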

Source Data

SPOKE: A large-scale biomedical knowledge graph consisting of ~40 million biomedical concepts and ~140 million biologically meaningful relationships (Morris et al. 2023).
DisGeNET: Consolidates data about genes and genetic variants linked to human diseases from curated repositories, the GWAS catalog, animal models, and scientific literature (Piñero et al. 2016).
MONDO: Provides the ontological classification of disease entities in the Open Biomedical Ontologies (OBO) format (Vasilevsky et al. 2022).
SemMedDB: Contains semantic predications extracted from PubMed citations (Kilicoglu et al. 2012).
Monarch Initiative: A platform for disease-gene association data (Mungall et al. 2017).
ROBOKOP: A knowledge graph-based system for biomedical data integration and analysis (Bizon et al. 2019).

Citation

If you use this dataset in your research, please cite the following paper:

@article{soman2023biomedical,
  title={Biomedical knowledge graph-enhanced prompt generation for large language models},
  author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta-Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk-Jackson, Angela and others},
  journal={arXiv preprint arXiv:2311.17330},
  year={2023}
}
