Skip to content

GeoGPT-Research-Project/GeoRAG-QA-Dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 

Repository files navigation

GeoRAG-QA: A Benchmark Test Set for Geoscience Information Retrieval

1. Dataset Description

We introduce GeoRAG-QA, a curated test set designed for evaluating retrieval-augmented generation (RAG) systems and information retrieval approaches in the geoscience domain.

GeoRAG-QA was constructed using the Test Set Generation module from RAGAS (Es et al., 2023). The dataset consists of automatically generated QA items based on open-access geoscience publications (see GeoGPT Training Data from Open Access Papers for reference), licensed under CC BY or CC BY-NC. During the generation process, some QA items were incomplete or invalid. After manual review, these flawed entries were removed, yielding a final set of 938 high-quality QA items. The dataset is organized into four categories:

  • 250 single-document fact-based QA pairs
  • 250 single-document inference QA pairs
  • 188 multi-document QA pairs
  • 250 conditional QA pairs

In cases where the generated reference answer was incorrectly labeled as “The answer to the given question is not present in the context”, we re-generated the reference answers using GPT-4o to improve accuracy, consistency, and evaluation reliability. For more details regarding dataset construction and methodology, please refer to the technical report.

To download the dataset, please visit: https://huggingface.co/datasets/GeoGPT-Research-Project/GeoRAG-QA

2. How to download

Using 🤗datasets

import datasets

data=datasets.load_dataset('GeoGPT-Research-Project/GeoRAG-QA')

3. License and Intended Uses

License:GeoRAG-QA is released under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

Copyright: Copyright (c) 2025 Zhejiang Lab. All rights reserved.

Intended Use:The dataset is intended for non-commercial research and educational purposes, particularly for:

  • Evaluating retrieval and RAG pipelines in geoscience,
  • Supporting the development of domain-specific benchmarks,
  • Advancing methods for QA and information access in scientific domains.

It must not be used for commercial purposes, activities that violate laws or regulations, or in any manner inconsistent with the license terms.

4. Contact

For questions, feedback, or contributions, please open an issue in this repository or contact us at 📧 support.geogpt@zhejianglab.org.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published