Support for GPQA: A Graduate-Level Google-Proof Q&A Benchmark Dataset #915

Open
RakshitKhajuria opened this issue Dec 5, 2023 · 0 comments
Assignees
Labels
⭐ Feature Indicates new feature requests

Comments

@RakshitKhajuria
Contributor

GPQA is a newly introduced multiple-choice Q&A benchmark dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts reach only 34% accuracy, despite spending over 30 minutes on average with full access to Google.

https://arxiv.org/pdf/2311.12022.pdf

We present GPQA, a challenging dataset of 448 multiple-choice questions written by
domain experts in biology, physics, and chemistry. We ensure that the questions are
high-quality and extremely difficult: experts who have or are pursuing PhDs in the
corresponding domains reach 65% accuracy (74% when discounting clear mistakes
the experts identified in retrospect), while highly skilled non-expert validators only
reach 34% accuracy, despite spending on average over 30 minutes with unrestricted
access to the web (i.e., the questions are “Google-proof”). The questions are also
difficult for state-of-the-art AI systems, with our strongest GPT-4–based baseline
achieving 39% accuracy. If we are to use future AI systems to help us answer
very hard questions—for example, when developing new scientific knowledge—we
need to develop scalable oversight methods that enable humans to supervise their
outputs, which may be difficult even if the supervisors are themselves skilled and
knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier
AI systems should enable realistic scalable oversight experiments, which we hope
can help devise ways for human experts to reliably get truthful information from AI
systems that surpass human capabilities.
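
To make the request a bit more concrete, here is a minimal sketch of what evaluating on a multiple-choice dataset like this could involve: shuffle the answer options, render a lettered prompt, and compute accuracy against the gold letters. The `MultipleChoiceItem` layout and the `build_prompt` and `accuracy` helpers are hypothetical names used for illustration only; they are not taken from the GPQA release or from this project's API.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    # Hypothetical record layout, assumed for illustration only.
    question: str
    correct_answer: str
    incorrect_answers: list[str]


def build_prompt(item: MultipleChoiceItem, rng: random.Random) -> tuple[str, str]:
    """Shuffle the options and return (prompt_text, letter_of_correct_option)."""
    options = [item.correct_answer] + list(item.incorrect_answers)
    rng.shuffle(options)  # avoid the correct answer always sitting in slot A
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [item.question]
    lines += [f"({letter}) {text}" for letter, text in zip(letters, options)]
    gold = letters[options.index(item.correct_answer)]
    return "\n".join(lines), gold


def accuracy(predictions: list[str], gold_letters: list[str]) -> float:
    """Fraction of predicted letters that match the gold letters (case-insensitive)."""
    if len(predictions) != len(gold_letters):
        raise ValueError("predictions and gold_letters must have the same length")
    if not gold_letters:
        return 0.0
    correct = sum(p.strip().upper() == g for p, g in zip(predictions, gold_letters))
    return correct / len(gold_letters)
```

Passing a seeded `random.Random` keeps the option order reproducible across runs, which makes it easier to compare a model's accuracy against the figures quoted above.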

@RakshitKhajuria RakshitKhajuria added the ⭐ Feature Indicates new feature requests label Dec 5, 2023
@ArshaanNazir ArshaanNazir added the v2.1.0 Issue or request to be done in v2.1.0 release label Dec 5, 2023
@chakravarthik27 chakravarthik27 removed the v2.1.0 Issue or request to be done in v2.1.0 release label Mar 3, 2024