PaperAsk

A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
Yutao Wu · Xiao Liu · Yunhao Feng · Jiale Ding · Xingjun Ma
Proceedings of the ACM Web Conference 2026 (WWW '26)



Overview

PaperAsk evaluates LLMs on four scholarly tasks under realistic web-interface usage, where search operations are opaque to the user. This repository currently provides the benchmark test cases and sample model outputs; the evaluation code and the reliability classifier will be released soon.

Failure Examples

Across the benchmark we observe distinct, model-specific failure modes. Two representative cases:

  • GPT-5 (content extraction): withholds a definitive answer when asked to count the tables in a specific PDF, falling back to "as commonly cited" rather than reading the source (see Example_GPT5.png).
  • Gemini-2.5-Flash (citation retrieval): returns a fluent BibTeX entry with a fabricated arXiv ID (2405.15875) that actually points to an unrelated paper (see Example_error_Gemini.png).

Tasks

The benchmark covers four task families, mirroring the paper's structure. Each lives in its own folder under test_cases/.

1. Citation Retrieval — test_cases/citation_retrieval/

Given a list of paper titles, return the correct BibTeX entry for each. Stress-tests reliability as the number of references grows.

File                        Papers per query
batch_3_evaluation.json     3
batch_5_evaluation.json     5
batch_10_evaluation.json    10
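
For a concrete sense of the task format, here is a minimal loading sketch. It assumes each batch file holds a list of query records, each with a `titles` list; the actual JSON schema in test_cases/citation_retrieval/ may differ.

```python
import json

# Hypothetical loader for one citation-retrieval batch file.
# The record structure ("titles") is an assumption, not the
# benchmark's documented schema.
with open("test_cases/citation_retrieval/batch_5_evaluation.json") as f:
    batch = json.load(f)

for case in batch:
    titles = case.get("titles", [])
    prompt = ("Return the correct BibTeX entry for each paper:\n"
              + "\n".join(f"- {t}" for t in titles))
    # Send `prompt` to the model under test, then check the returned
    # BibTeX fields (title, authors, venue, arXiv ID) against ground truth.
```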

2. Content Extraction — test_cases/content_extraction/

Given an arXiv paper URL, extract structured fields (title, last sentence of introduction, figure/table counts, figure captions, …) and return a Python dict.

  • task_ContentExtraction.json — multi-domain papers with field-level ground truth.
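
For illustration, a response might look like the dict below. The key names are placeholders; the fields actually scored are defined by the ground truth in task_ContentExtraction.json.

```python
# Illustrative shape of a content-extraction answer. Key names are
# placeholders, not the benchmark's canonical schema.
answer = {
    "title": "Example Paper Title",
    "intro_last_sentence": "Finally, we release our code and data.",
    "num_figures": 4,
    "num_tables": 2,
    "figure_captions": [
        "Figure 1: System overview.",
        "Figure 2: Main results on the benchmark.",
    ],
}
```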

3. Paper Discovery — test_cases/open_domain_QA/

Given a research topic and a target month/year, list all relevant arXiv papers published in that window. Measures recall over realistic literature-search queries.

  • QA_evaluation.json — topical queries with ground-truth paper sets.
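
Recall here is the usual set overlap between the papers a model returns and the ground-truth set. A minimal sketch, assuming papers are matched by normalized arXiv ID:

```python
def recall(predicted_ids: set[str], gold_ids: set[str]) -> float:
    """Fraction of the ground-truth papers that the model returned."""
    if not gold_ids:
        return 0.0
    return len(predicted_ids & gold_ids) / len(gold_ids)

# The model found 2 of the 3 ground-truth papers: recall = 0.667
print(recall({"2405.01234", "2405.04321"},
             {"2405.01234", "2405.04321", "2405.09999"}))
```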

4. Claim Verification — test_cases/open_book_CF/

Given a set of claims, verify each against the literature. Same batch-size scaling as citation retrieval.

File                        Claims per query
batch_3_evaluation.json     3
batch_5_evaluation.json     5
batch_10_evaluation.json    10
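
The three files differ only in how many claims are packed into one query; a small (hypothetical, not part of the repo) helper illustrates the batching:

```python
def batch_claims(claims: list[str], size: int) -> list[list[str]]:
    """Split a flat claim list into query batches of 3, 5, or 10,
    mirroring the batch_{3,5,10}_evaluation.json files."""
    return [claims[i:i + size] for i in range(0, len(claims), size)]

claims = [f"claim {i}" for i in range(10)]
assert len(batch_claims(claims, 5)) == 2   # two queries of five claims each
```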

Sample Outputs

result_gpt5_QA/ contains GPT-5 responses on the paper-discovery task, annotated with success, recall, and a failure reason field — useful as a reference for both output format and failure modes.
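
As a rough illustration of how these annotations could be aggregated (the `success` and `recall` field names come from the description above, but the on-disk layout of result_gpt5_QA/ is an assumption):

```python
import json
import pathlib

# Hypothetical aggregation over the GPT-5 sample outputs; assumes one
# JSON record per file with "success" (bool) and "recall" (float) fields.
records = [json.loads(p.read_text())
           for p in pathlib.Path("result_gpt5_QA").glob("*.json")]
if records:
    n = len(records)
    print(f"success rate: {sum(r.get('success', False) for r in records) / n:.2%}")
    print(f"mean recall:  {sum(r.get('recall', 0.0) for r in records) / n:.3f}")
```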

Repository Structure

PaperAsk/
├── test_cases/
│   ├── citation_retrieval/         # Task 1
│   ├── content_extraction/         # Task 2
│   ├── open_domain_QA/             # Task 3 (paper discovery)
│   └── open_book_CF/               # Task 4 (claim verification)
├── result_gpt5_QA/                 # GPT-5 sample outputs (Task 3)
├── Example_GPT5.png
└── Example_error_Gemini.png

Release Status

  • Benchmark test cases (4 tasks)
  • Sample model outputs (GPT-5, paper discovery)
  • Evaluation scripts — coming soon
  • Reliability classifier weights and training code — coming soon

Citation

If you use PaperAsk, please cite:

@inproceedings{wu2026paperask,
  author    = {Wu, Yutao and Liu, Xiao and Feng, Yunhao and Ding, Jiale and Ma, Xingjun},
  title     = {PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading},
  year      = {2026},
  isbn      = {9798400723070},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3774904.3792597},
  doi       = {10.1145/3774904.3792597},
  booktitle = {Proceedings of the ACM Web Conference 2026},
  pages     = {2330--2338},
  numpages  = {9},
  keywords  = {large language models, benchmark, llm reliability, search-augmented generation, scholarly information retrieval, hallucination},
  location  = {United Arab Emirates},
  series    = {WWW '26}
}
