A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
Yutao Wu · Xiao Liu · Yunhao Feng · Jiale Ding · Xingjun Ma
Proceedings of the ACM Web Conference 2026 (WWW '26)
PaperAsk evaluates LLMs on four scholarly tasks under realistic web-interface usage, where search operations are opaque to the user. The repository currently provides the benchmark test cases and sample model outputs; the evaluation code and the reliability classifier will be released soon.
Across the benchmark we observe distinct, model-specific failure modes. Two representative cases are included in the repository root: Example_GPT5.png and Example_error_Gemini.png.
The benchmark covers four task families, mirroring the paper's structure. Each lives in its own folder under test_cases/.
Given a list of paper titles, return the correct BibTeX entry for each. Stress-tests reliability as the number of references grows.
| File | Papers per query |
|---|---|
| batch_3_evaluation.json | 3 |
| batch_5_evaluation.json | 5 |
| batch_10_evaluation.json | 10 |
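
No evaluation code ships yet, so as a rough, unofficial sketch of how a citation-retrieval batch might be scored, the helper below checks whether each queried title appears in the model's BibTeX output, using the bibtexparser package (v1 API). The loose title-matching criterion is an illustration, not the paper's metric.

```python
import bibtexparser  # pip install bibtexparser

def titles_found(model_bibtex: str, queried_titles: list[str]) -> list[bool]:
    """For each queried title, report whether the model's BibTeX output
    contains an entry with that title (case/whitespace-insensitive)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    db = bibtexparser.loads(model_bibtex)
    returned = {norm(entry.get("title", "")) for entry in db.entries}
    return [norm(t) in returned for t in queried_titles]
```

A full scorer would also check author lists, venue, and year against ground truth; title presence alone overestimates reliability.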
Given an arXiv paper URL, extract structured fields (title, last sentence of introduction, figure/table counts, figure captions, …) and return a Python dict.
task_ContentExtraction.json: multi-domain papers with field-level ground truth.
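
The exact schema is defined by the ground truth in task_ContentExtraction.json; purely as an illustration, a model's returned dict covering the fields named above might look like this (the key names are assumptions, not the benchmark's actual schema):

```python
# Hypothetical answer shape; key names are illustrative assumptions,
# not the benchmark's ground-truth schema.
extraction = {
    "title": "...",                # paper title
    "intro_last_sentence": "...",  # last sentence of the introduction
    "num_figures": 2,              # figure count
    "num_tables": 1,               # table count
    "figure_captions": ["Figure 1: ...", "Figure 2: ..."],
}
```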
Given a research topic and a target month/year, list all relevant arXiv papers published in that window. Measures recall over realistic literature-search queries.
QA_evaluation.json: topical queries with ground-truth paper sets.
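
Scoring here reduces to set overlap against the ground-truth paper set. A minimal recall computation, assuming papers are keyed by canonical identifiers such as arXiv IDs, looks like:

```python
def recall(predicted: set[str], gold: set[str]) -> float:
    """Fraction of ground-truth papers the model retrieved.

    Assumes both sets use the same canonical identifiers (e.g. arXiv
    IDs); identifier normalization is out of scope for this sketch.
    """
    if not gold:
        return 1.0  # vacuously perfect on an empty ground-truth set
    return len(predicted & gold) / len(gold)
```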
Given a set of claims, verify each against the literature. Same batch-size scaling as citation retrieval.
| File | Claims per query |
|---|---|
| batch_3_evaluation.json | 3 |
| batch_5_evaluation.json | 5 |
| batch_10_evaluation.json | 10 |
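
As with citation retrieval, the official scorer is not yet released. A minimal sketch of per-batch verdict accuracy follows; the verdict vocabulary it assumes is for illustration only.

```python
def verdict_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of claims whose predicted verdict matches the gold label.

    The label set (e.g. "supported" / "refuted") is an assumption; the
    benchmark's actual annotation scheme may differ.
    """
    if len(predicted) != len(gold):
        raise ValueError("expected one verdict per claim")
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predicted, gold))
    return hits / len(gold)
```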
result_gpt5_QA/ contains GPT-5 responses on the paper-discovery task, annotated with success, recall, and a failure reason field — useful as a reference for both output format and failure modes.
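
To get a feel for the annotations, the records can be aggregated into a mean recall and a failure-reason histogram. The sketch below assumes one JSON record per file and the key spellings success, recall, and failure_reason, which may differ from the released files.

```python
import json
from collections import Counter
from pathlib import Path

def summarize(results_dir: str = "result_gpt5_QA") -> tuple[float, Counter]:
    """Mean recall and failure-reason counts over the annotated outputs.

    File layout and key names ("success", "recall", "failure_reason")
    are assumptions based on the fields described above.
    """
    records = [json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")]
    mean_recall = sum(r["recall"] for r in records) / len(records)
    failures = Counter(r.get("failure_reason", "unknown")
                       for r in records if not r["success"])
    return mean_recall, failures
```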
PaperAsk/
├── test_cases/
│ ├── citation_retrieval/ # Task 1
│ ├── content_extraction/ # Task 2
│ ├── open_domain_QA/ # Task 3 (paper discovery)
│ └── open_book_CF/ # Task 4 (claim verification)
├── result_gpt5_QA/ # GPT-5 sample outputs (Task 3)
├── Example_GPT5.png
└── Example_error_Gemini.png
- Benchmark test cases (4 tasks)
- Sample model outputs (GPT-5, paper discovery)
- Evaluation scripts — coming soon
- Reliability classifier weights and training code — coming soon
If you use PaperAsk, please cite:
@inproceedings{wu2026paperask,
  author    = {Wu, Yutao and Liu, Xiao and Feng, Yunhao and Ding, Jiale and Ma, Xingjun},
  title     = {PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading},
  year      = {2026},
  isbn      = {9798400723070},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3774904.3792597},
  doi       = {10.1145/3774904.3792597},
  booktitle = {Proceedings of the ACM Web Conference 2026},
  pages     = {2330--2338},
  numpages  = {9},
  keywords  = {large language models, benchmark, llm reliability, search-augmented generation, scholarly information retrieval, hallucination},
  location  = {United Arab Emirates},
  series    = {WWW '26}
}
