A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
Yutao Wu · Xiao Liu · Yunhao Feng · Jiale Ding · Xingjun Ma
Proceedings of the ACM Web Conference 2026 (WWW '26)
PaperAsk evaluates LLMs on four scholarly tasks under realistic web-interface usage, where search operations are opaque to the user. The repository currently provides the benchmark test cases and sample model outputs; the evaluation code and the reliability classifier will be released soon.
Across the benchmark we observe distinct, model-specific failure modes. Two representative cases are included in the repository root: Example_GPT5.png and Example_error_Gemini.png.
The benchmark covers four task families, mirroring the paper's structure. Each lives in its own folder under test_cases/.
Given a list of paper titles, return the correct BibTeX entry for each. Stress-tests reliability as the number of references grows.
| File | Papers per query |
|---|---|
| batch_3_evaluation.json | 3 |
| batch_5_evaluation.json | 5 |
| batch_10_evaluation.json | 10 |
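
No evaluation code ships yet, so as a rough, unofficial sketch of how a citation-retrieval batch might be scored, the helper below checks whether each queried title appears in the model's BibTeX output, using the bibtexparser package (v1 API). The loose title-matching criterion is an illustration, not the paper's metric.

```python
import bibtexparser  # pip install bibtexparser

def titles_found(model_bibtex: str, queried_titles: list[str]) -> list[bool]:
    """For each queried title, report whether the model's BibTeX output
    contains an entry with that title (case/whitespace-insensitive)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    db = bibtexparser.loads(model_bibtex)
    returned = {norm(entry.get("title", "")) for entry in db.entries}
    return [norm(t) in returned for t in queried_titles]
```

A full scorer would also check author lists, venue, and year against ground truth; title presence alone overestimates reliability.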
Given an arXiv paper URL, extract structured fields (title, last sentence of introduction, figure/table counts, figure captions, …) and return a Python dict.
task_ContentExtraction.json: multi-domain papers with field-level ground truth.
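
The exact schema is defined by the ground truth in task_ContentExtraction.json; purely as an illustration, a model's returned dict covering the fields named above might look like this (the key names are assumptions, not the benchmark's actual schema):

```python
# Hypothetical answer shape; key names are illustrative assumptions,
# not the benchmark's ground-truth schema.
extraction = {
    "title": "...",                # paper title
    "intro_last_sentence": "...",  # last sentence of the introduction
    "num_figures": 2,              # figure count
    "num_tables": 1,               # table count
    "figure_captions": ["Figure 1: ...", "Figure 2: ..."],
}
```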
Given a research topic and a target month/year, list all relevant arXiv papers published in that window. Measures recall over realistic literature-search queries.
QA_evaluation.json: topical queries with ground-truth paper sets.
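
Scoring here reduces to set overlap against the ground-truth paper set. A minimal recall computation, assuming papers are keyed by canonical identifiers such as arXiv IDs, looks like:

```python
def recall(predicted: set[str], gold: set[str]) -> float:
    """Fraction of ground-truth papers the model retrieved.

    Assumes both sets use the same canonical identifiers (e.g. arXiv
    IDs); identifier normalization is out of scope for this sketch.
    """
    if not gold:
        return 1.0  # vacuously perfect on an empty ground-truth set
    return len(predicted & gold) / len(gold)
```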
Given a set of claims, verify each against the literature. Same batch-size scaling as citation retrieval.
| File | Claims per query |
|---|---|
| batch_3_evaluation.json | 3 |
| batch_5_evaluation.json | 5 |
| batch_10_evaluation.json | 10 |
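
As with citation retrieval, the official scorer is not yet released. A minimal sketch of per-batch verdict accuracy follows; the verdict vocabulary it assumes is for illustration only.

```python
def verdict_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of claims whose predicted verdict matches the gold label.

    The label set (e.g. "supported" / "refuted") is an assumption; the
    benchmark's actual annotation scheme may differ.
    """
    if len(predicted) != len(gold):
        raise ValueError("expected one verdict per claim")
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predicted, gold))
    return hits / len(gold)
```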
result_gpt5_QA/ contains GPT-5 responses on the paper-discovery task, annotated with success, recall, and a failure reason field — useful as a reference for both output format and failure modes.
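
To get a feel for the annotations, the records can be aggregated into a mean recall and a failure-reason histogram. The sketch below assumes one JSON record per file and the key spellings success, recall, and failure_reason, which may differ from the released files.

```python
import json
from collections import Counter
from pathlib import Path

def summarize(results_dir: str = "result_gpt5_QA") -> tuple[float, Counter]:
    """Mean recall and failure-reason counts over the annotated outputs.

    File layout and key names ("success", "recall", "failure_reason")
    are assumptions based on the fields described above.
    """
    records = [json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")]
    mean_recall = sum(r["recall"] for r in records) / len(records)
    failures = Counter(r.get("failure_reason", "unknown")
                       for r in records if not r["success"])
    return mean_recall, failures
```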
PaperAsk/
├── test_cases/
│ ├── citation_retrieval/ # Task 1
│ ├── content_extraction/ # Task 2
│ ├── open_domain_QA/ # Task 3 (paper discovery)
│ └── open_book_CF/ # Task 4 (claim verification)
├── result_gpt5_QA/ # GPT-5 sample outputs (Task 3)
├── Example_GPT5.png
└── Example_error_Gemini.png
- Benchmark test cases (4 tasks)
- Sample model outputs (GPT-5, paper discovery)
- Evaluation scripts — coming soon
- Reliability classifier weights and training code — coming soon
If you use PaperAsk, please cite:
@inproceedings{wu2026paperask,
  author    = {Wu, Yutao and Liu, Xiao and Feng, Yunhao and Ding, Jiale and Ma, Xingjun},
  title     = {PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading},
  year      = {2026},
  isbn      = {9798400723070},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3774904.3792597},
  doi       = {10.1145/3774904.3792597},
  booktitle = {Proceedings of the ACM Web Conference 2026},
  pages     = {2330--2338},
  numpages  = {9},
  keywords  = {large language models, benchmark, llm reliability, search-augmented generation, scholarly information retrieval, hallucination},
  location  = {United Arab Emirates},
  series    = {WWW '26}
}
