Please add more related work on reasoning evaluation benchmarks

Hi,

Thanks for sharing the survey paper on evaluating large language models.

We have several more evaluation papers on logical reasoning. Please consider adding them to your arXiv paper if you find them relevant. Thanks a lot.


### Out-of-Distribution Logical Reasoning Evaluation and Prompt Augmentation for Enhancing OOD Logical Reasoning
We present a systematic out-of-distribution evaluation on logical reasoning tasks. We introduce three new, more robust logical reasoning datasets, ReClor-Plus, LogiQA-Plus and LogiQAv2-Plus, which are constructed from ReClor, LogiQA and LogiQAv2 by changing the order and form of the answer options. We found that simply using chain-of-thought prompting does not improve models' performance in the out-of-distribution scenario, while using our AMR-based logic-driven data augmentation to augment the prompt can improve large language models' performance on out-of-distribution logical reasoning tasks. The three datasets have been collected in OpenAI/Evals.

[LLM@IJCAI 2023] "A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks" [[Paper link](https://arxiv.org/abs/2310.09430v1)]

The full version, titled "Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning", has been accepted at ICONIP 2024. [[Paper link](https://arxiv.org/abs/2310.09430)] [[Source code](https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning)] [[OpenAI/Evals PR](https://github.com/openai/evals/pull/648)]
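To make the "change of option order" idea concrete, here is a minimal sketch of one way to shuffle the answer options of a multiple-choice item while keeping the gold label aligned. It is only an illustration, not the code used to build the datasets: the `MCQItem` container and the `shuffle_options` helper are hypothetical names, and changes to the form of the options are not covered.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice reasoning question."""
    context: str
    question: str
    options: List[str]
    label: int  # index of the correct option in `options`


def shuffle_options(item: MCQItem, rng: random.Random) -> MCQItem:
    """Return a copy of `item` with its options reordered and the label remapped."""
    # `order[k]` is the original index of the option placed at new position k.
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = [item.options[i] for i in order]
    # The correct option now sits wherever its original index landed in `order`.
    new_label = order.index(item.label)
    return MCQItem(item.context, item.question, new_options, new_label)


if __name__ == "__main__":
    original = MCQItem(
        context="All birds can fly. Tweety is a bird.",
        question="Which statement follows?",
        options=["Tweety is a fish.", "Tweety can fly.", "Tweety cannot fly.", "No bird can fly."],
        label=1,
    )
    perturbed = shuffle_options(original, random.Random(0))
    # The text of the correct answer is unchanged, only its position may differ.
    print(perturbed.options[perturbed.label])
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which matters when the shuffled set is meant to be a fixed benchmark.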

### An Empirical Study on Out-Of-Distribution Multi-Step Logical Reasoning
We find that pre-trained language models are not robust on multi-step logical reasoning tasks, and one of the main reasons is the limited amount of training data for deeper multi-step logical reasoning. Therefore, we present a large dataset for deeper multi-step logical reasoning named PARARULE-Plus. The dataset has also been collected in OpenAI/Evals.

[IJCLR-NeSy 2022] "Multi-Step Deductive Reasoning Over Natural Language: An Empirical Study on Out-of-Distribution Generalisation" [[Paper link](https://ceur-ws.org/Vol-3212/paper15.pdf)] [[Source code](https://github.com/Strong-AI-Lab/Multi-Step-Deductive-Reasoning-Over-Natural-Language)] [[OpenAI/Evals PR](https://github.com/openai/evals/pull/651)]
