
Reproduction and Revival of the Argument Reasoning Comprehension Task

Article

Authors

João António Rodrigues, Ruben Branco, João Ricardo Silva, António Branco

Paper

Reproduction and Revival of the Argument Reasoning Comprehension Task

Abstract

Reproduction of scientific findings is essential for scientific development across all disciplines, and reproducing the results of previous work is a basic requirement for validating the hypotheses and conclusions it puts forward. This paper reports on the scientific reproduction of several systems addressing the Argument Reasoning Comprehension Task of SemEval2018.
Given a recent publication that pointed out spurious statistical cues in the data set used in the shared task, and that produced a revised version of it, we also evaluated the reproduced systems with this new data set. The exercise reported here shows that, in general, the reproduction of these systems is successful, with scores in line with those reported in SemEval2018. However, the performance scores are worse than those, and even fall below the random baseline, when the reproduced systems are run over the revised data set, which has been expunged of data artifacts. This demonstrates that this task is actually a much harder challenge than could have been perceived from the inflated, close to human-level performance scores obtained with the data set used in SemEval2018. This calls for a revival of this task, as there is much room for improvement until systems may come close to the upper bound provided by human performance.

Data sets

Argument Reasoning Comprehension Task [Paper] [Data set]

Probing Neural Network Comprehension of Natural Language Arguments [Paper] [Revised Data set]

Data & Results

The original data set was split into 1,210 training instances, 316 development instances, and 444 test instances; the revised data set was split into 2,420 training, 632 development, and 888 test instances, as presented in Table 1.
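As an illustration, these split sizes can be checked by loading the released files. This is a minimal sketch, assuming the tab-separated format of the original ARCT release; the local file paths used below are assumptions about the repository layout, not part of this work.

```python
# Minimal sketch: verify the split sizes reported above.
# Paths and file names are assumptions about the local layout;
# the ARCT release distributes tab-separated files with a header row.
import pandas as pd

SPLITS = {
    "train": "data/train-full.txt",
    "dev": "data/dev-full.txt",
    "test": "data/test-full.txt",
}

for name, path in SPLITS.items():
    # Each row is one instance: a claim, a reason, and two candidate warrants.
    df = pd.read_csv(path, sep="\t")
    print(f"{name}: {len(df)} instances")
```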

The ranking and scores of the systems submitted to the ARCT task are presented in Table 2. A survey of the system description papers and how they stand with respect to the reproducibility indicators is also presented in Table 2.

The results from the re-evaluation of the six systems reproduced with the revised data set are presented in Table 3.
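Since ARCT is a binary choice between two candidate warrants, the random baseline is 50% accuracy, which is the reference point against which the re-evaluation scores in Table 3 fall short. The sketch below shows this comparison; the prediction and label encoding (0 or 1 per instance) is an assumption for illustration, not the repository's actual evaluation script.

```python
# Minimal sketch: score a system's binary warrant choices against gold
# labels and compare to the random baseline of the two-way ARCT task.
# The 0/1 encoding per instance is an assumption for illustration.
RANDOM_BASELINE = 0.5  # two equally likely warrants per instance

def accuracy(predictions, gold):
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = [0, 1, 1, 0]  # hypothetical system output
labels = [0, 1, 0, 1]  # hypothetical gold labels
acc = accuracy(preds, labels)
status = "above" if acc > RANDOM_BASELINE else "at or below"
print(f"accuracy = {acc:.3f} ({status} random baseline)")
```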

Reproduction survey

Reproduction scores

System reports

For each reproduction attempt, a report with technical details can be found at:
