Open
Description
When comparing multiple systems, it would be useful to perform the most appropriate test to determine if the differences in scores for each metric are statistically significant. We could probably reuse some of the code from https://github.com/rtmdrr/testSignificanceNLP