This testsuite gives developers the opportunity to compare evaluation metrics against both human judgement and a selection of commonly used NLG and STS metrics with respect to specific linguistic phenomena.
Its main objectives are:
- To help users develop new metrics and improve existing ones by providing them with helpful information such as:
  - Statistics about the annotated scores for each phenomenon
  - Human judgement and scores for a selection of NLG and STS metrics for individual test cases
  - Average scores for each phenomenon
  - Correlation matrices, including the correlations between all metrics and human judgement, for each phenomenon and overall
- To make contributing to the CheckList simple
Note that there are two requirements files in this repository, `requirements.txt` and `requirements_add.txt`.
If you only wish to run the testsuite or add a new metric, you just need to install the packages listed in `requirements.txt` by typing:
pip install -r requirements.txt
You also have to install the `amr-utils` package.
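If `amr-utils` is not available on PyPI, it can usually be installed directly from its Git repository; the URL below is an assumption and may need to be adjusted to the fork the testsuite actually relies on:

```bash
# Assumed repository URL -- adjust if the testsuite depends on a different amr-utils fork
pip install git+https://github.com/ablodge/amr-utils.git
```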
If you wish to add new test cases that have not yet been evaluated by the metrics used in the testsuite, you will have to install the packages listed in `requirements_add.txt` with the following command:
pip install -r requirements_add.txt
Furthermore, you will need to download the 300-dimensional GloVe word embeddings in order to test with S2match. They have to be placed in a directory named `vectors` in the parent directory.
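As a rough sketch of that step, the embeddings can be fetched from the Stanford NLP servers as shown below; which GloVe release the suite expects is not specified here, so the 840B Common Crawl file and the relative path (run from the `scripts` directory) are assumptions:

```bash
# Download 300-dimensional GloVe vectors and unpack them into ../vectors
# (glove.840B.300d is an assumption -- swap in the release your setup requires)
mkdir -p ../vectors
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip -d ../vectors
```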
If you just want to run the testsuite with the available metrics, you can simply go into the directory `scripts` and execute the testsuite in the terminal by running the command:
python3 test_suite.py -m
You will get a short overview of the testsuite and the overall results on the console. The results of the testsuite for each phenomenon will be written into text files in the directory `Results`.
In order to give you an impression of the testsuite's output, the directory `Results` already contains the testsuite's results for the metrics listed below.
Note that you can decide which metrics to include in a run of the testsuite by commenting out the ones you don't want to consider with a '#' in the `metrics.txt` file.
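For illustration, an excerpt of such a file might look like the following; the metric names are placeholders and must match those actually used in `metrics.txt` (a line starting with '#' is skipped):

```
BLEU
chrF++
Meteor
# Smatch
S2match
```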
If you want to test your own metric, you will first need to compute the scores for each test case. You can find the data you need in `data/content_test_cases.json`. If you like, you can use `get_your_scores.py` as a template to implement your metric. It will output a JSON file with the test cases' IDs as keys and a dictionary containing the metric's score as the value. You can even test multiple metrics at once by adding multiple scores to the score dictionary of each test case. Similar to the `metrics.txt` file, you can choose which of these metrics you want to test by writing them into the `my_metrics.txt` file.
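A minimal sketch of such a script is given below, assuming that `data/content_test_cases.json` maps test-case IDs to entries containing the two texts to be compared; the field names `sentence_1`/`sentence_2`, the toy metric, and the file name `my_scores.json` are placeholders, not the actual keys or names used in the repository:

```python
import json

# Toy metric used purely for illustration: Jaccard overlap of the two token sets.
def my_metric(sent_a, sent_b):
    tokens_a, tokens_b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)

with open("../data/content_test_cases.json", encoding="utf-8") as f:
    test_cases = json.load(f)

# Expected output format: {test_case_id: {metric_name: score, ...}, ...}
scores = {}
for case_id, case in test_cases.items():
    # NOTE: the field names below are assumptions -- adapt them to the actual
    # structure of content_test_cases.json.
    sent_a, sent_b = case["sentence_1"], case["sentence_2"]
    scores[case_id] = {"my_metric": my_metric(sent_a, sent_b)}

with open("my_scores.json", "w", encoding="utf-8") as f:
    json.dump(scores, f, indent=2)
```

The metric name used in the score dictionary (`my_metric` here) is what you would then list in `my_metrics.txt`.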
Then run the following command in the `scripts` directory:
python3 test_suite.py path_to_your_JSON
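For example, if the sketch above had written its scores to `my_scores.json` inside the `scripts` directory, the call would be (the path is just a placeholder for your own file):

```
python3 test_suite.py my_scores.json
```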
The data present in the testsuite was sampled from the following data sets:
- STS (Semantic Textual Similarity) Data Sets (2012-2017)
- SICK (Sentences Involving Compositional Knowledge) Data Set
The metrics that are employed in the testsuite are the following:
- BERTScore
- BLEU
- chrF++
- Meteor
- S2match
- Smatch
- WLK
- WWLK
- GraCo (four variants)