Description
Hi, I have observed a situation in the SARI implementation where a system output can score below 100 even when it is identical to the reference (in the single-reference case).
Basically, if the reference does not introduce any tokens that are absent from the source, the unigram add score is 0.00, while the add scores for the higher n-gram orders (n > 1) can still be 100.
Take the following example:
from easse.sari import corpus_sari

sources = ["Shu Abe (born June 7 1984) is a former Japanese football player."]
predictions = ["Shu Abe (born June 7 1984) is a Japanese football player."]
references = [["Shu Abe (born June 7 1984) is a Japanese football player."]]
sari_score = corpus_sari(sources, predictions, references)
print(sari_score)
# prints 91.66666666666667
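In this case, the add score will be 75.0: the unigram add score is 0 because there are no new unigrams (a consequence of the "if sys_total > 0:" checks in compute_precision_recall_f1()), but there are technically new bigrams, trigrams, and 4-grams around the location of the deleted word (["a japanese", "a japanese football", "is a japanese"], etc.), and those orders each score 100. Averaging over the four orders gives (0 + 100 + 100 + 100) / 4 = 75.0, and with keep and delete scores of 100 the overall result is (75 + 100 + 100) / 3 = 91.67.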
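To make the effect concrete, here is a minimal sketch in plain Python that reproduces the per-order add scores for the example. It is not the library's code: the helper names (ngrams, added_ngrams, toy_add_score) are hypothetical, tokenization is simplified to lowercasing and whitespace splitting, and it assumes the add score for each order is an F1 over the n-grams added relative to the source, with an empty set scored as 0.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def added_ngrams(source, output, n):
    """N-grams of order n that occur in `output` but not in `source`."""
    src = source.lower().split()  # simplified tokenization (assumption)
    out = output.lower().split()
    return ngrams(out, n) - ngrams(src, n)


def toy_add_score(source, prediction, reference, n):
    """Illustrative add F1 (0-100) for one order; 0 when nothing is added."""
    sys_added = added_ngrams(source, prediction, n)
    ref_added = added_ngrams(source, reference, n)
    correct = sys_added & ref_added
    # Guarded divisions: an empty set yields 0 rather than an error.
    precision = len(correct) / len(sys_added) if sys_added else 0.0
    recall = len(correct) / len(ref_added) if ref_added else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


source = "Shu Abe (born June 7 1984) is a former Japanese football player."
prediction = "Shu Abe (born June 7 1984) is a Japanese football player."
reference = prediction  # identical to the system output

per_order = [toy_add_score(source, prediction, reference, n) for n in range(1, 5)]
add_score = sum(per_order) / len(per_order)

print(added_ngrams(source, prediction, 1))  # set(): no new unigrams
print(added_ngrams(source, prediction, 2))  # {('a', 'japanese')}
print(per_order)                            # [0.0, 100.0, 100.0, 100.0]
print(add_score)                            # 75.0
```

Under these assumptions, the only order that drags the score down is n = 1, where both the system and the reference add nothing, so the guarded division falls back to 0 instead of treating "nothing to add, nothing added" as a perfect match.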
I am just curious whether this is the expected behaviour, or whether a definitive 0.00 or 100.0 for the add score would be more desirable when the output matches the reference exactly.
Thanks in advance for any insight.