It’s not as if the author didn’t provide details on how to calculate scores. Does the evaluation only generate a dictionary of results?