[EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code
Updated Jun 16, 2024 · Python
Success and Failure Linguistic Simplification Annotation 💃
Multidimensional Evaluation for Text Style Transfer Using ChatGPT. Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer (HumEval 2022)
MONSERRATE is a dataset created specifically to evaluate Question Generation systems. It has, on average, 26 questions associated with each source sentence, aiming to serve as an "exhaustive" reference.
Requirements-to-Running-Code benchmark for AI/LLM systems and frameworks—builds, runs, and auto-scores apps across functional and non-functional metrics.