MATEval: A Multi-Agent Text Evaluation Framework
This paper has been ACCEPTED as a LONG PAPER presentation by DASFAA 2024 Industrial Track. You can currently access it through the following link: MATEval
In the Alipay business scenario, we need to assess open-ended story texts generated by large language models(LLMs). For this specific business context, we have proposed a multi-agent evaluation framework called "MATEval". Within this framework, we have integrated strategies of self-reflection and Chain-of-Thought (CoT), and we have also introduced a feedback mechanism at the end of each round of discussion. This mechanism evaluates the quality of each discussion round, facilitating consensus. Ultimately, we require a summarizer to consolidate the results of the entire discussion process. We provide two formats of output: one in the form of Q&A pairs, and the other as text reports that are easy for humans to read. Extensive experiments demonstrate that MATEval's evaluation results on two classic story datasets are more aligned with human preferences compared to existing methods.
In the MATEval framework, we select OpenAI’s GPT-4 as our LLMs due to its outstanding performance and API accessibility. We set the temperature parameter to 0 for result reproducibility. GPT-4’s easy access facilitated effective and coherent multi-agent interactions in our experiments.
