This repository contains the code and package for the paper ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation.
Evaluating personalized text generated by large language models (LLMs) is challenging, since only the LLM user, i.e., the prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their supporting evidence from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style, two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
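To make the pipeline concrete, below is a minimal sketch of its three stages. The helper functions are naive placeholders for what are LLM calls in ExPerT, and the F1-style aggregation is an illustrative assumption, not the paper's exact formula:

def extract_aspects(text: str) -> list[dict]:
    # Placeholder: ExPerT prompts an LLM to extract atomic aspects and the
    # evidence supporting them; here we simply split on sentences.
    return [{"aspect": s.strip(), "evidence": s.strip()}
            for s in text.split(".") if s.strip()]

def aspects_match(g: dict, r: dict) -> bool:
    # Placeholder: ExPerT asks an LLM whether two aspects correspond;
    # here we use naive word overlap instead.
    g_words = set(g["aspect"].lower().split())
    r_words = set(r["aspect"].lower().split())
    return len(g_words & r_words) / max(len(g_words | r_words), 1) > 0.5

def alignment_score(g: dict, r: dict) -> float:
    # Placeholder: ExPerT scores content and writing-style alignment with
    # an LLM; here every matched pair simply scores 1.0.
    return 1.0

def expert_sketch(generated: str, reference: str) -> float:
    # 1) Extract atomic aspects and evidence from both texts.
    gen_aspects = extract_aspects(generated)
    ref_aspects = extract_aspects(reference)

    # 2) Match aspects across the texts and score each matched pair.
    matched = []
    for g in gen_aspects:
        for r in ref_aspects:
            if aspects_match(g, r):
                matched.append(alignment_score(g, r))
                break

    # 3) Aggregate into an F1-like score over both aspect sets
    #    (assumed aggregation, for illustration only).
    if not gen_aspects or not ref_aspects or not matched:
        return 0.0
    precision = sum(matched) / len(gen_aspects)  # matched share of generated aspects
    recall = sum(matched) / len(ref_aspects)     # matched share of reference aspects
    return 2 * precision * recall / (precision + recall)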
You can install ExPerT using the following pip command:
pip install expert-score==0.0.1
Using ExPerT is as simple as:
import expert_score

score = expert_score.expert(
    inputs=[...],  # A list of input strings from the users
    outputs=[...],  # A list of outputs generated by a model for the users
    references=[...],  # A list of reference outputs for the users
    model_name="google/gemma-2-27b-it",  # The LLM to use as ExPerT's backbone
    cache_dir="/path/to/cache/dir",  # The cache directory
    max_generated_output_length=512,  # Maximum number of tokens considered from each generated output
    max_evaluator_length=8192,  # Maximum number of tokens available to ExPerT's LLM backbone
    max_retries=100,  # Maximum retries when the backbone produces out-of-format output before failing
    ignore_on_fail=True,  # Skip individual aspects if the model fails to produce well-formatted output (rare)
    google_llm=False,  # Set to True to use a Google LLM API
    openai_llm=False,  # Set to True to use an OpenAI LLM API
    api_key="api/key",  # The API key, required when using the OpenAI or Google LLM API
)
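As a concrete illustration, a minimal call for a single user might look like the following. The texts are toy data, and we assume the remaining parameters keep the defaults shown above:

import expert_score

# Toy example with a single user (illustrative data only).
score = expert_score.expert(
    inputs=["Write a review of a sci-fi novel I recently read."],
    outputs=["The novel blends hard science with a tense mystery plot..."],
    references=["I loved how the book grounded its mystery in real physics..."],
    model_name="google/gemma-2-27b-it",
    cache_dir="./cache",
)
print(score)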
You can see an example of ExPerT's evaluation in this notebook.
If you use ExPerT, please cite the following paper:
@misc{salemi2025experteffectiveexplainableevaluation,
title={ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation},
author={Alireza Salemi and Julian Killingback and Hamed Zamani},
year={2025},
eprint={2501.14956},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.14956},
}
This work was supported in part by the Center for Intelligent Information Retrieval, in part by the NSF Graduate Research Fellowships Program (GRFP) Award #1938059, in part by Google, and in part by Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.