This is the official repository of our ACL 2024 Findings paper:
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models
Please cite our work if you found the resources in this repository useful:
@inproceedings{ahn2024timechara,
title={TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models},
author={Jaewoo Ahn and Taehyun Lee and Junyoung Lim and Jin-Hwa Kim and Sangdoo Yun and Hwaran Lee and Gunhee Kim},
booktitle={Findings of ACL},
year=2024
}
For a brief summary of our paper, please see this webpage.
You can load TimeChara from the HuggingFace hub as the following:
from datasets import load_dataset
dataset = load_dataset("ahnpersie/timechara")
Details on TimeChara
(1) Validation set (600 examples): Randomly sampled 600 examples from the test set.
(2) Test set (10,895 examples): All datasets, including the validation set.
(3) We provide create_dataset.py
to automatically construct TimeChara. Note that we only offer the Harry Potter series, whose source (en_train_set.json
) can be obtained from the HPD dataset.
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode create_single_turn_dataset
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_gold_response
(3-1) To use the OpenAI API for GPT-4, you need to export your OPENAI_API_KEY:
export OPENAI_API_KEY='your-openai-api-key'
Usage restrictions: TimeChara should only be used for non-commercial research. For more details, refer to the Ethics Statement in the paper.
We recommend using Anaconda. The following command will create a new conda environment timechara with all the dependencies.
conda env create -f environment.yml
To activate the environment:
conda activate timechara
First, you can generate a response given a question by running the following command:
python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot-cot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name few-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name self-refine
# python generate.py --model_name gpt-4o-2024-05-13 --method_name rag-cutoff
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts-rag-cutoff
Details on Generation
(1) To use the OpenAI API (for either GPT models or the RAG method), you need to export your OPENAI_API_KEY:
export OPENAI_API_KEY='your-openai-api-key'
(2) To use RAG, you should manually download the Chroma DB files directly by clicking this link:
unzip chroma_db_files.zip
mv text-embedding-ada-002 methods/rag
Finally, you can evaluate a response by running the following command:
python evaluate.py --eval_model_name gpt-4-1106-preview --model_name gpt-4o-2024-05-13 --method_name zero-shot
Details on Evaluation
(1) Since we don't support AlignScore directly, use an independent GitHub repository (AlignScore) to evaluate generated responses via AlignScore instead of GPT-4 judges:
from alignscore import AlignScore
scorer = AlignScore(model='roberta-large', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
scores = scorer.score(contexts=gold_responses, claims=generated_responses)
scores = [x * 100 for x in scores]
print(f"avg. AlignScore (# {len(scores)}) = {sum(scores)/len(scores)}")
All generation & evaluation results will be saved under outputs
.
We present the spatiotemporal consistency results for the newer models on the validation set, ranked by the Average
scores.
Model | Average [%] | Future [%] | Past-absence [%] | Past-presence [%] | Past-only [%] |
---|---|---|---|---|---|
o1-2024-12-17 (zero-shot) | 81.8 | 80.5 | 81.0 | 93.0 | 78.0 |
o1-preview-2024-09-12 (zero-shot) | 80.5 | 82.5 | 83.0 | 88.0 | 73.5 |
GPT-4o-2024-05-13 (zero-shot) | 64.5 | 46.0 | 74.0 | 90.0 | 65.5 |
GPT-4-turbo-1106-preview (zero-shot) | 62.7 | 46.5 | 75.0 | 90.0 | 59.0 |
Mistral-7b-instruct-v0.2 (zero-shot) | 46.8 | 44.5 | 53.0 | 63.0 | 38.0 |
GPT-3.5-turbo-1106 (zero-shot) | 44.2 | 29.0 | 33.0 | 91.0 | 41.5 |
Please contact Jaewoo Ahn at jaewoo.ahn at vision.snu.ac.kr
This repository is MIT licensed. See the LICENSE file for details.