- Project page for our paper "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
- We investigated several variants of the selfcheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting.
- [Nov 2023] SelfCheckGPT-NLI Calibration Analysis thanks to Daniel Huynh [Link to Article]
- [Oct 2023] The paper is accepted and to appear at EMNLP 2023 [Poster]
- [Aug 2023] Slides from ML Collective Talk [Link to Slides]
pip install selfcheckgpt
There are three variants of SelfCheck scores in this package as described in the paper: SelfCheckBERTScore()
, SelfCheckMQAG()
, SelfCheckNgram()
. All of the variants have predict()
which will output the sentence-level scores w.r.t. sampled passages. You can use packages such as spacy to split passage into sentences. For reproducibility, you can set torch.manual_seed
before calling this function. See more details in Jupyter Notebook demo/SelfCheck_demo1.ipynb
# Include necessary packages (torch, spacy, ...)
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.
# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level & Split it into sentences
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)
['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']
# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
sentences = sentences, # list of sentences
passage = passage, # passage (before sentence-split)
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
num_questions_per_sent = 5, # number of questions to be drawn
scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
beta1 = 0.8, beta2 = 0.8, # additional params depending on scoring_method
)
print(sent_scores_mqag)
# [0.30990949 0.42376232]
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)
# [0.0695562 0.45590915]
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual
# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded
sent_scores_ngram = selfcheck_ngram.predict(
sentences = sentences,
passage = passage,
sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant
# 'avg_neg_logprob': [3.184312, 3.279774],
# 'max_neg_logprob': [3.476098, 4.574710]
# },
# 'doc_level': { # document-level score such that avg_neg_logprob is computed over all tokens
# 'avg_neg_logprob': 3.218678904916201,
# 'avg_max_neg_logprob': 4.025404834169327
# }
# }
Entailment (or Contradiction) score with input being the sentence and a sampled passage can be used as the selfcheck score. We use DeBERTa-v3-large fine-tuned to Multi-NLI, and we normalize the probability of "entailment" or "contradiction" classes, and take Prob(contradiction) as the score.
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available
sent_scores_nli = selfcheck_nli.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106 ] -- based on the example above
Prompting an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup. We query an LLM to assess whether the i-th sentence is supported by the sample (as the context). Similar to other methods, a higher score indicates higher chance of being hallucination. An example when using Mistral is below:
# Option1: open-source model
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)
# Option2: API access
# (currently only support OpenAI and Groq)
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="groq", model="llama3-70b-8192", api_key="your-api-key")
sent_scores_prompt = selfcheck_prompt.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
verbose = True, # whether to show a progress bar
)
print(sent_scores_prompt)
# [0.33333333, 0.66666667] -- based on the example above
The LLM can be any model available on HuggingFace. The default prompt template is Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer:
, but you can change it using selfcheck_prompt.set_prompt_template(new_prompt)
.
Most models (gpt-3.5-turbo, Llama2, Mistral) will output either 'Yes' or 'No' >95% of the time, while any remaining outputs can be set to N/A. The output is converted to score: Yes -> 0.0, No -> 1.0, N/A -> 0.5. The inconsistency score is then calculated by averaging.
The wiki_bio_gpt3_hallucination
dataset currently consists of 238 annotated passages (v3
). You can find more information in the paper or our data card on HuggingFace: https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. To use this dataset, you can either load it through HuggingFace dataset API, or download it directly from below in the JSON format.
We've annotated GPT-3 wikibio passages further, and now the dataset consists of 238 annotated passages. Here is the link for the IDs of the first 65 passages in the v1
.
from datasets import load_dataset
dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
Download from our Google Drive, then you can load it in python:
import json
with open("dataset.json", "r") as f:
content = f.read()
dataset = json.loads(content)
Each instance consists of:
gpt3_text
: GPT-3 generated passagewiki_bio_text
: Actual Wikipedia passage (first paragraph)gpt3_sentences
:gpt3_text
split into sentences usingspacy
annotation
: human annotation at the sentence levelwiki_bio_test_idx
: ID of the concept/individual from the original wikibio dataset (testset)gpt3_text_samples
: list of sampled passages (do_sample = True & temperature = 1.0)
As described in our paper, probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. Check our example/implementation of this approach in demo/experiments/probability-based-baselines.ipynb
- Full details can be found in our paper.
- Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at text inconsistency assessment. Based on this finding, we try SelfCheckGPT-Prompt where each sentence (to be evaluated) is compared against each and every sampled_passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.
Results on the wiki_bio_gpt3_hallucination
dataset.
Method | NonFact (AUC-PR) | Factual (AUC-PR) | Ranking (PCC) |
---|---|---|---|
Random Guessing | 72.96 | 27.04 | - |
GPT-3 Avg(-logP) | 83.21 | 53.97 | 57.04 |
SelfCheck-BERTScore | 81.96 | 44.23 | 58.18 |
SelfCheck-QA | 84.26 | 48.14 | 61.07 |
SelfCheck-Unigram | 85.63 | 58.47 | 64.71 |
SelfCheck-NLI | 92.50 | 66.08 | 74.14 |
SelfCheck-Prompt (Llama2-7B-chat) | 89.05 | 63.06 | 61.52 |
SelfCheck-Prompt (Llama2-13B-chat) | 91.91 | 64.34 | 75.44 |
SelfCheck-Prompt (Mistral-7B-Instruct-v0.2) | 91.31 | 62.76 | 74.46 |
SelfCheck-Prompt (gpt-3.5-turbo) | 93.42 | 67.09 | 78.32 |
MQAG (Multiple-choice Question Answering and Generation) was proposed in our previous work. Our MQAG implementation is included in this package, which can be used to: (1) generate multiple-choice questions, (2) answer multiple-choice questions, (3) obtain MQAG score.
from selfcheckgpt.modeling_mqag import MQAG
mqag_model = MQAG()
It has three main functions: generate()
, answer()
, score()
. We show an example usage in demo/MQAG_demo1.ipynb
This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.
@article{manakul2023selfcheckgpt,
title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
journal={arXiv preprint arXiv:2303.08896},
year={2023}
}