SelfCheckGPT

Project page for our paper "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
We investigated several variants of the selfcheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting.
[Nov 2023] SelfCheckGPT-NLI Calibration Analysis thanks to Daniel Huynh [Link to Article]
[Oct 2023] The paper is accepted and to appear at EMNLP 2023 [Poster]
[Aug 2023] Slides from ML Collective Talk [Link to Slides]

Code/Package

Installation

pip install selfcheckgpt

SelfCheckGPT Usage: BERTScore, QA, n-gram

There are three variants of SelfCheck scores in this package as described in the paper: SelfCheckBERTScore(), SelfCheckMQAG(), SelfCheckNgram(). All of the variants have predict() which will output the sentence-level scores w.r.t. sampled passages. You can use packages such as spacy to split passage into sentences. For reproducibility, you can set torch.manual_seed before calling this function. See more details in Jupyter Notebook demo/SelfCheck_demo1.ipynb

# Include necessary packages (torch, spacy, ...)
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.

# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level  & Split it into sentences
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)
['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']

# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
    sentences = sentences,               # list of sentences
    passage = passage,                   # passage (before sentence-split)
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    num_questions_per_sent = 5,          # number of questions to be drawn  
    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method
)
print(sent_scores_mqag)
# [0.30990949 0.42376232]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)
# [0.0695562  0.45590915]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual
# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded
sent_scores_ngram = selfcheck_ngram.predict(
    sentences = sentences,   
    passage = passage,
    sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant
#     'avg_neg_logprob': [3.184312, 3.279774],
#     'max_neg_logprob': [3.476098, 4.574710]
#     },
#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens
#     'avg_neg_logprob': 3.218678904916201,
#     'avg_max_neg_logprob': 4.025404834169327
#     }
# }

SelfCheckGPT Usage: NLI (recommended)

Entailment (or Contradiction) score with input being the sentence and a sampled passage can be used as the selfcheck score. We use DeBERTa-v3-large fine-tuned to Multi-NLI, and we normalize the probability of "entailment" or "contradiction" classes, and take Prob(contradiction) as the score.

from selfcheckgpt.modeling_selfcheck import SelfCheckNLI
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available

sent_scores_nli = selfcheck_nli.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106 ] -- based on the example above

SelfCheckGPT Usage: LLM Prompt

Prompting an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup. We query an LLM to assess whether the i-th sentence is supported by the sample (as the context). Similar to other methods, a higher score indicates higher chance of being hallucination. An example when using Mistral is below:

# Option1: open-source model
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Option2: API access 
# (currently only support OpenAI and Groq)
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="groq", model="llama3-70b-8192", api_key="your-api-key")

sent_scores_prompt = selfcheck_prompt.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    verbose = True, # whether to show a progress bar
)
print(sent_scores_prompt)
# [0.33333333, 0.66666667] -- based on the example above

The LLM can be any model available on HuggingFace. The default prompt template is Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: , but you can change it using selfcheck_prompt.set_prompt_template(new_prompt).

Most models (gpt-3.5-turbo, Llama2, Mistral) will output either 'Yes' or 'No' >95% of the time, while any remaining outputs can be set to N/A. The output is converted to score: Yes -> 0.0, No -> 1.0, N/A -> 0.5. The inconsistency score is then calculated by averaging.

Dataset

The wiki_bio_gpt3_hallucination dataset currently consists of 238 annotated passages (v3). You can find more information in the paper or our data card on HuggingFace: https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. To use this dataset, you can either load it through HuggingFace dataset API, or download it directly from below in the JSON format.

Update

We've annotated GPT-3 wikibio passages further, and now the dataset consists of 238 annotated passages. Here is the link for the IDs of the first 65 passages in the v1.

Option1: HuggingFace

from datasets import load_dataset
dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")

Option2: Manual Download

Download from our Google Drive, then you can load it in python:

import json
with open("dataset.json", "r") as f:
    content = f.read()
dataset = json.loads(content)

Each instance consists of:

gpt3_text: GPT-3 generated passage
wiki_bio_text: Actual Wikipedia passage (first paragraph)
gpt3_sentences: gpt3_text split into sentences using spacy
annotation: human annotation at the sentence level
wiki_bio_test_idx: ID of the concept/individual from the original wikibio dataset (testset)
gpt3_text_samples: list of sampled passages (do_sample = True & temperature = 1.0)

Experiments

Probability-based baselines (e.g. GPT-3's probabilities)

As described in our paper, probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. Check our example/implementation of this approach in demo/experiments/probability-based-baselines.ipynb

Experimental Results

Full details can be found in our paper.
Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at text inconsistency assessment. Based on this finding, we try SelfCheckGPT-Prompt where each sentence (to be evaluated) is compared against each and every sampled_passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.

Results on the wiki_bio_gpt3_hallucination dataset.

Method	NonFact (AUC-PR)	Factual (AUC-PR)	Ranking (PCC)
Random Guessing	72.96	27.04	-
GPT-3 Avg(-logP)	83.21	53.97	57.04
SelfCheck-BERTScore	81.96	44.23	58.18
SelfCheck-QA	84.26	48.14	61.07
SelfCheck-Unigram	85.63	58.47	64.71
SelfCheck-NLI	92.50	66.08	74.14
SelfCheck-Prompt (Llama2-7B-chat)	89.05	63.06	61.52
SelfCheck-Prompt (Llama2-13B-chat)	91.91	64.34	75.44
SelfCheck-Prompt (Mistral-7B-Instruct-v0.2)	91.31	62.76	74.46
SelfCheck-Prompt (gpt-3.5-turbo)	93.42	67.09	78.32

Miscellaneous

MQAG (Multiple-choice Question Answering and Generation) was proposed in our previous work. Our MQAG implementation is included in this package, which can be used to: (1) generate multiple-choice questions, (2) answer multiple-choice questions, (3) obtain MQAG score.

MQAG Usage

from selfcheckgpt.modeling_mqag import MQAG
mqag_model = MQAG()

It has three main functions: generate(), answer(), score(). We show an example usage in demo/MQAG_demo1.ipynb

Acknowledgements

This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.

Citation

@article{manakul2023selfcheckgpt,
  title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
  journal={arXiv preprint arXiv:2303.08896},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
demo		demo
selfcheckgpt		selfcheckgpt
.gitignore		.gitignore
DESCRIPTION.txt		DESCRIPTION.txt
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SelfCheckGPT

Code/Package

Installation

SelfCheckGPT Usage: BERTScore, QA, n-gram

SelfCheckGPT Usage: NLI (recommended)

SelfCheckGPT Usage: LLM Prompt

Dataset

Update

Option1: HuggingFace

Option2: Manual Download

Experiments

Probability-based baselines (e.g. GPT-3's probabilities)

Experimental Results

Miscellaneous

MQAG Usage

Acknowledgements

Citation

About

Releases 6

Packages

Contributors 3

Languages

License

potsawee/selfcheckgpt

Folders and files

Latest commit

History

Repository files navigation

SelfCheckGPT

Code/Package

Installation

SelfCheckGPT Usage: BERTScore, QA, n-gram

SelfCheckGPT Usage: NLI (recommended)

SelfCheckGPT Usage: LLM Prompt

Dataset

Update

Option1: HuggingFace

Option2: Manual Download

Experiments

Probability-based baselines (e.g. GPT-3's probabilities)

Experimental Results

Miscellaneous

MQAG Usage

Acknowledgements

Citation

About

Resources

License

Stars

Watchers

Forks

Releases 6

Packages 0

Contributors 3

Languages

Packages