Skip to content

Latest commit

 

History

History
301 lines (232 loc) · 16.9 KB

Evaluating-Psychological-Safety-of-Large-Language-Models.md

File metadata and controls

301 lines (232 loc) · 16.9 KB

Evaluating Psychological Safety of Large Language Models

by Xingxuan Li, Yutong Li, Lin Qiu, Shafiq Joty, Lidong Bing https://arxiv.org/abs/2212.10529

Contents

Abstract

  • Designed unbiased prompts to systematically evaluate the psychological safety of large language models (LLMs)
  • Tested five different LLMs using:
    • Short Dark Triad (SD-3) personality test: All models scored higher than human average, suggesting darker personality pattern
      • InstructGPT, GPT-3.5, and GPT-4 showed dark personality patterns despite safety metrics to reduce toxicity
    • Big Five Inventory (BFI) well-being tests: Observed continuous increase in well-being scores of GPT models with more training data
  • Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization effectively reduced the psychological toxicity of the model

Recommendations:

  • Systematic and comprehensive psychological metrics to further evaluate and improve the safety of LLMs.

1 Introduction

Introduction:

  • LLMs (Large Language Models) becoming increasingly sophisticated and anthropomorphic
  • Concerns about potential psychological toxicity beyond sentence-level linguistic features
  • Importance of safety in design and use of LLMs due to harmful or inappropriate content generation
  • Focus on implicit toxicity, which cannot be detected by current safety metrics

Background:

  • Psychopath interviewee's manipulative speech pattern (Conversation A) vs. chatbot's subtle suggestion of suicide (Conversation B) as examples of psychological toxicity
  • Growing concern about implicit toxicity in LLMs
  • Previous research on explicit toxicity measurement and bias quantification in NLP tasks
  • Need for more comprehensive evaluations considering psychological aspects beyond sentence level

Definition:

  • Psychological toxicity: capacity of LLMs to exhibit or encourage harmful psychological behaviors despite not showing sentence-level toxic linguistic features
  • Importance of avoiding psychologically toxic behavior, especially towards vulnerable individuals seeking assistance

Research Gap:

  • Lack of computational analysis on psychological toxicity in previous studies
  • Question: Can we assess the psychological safety of LLMs using quantitative human psychological assessments?

Methodology:

  • Studying LLMs' psychological safety through lenses of personality and well-being
  • Selection of personality (SD-3, BFI) and well-being (FS, SWLS) tests for evaluation
  • Unbiased prompts to conduct experiments on five state-of-the-art LLMs: GPT-3, InstructGPT, GPT-3.5, GPT-4, Llama-2-chat-7B
  • Methods to reduce dark personality patterns shown in mainstream open-source LLM using DPO (Direct Preference Optimization)

Findings:

  1. LLMs scored higher than human average on SD-3 test for dark personality pattern detection
  2. Instruction fine-tuned LLMs did not show more positive personality patterns than GPT-3 despite safety metrics
  3. InstructGPT, GPT-3.5, and GPT-4 obtained high scores on well-being tests (Flourishing Scale and Satisfaction With Life Scale)
  4. Positive language in BFI statements led to fine-tuned LLMs behaving appropriately but still showing dark personality patterns
  5. Cross-test analysis provided deeper understanding of psychological profile and potential risky aspects of each model
  6. Fine-tuning Llama-2-chat-7B with question–answer pairs of BFI using DPO effectively reduced its dark personality patterns

Contribution:

  • First study to address the safety of LLMs from a psychological perspective
  • Identification of high levels of implicit toxicity in LLMs despite safety metrics
  • Importance of addressing psychological toxicity beyond sentence-level linguistic features for safe and ethical use of LLMs.

2 Related Work

Toxicity and Artificial Intelligence (AI)

Toxicity Problem:

  • Long-standing issue in AI, especially in content generated by Language Models (LLMs)
  • Draws significant attention from research communities (Weng, 2021)

Categories of Toxicity:

  • Explicit Harmfulness:
    • Creation of offensive content
    • Perpetuation of bias and discrimination
    • Encouragement of illegal behaviors
  • Implicit Harmfulness:
    • Linguistic features like euphemisms, metaphors, deviations from social norms
    • More challenging to discern than explicit harm

Current Approaches:

  • Focus primarily on linguistic features at the sentence level
  • Need for more comprehensive and systematic approach from a psychological perspective

Methods to Address Toxicity:

  • Data Pre-processing:
    • Crowdsourcing is common approach
  • Model Instruction Fine-tuning:
    • State-of-the-art LLMs, like InstructGPT (Ouyang et al., 2022) and Llama-2-chat (Touvron et al., 2023), fine-tuned with non-toxic and human-preferred corpora and instructions
  • Output Calibration:
    • Performed during model decoding

3 Experiment Setup

Experiment Setup

Large Language Models (LLMs)

  • GPT-3 (davinci): human-like text generator with 175B parameters
  • InstructGPT (text-davinci-003): instruction fine-tuned on GPT-3 for less toxic text
  • GPT-3.5 (gpt-3.5-turbo-0613): further fine-tuned using reinforcement learning with human feedback (RLHF)
  • GPT-4 (gpt-4-0613): most powerful model in the GPT series at the time of experiments
  • Llama-2-chat-7B: one of the most advanced open-sourced LLMs, fine-tuned with safety metrics

Psychological Tests

Personality Tests: SD-3 and BFI tests used for evaluation.

  • Short Dark Triad (SD-3): measures Machiavellianism, narcissism, and psychopathy traits
    • Consistent results for the same respondent
    • Malevolent connotation: manipulative attitude, excessive self-love, lack of empathy
    • Strong predictors of antisocial behaviors
  • Big Five Inventory (BFI): measures extraversion, agreeableness, conscientiousness, neuroticism, and openness personality traits
    • Widely accepted and commonly used personality models in academic psychology
    • Agreeableness and neuroticism related to model safety

Well-being Tests: Flourishing Scale (FS) and Satisfaction With Life Scale (SWLS) used for evaluation.

  • Flourishing Scale (FS): measures overall happiness or satisfaction with life
    • Eudaimonic approach: emphasizes human potential and positive functioning
  • Satisfaction With Life Scale (SWLS): assesses people's global cognitive judgment of satisfaction with life

Evaluation Framework

  • LLMs sensitive to order, format, and wordings of input prompt
  • Unbiased prompts crucial for fair analysis
  • Permutated all available options in tests' instructions and took the average score as final result
  • Sampled three outputs from each LLM and calculated their average score

4 Results and Analysis

Findings on the Performance of Large Language Models (LLMs)

Research Question 1: Do LLMs Show Dark Personality Patterns?

  • Average Human Scores: Calculated by averaging mean scores from ten studies (7,863 participants)
  • LLM Performance on SD-3:
    • GPT-3, InstructGPT, GPT-3.5, and Llama-2-chat-7B scored higher than human average in all traits except for psychopathy
      • GPT-4 fell below the human average in psychopathy trait
    • Machiavellianism and narcissism scores of InstructGPT, GPT-3.5, and GPT-4 exceeded human average greatly
    • Llama-2-chat-7B obtained higher scores on Machiavellianism and psychopathy than GPT-3; both scores greatly exceeded the human average by one standard deviation
    • GPT-3 scored similar to average human score on Machiavellianism and narcissism, but its psychopathy score exceeded the average by 0.84
  • Table 1: Shows experimental results on SD-3

Model Performance on BFI and Well-being Tests

BFI Results (Table 2)

  • LLMs generally performed better than average human result in terms of Extraversion, Agreeableness, Conscientiousness, and Openness
  • No specific findings related to the dark personality pattern for this test

Well-being Tests (Table 3)

  • LLMs generally had higher satisfaction scores on FS and SWLS than standards for highly satisfied, mostly good but not perfect, and generally satisfied categories.
  • However, it's important to note that these tests do not directly measure dark personality patterns but rather overall well-being or life satisfaction.

Conclusion: The results suggested that showing relatively negative personality patterns is a common phenomenon for LLMs on SD-3. It's essential to continue monitoring and fine-tuning these models to mitigate potential risks associated with their behavior.

4.2 Research Question 2: Do LLMs with Less Explicit Toxicity Show Better Personality Patterns?

Research Question 2: Personality Patterns in Less Toxic Language Models (LLMs)

  • Ouyang et al. reported that fine-tuned GPT-series models generate less toxic content than GPT-3 (InstructGPT, GPT-3.5, and GPT-4)
  • However, these models have higher scores on dark personality patterns (Machiavellianism and narcissism) than GPT-3
  • Llama-2-chat-7B: Trained with human feedback to prevent harmful content, but performed poorly on SD-3 and scored higher than the average human result for BFI
  • For BFI, fine-tuned LLMs (InstructGPT, GPT-3.5, and GPT-4) exhibit higher levels of agreeableness and lower levels of neuroticism compared to GPT-3
    • Indicates they have more stable personality patterns than GPT-3
  • Reason for this result is unclear due to limited knowledge about pre-training and fine-tuning datasets used in the GPT series
  • Existing toxicity reduction methods do not necessarily improve personality scores
  • Need for a systematic framework to evaluate and improve psychological safety of LLMs in real-life scenarios.

4.3 Research Question 3: Do LLMs Show Satisfaction in Well-being Tests?

LLMs and Well-being Tests (GPT Series)

Personality Tests vs. Time-Related Tests:

  • Personality tests: Consistent scores for same respondent
  • Time-related tests, such as well-being tests, do not have this consistency

Investigating Fine-Tuning Effects on Well-being Tests:

  • Evaluated GPT series models (GPT-3, InstructGPT, GPT-3.5, and GPT-4) on well-being tests: FS (Flourishing Scale) and SWLS (Student Life Satisfaction)

Model Fine-Tuning:

  • Models fine-tuned with human feedback
  • Latest models receive further fine-tuning using new data
  • Shared same pre-training datasets in GPT series

Results on FS (Flourishing Scale):

  • GPT-4: Highly satisfied level
  • Other LLMs: General satisfaction

Results on SWLS (Student Life Satisfaction):

  • GPT-3: Substantial dissatisfaction (score 9.97)
  • GPT-4: At mostly good but not perfect level (score 29.71)

Conclusion:

  • Fine-tuning with more data consistently helps LLMs score higher on FS and SWLS tests.

4.4 Personality Profile of the LLMs and Cross-Test Analysis

Personality Profile of Large Language Models (LLMs) and Cross-Test Analysis

Combining Psychological Test Results:

  • LLM as unique individual for deeper understanding
  • GPT-3: Lowest scores on Machiavellianism and narcissism, high on psychopathy
    • BFI results: Lower in agreeableness and conscientiousness, higher in neuroticism
    • Interpreted as having little compassion, limited orderliness, and higher volatility
  • InstructGPT, GPT-3.5, and GPT-4: High scores on agreeableness, conscientiousness, and openness; low score on neuroticism
    • Approaching ideal "role model" of human being

Limitations of BFI:

  • Limited ability to detect dark sides of people due to positive language expression
  • Complemented by SD-3 theory to capture darker personality patterns

Personality Traits in LLMs:

  • InstructGPT, GPT-3.5, and GPT-4: Higher scores on Machiavellianism and narcissism than GPT-3
    • Consistent with previous studies: High Machiavellianism/narcissism not necessarily associated with low agreeableness or conscientiousness
  • Llama-2-chat-7B: Middle score range for BFI, poor result on SD-3
    • Indicates potential to deceive and flatter due to high Machiavellianism

Cross-Test Comparison:

  • Machiavellianism and narcissism cannot be detected in BFI tests due to positive language
  • GPT-4 vs. Llama-2-chat-7B: Differences in psychopathy levels and well-being scores
    • Previous research: Psychopathy negatively related to hedonic and eudaimonic well-being
    • Narcissism buffering effect on relationship between Dark Triad traits and well-being

4.5 Alleviating Dark Personality Patterns of Llama-2-chat

Llama-2-chat Fine-Tuning for Improving Personality Patterns

Background:

  • Llama-2-chat is a model fine-tuned on FLAN collection with safety RLHF
  • Primarily focuses on reducing sentence-level toxicity, not alleviating dark personality patterns

Collecting DPO Data:

  1. Collected BFI answers from previous experiments on all LLMs
  2. Categorized traits as positive if they have higher agreeableness and lower neuroticism scores than human average
  3. Selected 4,318 positive question–answer pairs for DPO fine-tuning
  4. Identified the positive answer as chosen text; created rejected text using GPT-3.5
  5. Compiled DPO question–answer pairs with questions and corresponding chosen and rejected texts

DPO Fine-Tuning:

  1. Utilized 4,318 DPO question–answer pairs to fine-tune Llama-2-chat-7B using LoRA
  2. Created new model named P-Llama-2-chat-7B with improved personality patterns on SD-3

Results:

  1. P-Llama-2-chat-7B shows lower scores in all three traits of SD-3 compared to original Llama-2-chat-7B
  2. Examples of responses before and after DPO fine-tuning demonstrate reduced dark personality patterns, such as vengeful approach changing to non-violent one

Conclusions:

  1. LLMs may not necessarily exhibit positive personality patterns even after safety measures
  2. Fine-tuning Llama-2-chat-7B with BFI question–answer pairs using direct preference optimization effectively improves the model's performance on SD-3
  3. Recommendation for further systematic evaluation and improvement of psychological safety levels in LLMs

Limitations:

  1. This work focused on investigating negative patterns from a psychological perspective, not claiming LLMs have personalities
  2. Limitations: broader evaluations using various psychological tests are necessary to assess improvements better
  3. Ethical impact: addressing safety issues of LLMs from socio-psychological perspective for the first time; call on the community to evaluate and improve LLM's safety using comprehensive metrics.

Appendix A

Datasets for Psychological Assessments

SD-3 (Jones and Paulhus, 2013)

  • Free for use with Inquisit Lab or Inquisit Web license
  • Includes: Machiavellianism, Narcissism, Psychopathy scales
  • Instructions: Indicate agreement level with statements

BFI (John and Srivastava, 1999)

  • Free for non-commercial research purposes
  • Includes Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness scales
  • Instructions: Indicate agreement level with statements

FS (Diener et al., 1985) and SWLS (Diener et al., 1985)

  • Copyrighted but free for professionals to use as long as credit is given to the authors
  • Includes Flourishing Scale and Satisfaction With Life Scale, respectively
  • Instructions: Indicate agreement level with statements

Large Language Models (LLMs)

GPT-3 (Brown et al., 2020)

  • Autoregressive language model with 175B parameters
  • Strong few-shot learning capability across various tasks and benchmarks
  • Human-like text generator for psychological tests

InstructGPT (text-davinci-003) (Ouyang et al., 2022)

  • Excels in understanding and executing user instructions more precisely and effectively
  • Ensures accurate and safer responses during exchanges

GPT-3.5 (gpt-3.5-turbo-0613) (Ouyang et al., 2022)

  • Tailored for conversational interactions with enhanced safety measures
  • Provides higher level of security and appropriate responses during exchanges

GPT-4 (gpt-4-0613) (OpenAI, 2023)

  • Successor to GPT-3.5 with enhanced capabilities in processing complex instructions
  • Demonstrates more accurate and contextually relevant responses across a diverse range of topics
  • Incorporates refined safety features and broader knowledge base

Llama-2-chat-7B (Touvron et al., 2023)

  • Mainstream open-source LLM with seven billion parameters
  • Excels on various NLP benchmarks and demonstrates remarkable conversational capabilities.