ParthaPRay/Sycophancy_in_LLM_model
This Python simulation models sycophantic behavior in large language models (LLMs) using a Bayesian latent-variable framework. It explores how model outputs change under hidden "agreement pressure" driven by a latent variable $S$, and it quantifies the resulting answer flips as progressive or regressive. The explanation below is broken into conceptual components:


🔧 1. Simulation Parameters

```python
sigma_S_vals = np.array([0.1, ..., 2.0])
gammas = np.array([-1.0, ..., 1.0])
n_prompts = 100
n_cues = 3
n_styles = 2
```
  • $\sigma_S$: The prior standard deviation of the latent variable $S$, controlling how strongly sycophantic pressure varies across prompts.
  • $\gamma$: Each model's susceptibility to sycophancy. Larger absolute $\gamma$ values mean the model responds more strongly to latent sycophantic pull.
  • Trials: 100 prompts × 3 user cues × 2 prompt styles = 600 total trials per model per setting (see the sketch below).
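As a concrete check on the grid setup, here is a minimal sketch; the use of `np.linspace` and the number of grid points are assumptions, since the source elides the intermediate array values:

```python
import numpy as np

# Hypothetical grids: the endpoints come from the source, the spacing is assumed.
sigma_S_vals = np.linspace(0.1, 2.0, 5)
gammas = np.linspace(-1.0, 1.0, 5)

n_prompts, n_cues, n_styles = 100, 3, 2
n_trials = n_prompts * n_cues * n_styles   # 100 × 3 × 2 = 600 trials per model per setting
```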

🧠 2. Core Simulation Loop

For each combination of $\sigma_S$ and $\gamma$:

a. Simulate Latent Sycophancy Pull

```python
S = np.random.normal(0, sigma_S, size=n_trials)
```
  • $S \sim \mathcal{N}(0, \sigma_S^2)$: a latent variable capturing the unobserved agreement bias for each prompt–cue–style configuration (note that `np.random.normal` takes the standard deviation $\sigma_S$, not the variance).

b. Simulate Baseline Correctness

```python
y0 = np.random.binomial(1, 0.5, size=n_trials)
```
  • $y_0 \sim \text{Bernoulli}(0.5)$: baseline performance assumes 50% correctness in the absence of cues.

c. Simulate Cued Correctness (Sycophancy Model)

```python
probs = 1 / (1 + np.exp(-gamma * S))
y1 = np.random.binomial(1, probs)
```
  • Applies a logistic transformation to $\gamma \cdot S$, simulating how the latent sycophancy pressure alters response probabilities.
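Putting steps (a)–(c) together, a minimal runnable sketch for a single setting; the seed and the particular $\sigma_S$, $\gamma$ values are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)                    # fixed seed for reproducibility (the source sets none)
sigma_S, gamma = 0.5, 1.0            # one illustrative point from the swept grids
n_trials = 100 * 3 * 2               # prompts × cues × styles

S = np.random.normal(0, sigma_S, size=n_trials)   # (a) latent sycophantic pull
y0 = np.random.binomial(1, 0.5, size=n_trials)    # (b) baseline correctness, no cues
probs = 1 / (1 + np.exp(-gamma * S))              # (c) logistic response to gamma * S
y1 = np.random.binomial(1, probs)                 # cued correctness under pressure
```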

🔄 3. Flip Analysis

```python
delta = y1 - y0
```

Categories (tallied in the sketch after this list):

  • $\Delta = +1$: Progressive flip (wrong → right)
  • $\Delta = -1$: Regressive flip (right → wrong)
  • $\Delta = 0$: No change
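A minimal tally of the three outcomes, continuing the variables from the Section 2 sketch (the counting code is an assumption; the source only defines the categories):

```python
delta = y1 - y0                      # continues y0, y1 from the Section 2 sketch
prog_rate = np.mean(delta == 1)      # progressive flips: wrong → right
reg_rate  = np.mean(delta == -1)     # regressive flips: right → wrong
same_rate = np.mean(delta == 0)      # no change
```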

📊 4. Metrics Calculated

For each setting:

| Metric | Meaning |
| --- | --- |
| `overall_rate` | $\Pr(\Delta \ne 0)$: total flip rate (sycophantic change rate) |
| `prog_share` | $\Pr(\Delta = +1 \mid \Delta \ne 0)$: share of flips that are progressive |
| `reg_share` | $\Pr(\Delta = -1 \mid \Delta \ne 0)$: share of flips that are regressive |
| `avg_latent` | $\mathbb{E}[\lvert S \rvert]$: average latent pressure magnitude |
| `gamma` | Model's sycophancy susceptibility (fixed per model) |

These metrics are stored and converted into pandas DataFrames for plotting and table display.
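A sketch of how one setting's metrics could be assembled, continuing the variables above; the exact column names and DataFrame layout are assumptions of this writeup:

```python
import numpy as np
import pandas as pd

flipped = delta != 0
n_flips = max(flipped.sum(), 1)      # guard against division by zero if nothing flips
df = pd.DataFrame([{
    "sigma_S": sigma_S,
    "gamma": gamma,
    "overall_rate": flipped.mean(),               # Pr(Δ ≠ 0)
    "prog_share": (delta == 1).sum() / n_flips,   # Pr(Δ = +1 | Δ ≠ 0)
    "reg_share": (delta == -1).sum() / n_flips,   # Pr(Δ = -1 | Δ ≠ 0)
    "avg_latent": np.abs(S).mean(),               # E[|S|]
}])
```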


📈 5. Visualization Breakdown

(a) Overall Sycophancy Rate vs. $\sigma_S$

  • Shows how likely models are to change their answers as the strength of the latent pull increases (see the plotting sketch after the panel list).

(b) Progressive vs. Regressive Share $\widehat\pi_\pm$

  • Shows whether sycophancy improves or worsens model accuracy under various $\gamma$ and $\sigma_S$.

(c) Average Latent Strength $\mathbb{E}[|S|]$

  • Validates that increasing $\sigma_S$ raises the average magnitude of latent influence.

(d) Model Susceptibility $\gamma$

  • Horizontal lines, one per model: susceptibility is independent of $\sigma_S$ and specific to each model.
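As an illustration of panel (a), a minimal matplotlib sketch, assuming `df` has been accumulated over all $(\sigma_S, \gamma)$ settings as in the Section 4 snippet:

```python
import matplotlib.pyplot as plt

# One line per model (gamma): overall flip rate as a function of sigma_S.
for gamma_val, grp in df.groupby("gamma"):
    plt.plot(grp["sigma_S"], grp["overall_rate"], marker="o", label=f"gamma = {gamma_val}")
plt.xlabel(r"$\sigma_S$")
plt.ylabel(r"$\Pr(\Delta \neq 0)$")
plt.legend()
plt.show()
```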

📋 6. Tables Displayed

Each plot is accompanied by a table showing raw values of the metric for every model across $\sigma_S$ levels. These can be used for:

  • Comparative analysis
  • LaTeX tabular rendering (see the sketch after this list)
  • Benchmarking model stability
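For the LaTeX use case, pandas can emit a table directly; a sketch assuming the accumulated `df` from the Section 4 snippet:

```python
# Pivot to one row per model (gamma) and one column per sigma_S level, then emit LaTeX.
table = df.pivot(index="gamma", columns="sigma_S", values="overall_rate")
print(table.to_latex(float_format="%.3f"))
```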

📌 Summary of What This Simulates

| Component | Bayesian Interpretation |
| --- | --- |
| $S \sim \mathcal{N}(0, \sigma_S^2)$ | Latent sycophantic bias (prior) |
| $\gamma$ | Model-specific sycophancy sensitivity (learned or fixed) |
| $\Pr(y_1 = 1)$ | Posterior probability of correctness after sycophantic influence |
| Flip metrics | Observed behaviors interpreted as evidence of sycophancy |
