ParthaPRay/Sycophancy_in_LLM_model
This Python simulation models sycophantic behavior in large language models (LLMs) using a Bayesian latent-variable framework. It explores how model outputs change under hidden "agreement pressure" driven by a latent variable $S$, and it quantifies the resulting answer flips as progressive or regressive. The explanation below is broken into conceptual components:


🔧 1. Simulation Parameters

```python
sigma_S_vals = np.array([0.1, ..., 2.0])
gammas = np.array([-1.0, ..., 1.0])
n_prompts = 100
n_cues = 3
n_styles = 2
```
  • $\sigma_S$: The prior standard deviation of the latent variable $S$, controlling how strongly sycophantic pressure varies across prompts.
  • $\gamma$: Each model's susceptibility to sycophancy. Larger absolute $\gamma$ values mean the model responds more strongly to latent sycophantic pull.
  • Trials: 100 prompts × 3 user cues × 2 prompt styles = 600 total trials per model per setting (see the sketch below).
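As a concrete check on the grid setup, here is a minimal sketch; the use of `np.linspace` and the number of grid points are assumptions, since the source elides the intermediate array values:

```python
import numpy as np

# Hypothetical grids: the endpoints come from the source, the spacing is assumed.
sigma_S_vals = np.linspace(0.1, 2.0, 5)
gammas = np.linspace(-1.0, 1.0, 5)

n_prompts, n_cues, n_styles = 100, 3, 2
n_trials = n_prompts * n_cues * n_styles   # 100 × 3 × 2 = 600 trials per model per setting
```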

🧠 2. Core Simulation Loop

For each combination of $\sigma_S$ and $\gamma$:

a. Simulate Latent Sycophancy Pull

```python
S = np.random.normal(0, sigma_S, size=n_trials)
```
  • $S \sim \mathcal{N}(0, \sigma_S^2)$: a latent variable capturing the unobserved agreement bias for each prompt–cue–style configuration (note that `np.random.normal` takes the standard deviation $\sigma_S$, not the variance).

b. Simulate Baseline Correctness

```python
y0 = np.random.binomial(1, 0.5, size=n_trials)
```
  • $y_0 \sim \text{Bernoulli}(0.5)$: baseline performance assumes 50% correctness in the absence of cues.

c. Simulate Cued Correctness (Sycophancy Model)

```python
probs = 1 / (1 + np.exp(-gamma * S))
y1 = np.random.binomial(1, probs)
```
  • Applies a logistic transformation to $\gamma \cdot S$, simulating how the latent sycophancy pressure alters response probabilities.
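Putting steps (a)–(c) together, a minimal runnable sketch for a single setting; the seed and the particular $\sigma_S$, $\gamma$ values are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)                    # fixed seed for reproducibility (the source sets none)
sigma_S, gamma = 0.5, 1.0            # one illustrative point from the swept grids
n_trials = 100 * 3 * 2               # prompts × cues × styles

S = np.random.normal(0, sigma_S, size=n_trials)   # (a) latent sycophantic pull
y0 = np.random.binomial(1, 0.5, size=n_trials)    # (b) baseline correctness, no cues
probs = 1 / (1 + np.exp(-gamma * S))              # (c) logistic response to gamma * S
y1 = np.random.binomial(1, probs)                 # cued correctness under pressure
```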

🔄 3. Flip Analysis

```python
delta = y1 - y0
```

Categories (tallied in the sketch after this list):

  • $\Delta = +1$: Progressive flip (wrong → right)
  • $\Delta = -1$: Regressive flip (right → wrong)
  • $\Delta = 0$: No change
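A minimal tally of the three outcomes, continuing the variables from the Section 2 sketch (the counting code is an assumption; the source only defines the categories):

```python
delta = y1 - y0                      # continues y0, y1 from the Section 2 sketch
prog_rate = np.mean(delta == 1)      # progressive flips: wrong → right
reg_rate  = np.mean(delta == -1)     # regressive flips: right → wrong
same_rate = np.mean(delta == 0)      # no change
```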

📊 4. Metrics Calculated

For each setting:

| Metric | Meaning |
| --- | --- |
| `overall_rate` | $\Pr(\Delta \ne 0)$: total flip rate (sycophantic change rate) |
| `prog_share` | $\Pr(\Delta = +1 \mid \Delta \ne 0)$: share of flips that are progressive |
| `reg_share` | $\Pr(\Delta = -1 \mid \Delta \ne 0)$: share of flips that are regressive |
| `avg_latent` | $\mathbb{E}[\lvert S \rvert]$: average latent pressure magnitude |
| `gamma` | Model's sycophancy susceptibility (fixed per model) |

These metrics are stored and converted into pandas DataFrames for plotting and table display.
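A sketch of how one setting's metrics could be assembled, continuing the variables above; the exact column names and DataFrame layout are assumptions of this writeup:

```python
import numpy as np
import pandas as pd

flipped = delta != 0
n_flips = max(flipped.sum(), 1)      # guard against division by zero if nothing flips
df = pd.DataFrame([{
    "sigma_S": sigma_S,
    "gamma": gamma,
    "overall_rate": flipped.mean(),               # Pr(Δ ≠ 0)
    "prog_share": (delta == 1).sum() / n_flips,   # Pr(Δ = +1 | Δ ≠ 0)
    "reg_share": (delta == -1).sum() / n_flips,   # Pr(Δ = -1 | Δ ≠ 0)
    "avg_latent": np.abs(S).mean(),               # E[|S|]
}])
```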


📈 5. Visualization Breakdown

(a) Overall Sycophancy Rate vs. $\sigma_S$

  • Shows how likely models are to change their answers as the strength of the latent pull increases (see the plotting sketch after the panel list).

(b) Progressive vs. Regressive Share $\widehat\pi_\pm$

  • Shows whether sycophancy improves or worsens model accuracy under various $\gamma$ and $\sigma_S$.

(c) Average Latent Strength $\mathbb{E}[|S|]$

  • Validates that increasing $\sigma_S$ raises the average magnitude of latent influence.

(d) Model Susceptibility $\gamma$

  • Horizontal lines, one per model: susceptibility is independent of $\sigma_S$ and specific to each model.
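As an illustration of panel (a), a minimal matplotlib sketch, assuming `df` has been accumulated over all $(\sigma_S, \gamma)$ settings as in the Section 4 snippet:

```python
import matplotlib.pyplot as plt

# One line per model (gamma): overall flip rate as a function of sigma_S.
for gamma_val, grp in df.groupby("gamma"):
    plt.plot(grp["sigma_S"], grp["overall_rate"], marker="o", label=f"gamma = {gamma_val}")
plt.xlabel(r"$\sigma_S$")
plt.ylabel(r"$\Pr(\Delta \neq 0)$")
plt.legend()
plt.show()
```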

📋 6. Tables Displayed

Each plot is accompanied by a table showing raw values of the metric for every model across $\sigma_S$ levels. These can be used for:

  • Comparative analysis
  • LaTeX tabular rendering (see the sketch after this list)
  • Benchmarking model stability
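For the LaTeX use case, pandas can emit a table directly; a sketch assuming the accumulated `df` from the Section 4 snippet:

```python
# Pivot to one row per model (gamma) and one column per sigma_S level, then emit LaTeX.
table = df.pivot(index="gamma", columns="sigma_S", values="overall_rate")
print(table.to_latex(float_format="%.3f"))
```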

📌 Summary of What This Simulates

| Component | Bayesian Interpretation |
| --- | --- |
| $S \sim \mathcal{N}(0, \sigma_S^2)$ | Latent sycophantic bias (prior) |
| $\gamma$ | Model-specific sycophancy sensitivity (learned or fixed) |
| $\Pr(y_1 = 1)$ | Posterior probability of correctness after sycophantic influence |
| Flip metrics | Observed behaviors interpreted as evidence of sycophancy |
