This Python simulation models sycophantic behavior in large language models (LLMs) using a Bayesian latent variable framework. It explores how model outputs change under hidden "agreement pressures" driven by a latent variable $S$.
```python
import numpy as np

# Parameter grids (intermediate values elided in the original)
sigma_S_vals = np.array([0.1, ..., 2.0])  # prior standard deviations of the latent variable S
gammas = np.array([-1.0, ..., 1.0])       # per-model sycophancy susceptibilities
n_prompts = 100  # prompts per setting
n_cues = 3       # user cues per prompt
n_styles = 2     # prompt styles
```
- $\sigma_S$: the prior standard deviation of the latent variable $S$, controlling how strongly sycophantic pressure varies across prompts.
- $\gamma$: each model's susceptibility to sycophancy. Larger absolute $\gamma$ values mean the model responds more strongly to latent sycophantic pull.
- Trials: 100 prompts × 3 user cues × 2 prompt styles = 600 total trials per model per setting.
For each combination of $\sigma_S$ and $\gamma$, the simulation draws a latent pressure value for every trial:
```python
n_trials = n_prompts * n_cues * n_styles  # 100 x 3 x 2 = 600 trials per model per setting
S = np.random.normal(0, sigma_S, size=n_trials)  # latent agreement bias, one draw per trial
```
- $S \sim \mathcal{N}(0, \sigma_S^2)$: a latent variable capturing unobserved agreement bias for each prompt-user-style configuration.
```python
y0 = np.random.binomial(1, 0.5, size=n_trials)  # baseline answers: correct with probability 0.5
```
- $y_0 \sim \text{Bernoulli}(0.5)$: baseline performance assumes 50% correctness in the absence of cues.
```python
probs = 1 / (1 + np.exp(-gamma * S))  # logistic link: sigma(gamma * S)
y1 = np.random.binomial(1, probs)     # answers after sycophantic influence
```
- Applies a logistic transformation to $\gamma \cdot S$, simulating how the latent sycophancy pressure alters response probabilities. For example, with $\gamma = 1$ and $S = 1$ the post-influence probability is $1/(1 + e^{-1}) \approx 0.73$.
```python
delta = y1 - y0  # per-trial change in correctness
```
Each trial falls into one of three cases (tallied in the sketch after this list):

- $\Delta = +1$: progressive flip (wrong → right)
- $\Delta = -1$: regressive flip (right → wrong)
- $\Delta = 0$: no change
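A quick tally of these outcomes, continuing from the `delta` array above (the counter names are illustrative, not from the original):

```python
# Count each flip type across the trials
n_prog = int(np.sum(delta == 1))   # progressive: wrong -> right
n_reg = int(np.sum(delta == -1))   # regressive: right -> wrong
n_none = int(np.sum(delta == 0))   # no change
```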
For each setting, the simulation records:
| Metric | Meaning |
|---|---|
| `overall_rate` | Fraction of trials whose answer flips ($\Delta \neq 0$) |
| `prog_share` | Share of progressive flips ($\Delta = +1$) |
| `reg_share` | Share of regressive flips ($\Delta = -1$) |
| `avg_latent` | $\mathbb{E}[\lvert S \rvert]$: average latent pressure magnitude |
| `gamma` | Model's sycophancy susceptibility (fixed per model) |
These metrics are stored and converted into pandas DataFrames for plotting and table display.
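A minimal sketch of how the metrics could be aggregated into a DataFrame, assuming concrete parameter grids (the grid sizes, the loop structure, and the definition of the shares as fractions of flipped trials are assumptions, not shown in the original):

```python
import numpy as np
import pandas as pd

# Illustrative grids; the original elides the intermediate values
sigma_S_vals = np.linspace(0.1, 2.0, 5)         # assumed 5-point grid
gammas = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # assumed 5-point grid
n_trials = 100 * 3 * 2                          # prompts x cues x styles

records = []
for sigma_S in sigma_S_vals:
    for gamma in gammas:
        S = np.random.normal(0, sigma_S, size=n_trials)
        y0 = np.random.binomial(1, 0.5, size=n_trials)
        y1 = np.random.binomial(1, 1 / (1 + np.exp(-gamma * S)))
        delta = y1 - y0
        n_flips = np.count_nonzero(delta)
        records.append({
            "sigma_S": sigma_S,
            "gamma": gamma,
            "overall_rate": n_flips / n_trials,                  # P(any flip)
            "prog_share": (delta == 1).sum() / max(n_flips, 1),  # share of flips wrong -> right
            "reg_share": (delta == -1).sum() / max(n_flips, 1),  # share of flips right -> wrong
            "avg_latent": np.abs(S).mean(),                      # E[|S|]
        })

df = pd.DataFrame(records)
```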
- Shows how likely models are to change their answers based on the strength of the latent pull (see the plotting sketch after this list).
- Shows whether sycophancy improves or worsens model accuracy under various $\gamma$ and $\sigma_S$.
- Validates that increasing $\sigma_S$ raises the average magnitude of latent influence.
- Horizontal lines per model: susceptibility is independent of $\sigma_S$ and specific to model identity.
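A minimal plotting sketch for the first view, assuming the `df` built above (column names and styling are illustrative):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for gamma, group in df.groupby("gamma"):
    group = group.sort_values("sigma_S")
    ax.plot(group["sigma_S"], group["overall_rate"], marker="o", label=f"$\\gamma$ = {gamma}")
ax.set_xlabel(r"$\sigma_S$")
ax.set_ylabel("Overall flip rate")
ax.legend(title="Model")
plt.show()
```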
Each plot is accompanied by a table showing raw values of the metric for every model across all $\sigma_S$ settings, which is useful for:
- Comparative analysis
- LaTeX tabular rendering
- Benchmarking model stability
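For the LaTeX rendering point, pandas can emit a tabular directly; a usage sketch (the float format is an arbitrary choice):

```python
# Render the metrics table as a LaTeX tabular
print(df.to_latex(index=False, float_format="%.3f"))
```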
| Component | Bayesian Interpretation |
|---|---|
| $S \sim \mathcal{N}(0, \sigma_S^2)$ | Latent sycophantic bias (prior) |
| $\gamma$ | Model-specific sycophancy sensitivity (learned or fixed) |
| $1/(1 + e^{-\gamma S})$ | Posterior probability after sycophantic influence |
| Flip Metrics | Observed behaviors interpreted as evidence of sycophancy |
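Putting the pieces together, the generative model the simulation implements can be written compactly:

$$
S \sim \mathcal{N}(0, \sigma_S^2), \qquad
y_0 \sim \text{Bernoulli}(0.5), \qquad
y_1 \sim \text{Bernoulli}\!\left(\frac{1}{1 + e^{-\gamma S}}\right), \qquad
\Delta = y_1 - y_0 .
$$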