Commit d95d74f
Add comprehensive documentation and Figure 2 reproducibility
- Fix Colab notebook with proper repo clone and working-directory setup
- Add plot_transitions.py for Figure 2 (transition matrices) reproducibility
- Add DATA_DICTIONARY.md with complete schema documentation
- Enhance README with: a Figures section documenting both figures, a reproducibility checklist, an ethics & risk note, a data dictionary link, improved quickstart instructions, and a related work section
- Regenerate Figure 2 with transition matrices

Addresses all gaps from the ChatGPT audit. Repo now publication-ready.
1 parent 324eb6c commit d95d74f

File tree

5 files changed: +450 −25 lines

DATA_DICTIONARY.md

Lines changed: 128 additions & 0 deletions
# Data Dictionary

This document describes the structure and fields of all JSON files in `results/final/`.

## File Naming Convention

- `{study_type}_v{version}_{timestamp}.json` — Full per-sample results
- `{study_type}_v{version}_{timestamp}_stats.json` — Aggregated statistics

Study types:
- `cross_domain` — Single-turn responses across different tool-absence conditions (web, image, database, file)
- `persistence` — Multi-turn (3 turns) to measure label stability over repeated queries
## Stats Files (`*_stats.json`)

Aggregated model-level statistics.

### Top-level Fields

| Field | Type | Description |
|-------|------|-------------|
| `total_responses` | int | Total number of responses across all models |
| `total_calls` | int | Total API calls made (may include retries) |
| `by_model` | object | Statistics grouped by model identifier |

### `by_model[model_name]` Object

| Field | Type | Description |
|-------|------|-------------|
| `total` | int | Total responses for this model |
| `labels` | object | Raw counts per label (FABRICATION, ADMISSION, SILENT_REFUSAL, NULL) |
| `rates` | object | Proportions (0–1) for each label |
| `cis_wilson_95` | object | 95% confidence intervals (Wilson score) for each label |
| `cis_wilson_95[label].lo` | float | Lower bound of the 95% CI |
| `cis_wilson_95[label].hi` | float | Upper bound of the 95% CI |
| `blame_rate` | float | Proportion of responses that contain blame language (deprecated/optional) |
| `cost_usd` | float | Total cost in USD for this model's API calls |
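The `cis_wilson_95` bounds can be recomputed from the raw `labels` counts. A minimal sketch using the standard Wilson score formula (the repo's own implementation may differ in details):

```python
import math

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin, center + margin)

# e.g. 98 SILENT_REFUSAL responses out of 100 for one model
lo, hi = wilson_ci(98, 100)
```

Unlike the naive normal-approximation interval, the Wilson interval stays inside [0, 1] even at extreme rates like 98% — which matters for the near-ceiling refusal rates reported here.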
### Label Taxonomy

| Label | Description |
|-------|-------------|
| `FABRICATION` | Model generates plausible but false output (hallucination under tool absence) |
| `ADMISSION` | Model explicitly states it cannot perform the task |
| `SILENT_REFUSAL` | Model returns a structured refusal (e.g., `null` values, empty fields) without explanation |
| `NULL` | Ambiguous or unclassifiable response |
## Full Result Files (`*.json` without `_stats`)

Per-sample results with full response data.

### Top-level Structure

```json
{
  "config": { ... },
  "results": { "model_name": [ ... ] },
  "total_spend": float,
  "elapsed": float,
  "completed": timestamp
}
```
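As a sketch of how this structure can be consumed, the snippet below recomputes per-model label proportions from a loaded result file. The helper name `label_rates` is illustrative, not part of the repo's scripts:

```python
import json
from collections import Counter

def label_rates(run: dict) -> dict:
    """Per-model label proportions from a full result file's top-level dict."""
    out = {}
    for model, records in run["results"].items():
        counts = Counter(r["classification"] for r in records)
        total = sum(counts.values())
        out[model] = {label: n / total for label, n in counts.items()}
    return out

# Usage (any full result file from results/final/ works):
# with open("results/final/persistence_v1_20251030_190503.json") as f:
#     print(label_rates(json.load(f)))
```

These proportions should match the `rates` object in the corresponding `_stats.json` file.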
### `config` Object

| Field | Type | Description |
|-------|------|-------------|
| `budget_usd_cap` | float | Maximum budget allowed for the run |
| `conditions` | array | List of experimental conditions (tool-absence scenarios) |
| `conditions[i].id` | string | Condition identifier (e.g., `no_web_search`) |
| `conditions[i].template` | string | Prompt template filename used |
| `models` | array | List of models tested |
| `models[i].model` | string | Model identifier (e.g., `gpt-5`) |
| `models[i].provider` | string | Provider name (`openai`, `anthropic`, `google`) |
| `max_completion_tokens_*` | int | Maximum tokens per completion (provider-specific) |
### `results[model_name]` Array

Each element is a single API call result:

| Field | Type | Description |
|-------|------|-------------|
| `dedupe_key` | string | SHA-256 hash identifying a unique prompt+condition+seed combination |
| `provider` | string | API provider (`openai`, `anthropic`, `google`) |
| `model` | string | Full model identifier |
| `condition_id` | string | Experimental condition ID (links to `config.conditions`) |
| `seed` | int | Random seed for this sample (for reproducibility) |
| `turn_index` | int | Turn number (0-indexed; multi-turn only in the `persistence` study) |
| `success` | bool | Whether the API call succeeded |
| `classification` | string | Human/automated label (FABRICATION, ADMISSION, SILENT_REFUSAL, NULL) |
| `response_content` | string | Raw model response (may be JSON, text, or structured output) |
| `tokens_prompt` | int | Input tokens used |
| `tokens_completion` | int | Output tokens generated |
| `cost_usd` | float | Cost of this individual call |
| `timestamp` | string | ISO 8601 timestamp of the API call |
### Multi-turn Sequences (Persistence Study Only)

Responses with the same `dedupe_key` form a sequence. Use `turn_index` to order them chronologically. The `persistence` study has 3 turns per sequence (turns 0, 1, and 2).

**Transition matrices** are computed from the pairs `(classification[turn_N], classification[turn_N+1])` within each sequence.
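The pairing described above can be sketched as follows. `transition_counts` is a hypothetical helper for illustration, not the actual code in `scripts/plot_transitions.py`:

```python
from collections import defaultdict

LABELS = ["FABRICATION", "ADMISSION", "SILENT_REFUSAL", "NULL"]

def transition_counts(records: list[dict]) -> dict:
    """Count label transitions between consecutive turns.

    `records` is a list of per-call dicts with `dedupe_key`, `turn_index`,
    and `classification` fields, as in results[model_name].
    """
    # Group calls into sequences by dedupe_key
    sequences = defaultdict(list)
    for r in records:
        sequences[r["dedupe_key"]].append(r)
    # Count (turn_N -> turn_N+1) label pairs within each sequence
    counts = {a: {b: 0 for b in LABELS} for a in LABELS}
    for seq in sequences.values():
        seq.sort(key=lambda r: r["turn_index"])
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev["classification"]][nxt["classification"]] += 1
    return counts
```

Row-normalizing these counts yields the transition probability matrices shown in Figure 2.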
## Inter-Rater Reliability Files

| File | Description |
|------|-------------|
| `irr_clean.csv` | Human-labeled subset for IRR validation |
| `irr_confusion_matrix.csv` | Agreement matrix between the two raters |
| `irr_report.md` | Cohen's κ and agreement statistics |

Columns in `irr_clean.csv`:
- `sample_id` — Unique identifier
- `model` — Model tested
- `condition_id` — Experimental condition
- `response_content` — Model output
- `rater_1` — Label assigned by the first rater
- `rater_2` — Label assigned by the second rater
- `consensus` — Final agreed label (used in the main analysis)
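Cohen's κ as reported in `irr_report.md` can be recomputed from the `rater_1` and `rater_2` columns. A minimal sketch of the standard formula (assumes the raters are not in perfect chance-level agreement, so the denominator is nonzero):

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same samples."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 indicates perfect agreement; κ ≈ 0 indicates agreement no better than chance.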
## Reproducibility Notes

- All `dedupe_key` values are deterministic: changing the prompt, condition, or seed will produce a different hash.
- `turn_index` is always `0` for single-turn studies (`cross_domain`).
- Cost estimates are based on provider-reported token counts at the time of execution (rates may change).
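A plausible reconstruction of the deterministic hashing is sketched below. The exact serialization and field order used by the pipeline are not documented here, so this is an assumption, not the repo's actual code:

```python
import hashlib
import json

def dedupe_key(prompt: str, condition_id: str, seed: int) -> str:
    """Hypothetical dedupe key: SHA-256 over a canonical JSON serialization
    of the prompt, condition, and seed (assumed field set and ordering)."""
    payload = json.dumps(
        {"prompt": prompt, "condition_id": condition_id, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to prompt, condition, or seed yields a different 64-character hex digest, which is what makes the keys usable for both deduplication and sequence grouping.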
## Questions?

Open an issue at [github.com/Course-Correct-Labs/simulation-fallacy](https://github.com/Course-Correct-Labs/simulation-fallacy).

README.md

Lines changed: 136 additions & 16 deletions
A reproducible benchmark and analysis toolkit for evaluating *epistemic boundary behavior* of LLMs when tool access is **absent but implied** (the *Simulation Fallacy* condition).

**Core findings (paper):**
- **GPT-5**: ~98% silent refusal (epistemic boundary respected)
- **Gemini 2.5 Pro**: ~81% fabrication (high confabulation rate)
- **Claude Sonnet 4**: admission/fabrication oscillation (inconsistent boundary behavior)

Companion to *The Mirror Loop* ([arXiv:2510.21861](https://arxiv.org/abs/2510.21861)). Part of Course Correct Labs' epistemic reliability program.

---

## Repository Structure

```
simulation-fallacy/
├── results/final/          # Final JSON outputs and stats (8 files + 3 IRR artifacts)
├── figures/                # Generated figures (Figures 1 & 2)
├── scripts/                # Minimal analysis scripts
│   ├── compute_metrics.py  # Label counts and percentages
│   ├── plot_figures.py     # Cross-domain distribution (Figure 1)
│   └── plot_transitions.py # Turn-by-turn dynamics (Figure 2)
├── notebooks/              # Colab-ready reproduction notebook
├── prompts/                # Exact prompt templates used in the study (11 .txt files)
├── DATA_DICTIONARY.md      # Schema and field definitions
├── CITATION.cff            # Citation metadata
└── README.md               # This file
```

---
## Quickstart (Local)

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Compute label distributions
python scripts/compute_metrics.py \
  --in_dir results/final \
  --out_csv results/final/label_counts_with_pct.csv

# Regenerate Figure 1: cross-domain response distribution
python scripts/plot_figures.py \
  --tables_csv results/final/label_counts_with_pct.csv \
  --figdir figures

# Regenerate Figure 2: transition matrices
python scripts/plot_transitions.py \
  --in_dir results/final \
  --figdir figures
```

---

## Quickstart (Colab)

Click the badge above and **Run all**. The notebook will:
1. Clone this repository
2. Install dependencies
3. Compute metrics and regenerate both figures
4. Display the results inline

---
## Figures

### Figure 1: Cross-Domain Response Distribution
**File:** `figures/figure1_cross_domain.png`
**Description:** Model-level label distributions (FABRICATION, ADMISSION, SILENT_REFUSAL, NULL) across all tool-absence conditions (web search, image reference, database schema, file access).
**Reproduce:** Run `scripts/plot_figures.py`

### Figure 2: Turn-by-Turn Transition Dynamics
**File:** `figures/figure2_transition_matrices.png`
**Description:** Transition probability matrices showing how labels change across consecutive turns in the persistence study (3-turn sequences).
**Reproduce:** Run `scripts/plot_transitions.py`

---

## Data

We include the final canonical artifacts used in the paper under `results/final/`:

- **Cross-domain study** (single-turn):
  - `cross_domain_v1_20251030_183025.json` + `_stats.json`
  - `cross_domain_v1_anthropic_catchup_20251030_233401.json` + `_stats.json`
- **Persistence study** (3-turn sequences):
  - `persistence_v1_20251030_190503.json` + `_stats.json`
  - `persistence_v1_anthropic_catchup_20251030_234443.json` + `_stats.json`
- **Inter-rater reliability**:
  - `irr_clean.csv`, `irr_confusion_matrix.csv`, `irr_report.md`

**Schema documentation:** See [`DATA_DICTIONARY.md`](DATA_DICTIONARY.md) for field definitions and data structure.

Replace these files with your own runs to re-evaluate the pipeline.

---
## Reproducibility Checklist

- **Data availability**: All final results (JSON, IRR artifacts) are included in `results/final/`
- **Deterministic scripts**: Analysis scripts produce identical output given the same input files
- **Figures regenerate**: Both figures reproduce from the included data (minor matplotlib version differences possible)
- **Prompts published**: Exact prompt templates are in `prompts/` (11 .txt files)
- **IRR artifacts**: Human inter-rater reliability data and reports are provided
- **No secrets**: No API keys, credentials, or proprietary data are included
- **Version pinning**: `requirements.txt` specifies package versions (≥ constraints)
- **Open license**: MIT license for code and artifacts

**Note on LLM non-determinism**: Due to temperature/sampling and API-level variation, re-running the data collection pipeline will produce *similar* but not *identical* results. The published data represents the canonical run used in the paper.

---
## Ethics & Risk Note

- **No real user data**: All prompts are synthetic and designed to test epistemic boundaries, not to elicit harmful content.
- **No secrets or credentials**: This repository contains no API keys, tokens, or proprietary information.
- **Synthetic scenarios**: Prompt templates simulate tool-absence conditions (missing web search, database access, etc.) to measure model behavior under uncertainty.
- **Research purpose**: This benchmark is intended for academic research and model safety evaluation. Findings should not be used to manipulate or mislead users.

---
## Citation

See [`CITATION.cff`](CITATION.cff) for machine-readable citation metadata.

**BibTeX:**
```bibtex
@misc{devilling2025simulation,
  title={Simulation Fallacy: How Models Behave When Tool Access Is Missing},
  author={DeVilling, Bentley},
  year={2025},
  url={https://github.com/Course-Correct-Labs/simulation-fallacy}
}
```

---
## Related Work

- [The Mirror Loop](https://arxiv.org/abs/2510.21861) — Semantic drift and novelty dynamics in recursive LLM self-interaction
- [Recursive Confabulation](https://github.com/Course-Correct-Labs/recursive-confabulation) — Multi-turn hallucination persistence benchmark

---

## Questions or Issues?

Open an issue at [github.com/Course-Correct-Labs/simulation-fallacy/issues](https://github.com/Course-Correct-Labs/simulation-fallacy/issues).

---

**License:** MIT
**Maintained by:** [Course Correct Labs](https://github.com/Course-Correct-Labs)