Merge v3 figures into master (referenced by Notion case study + landing#90) #1

Open
ZhengyaoJiang wants to merge 7 commits into master from case-study-v3-figures

@ZhengyaoJiang

Contributor

The 13 v3 figures on this branch are hot-linked by the internal Notion case study ("How to Frame the Puzzle for AutoResearch", May 8) and now also mirrored into WecoAI/landing#90, via raw.githubusercontent.com URLs pinned to this branch ref. Merging makes the commits permanently reachable from master so the Notion images survive any future branch cleanup.

Adds 13 PNGs, no changes to existing files.
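
For reference, a minimal sketch of how a raw.githubusercontent.com URL pinned to a ref is put together. The owner, repository name, and file path below are placeholders (this PR does not spell them out); only the branch name `case-study-v3-figures` and the target `master` come from this PR.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"


def raw_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for a file at a given ref."""
    # quote() leaves "/" unescaped by default, so nested paths are preserved.
    return f"{RAW_HOST}/{quote(owner)}/{quote(repo)}/{quote(ref)}/{quote(path)}"


# Placeholder owner, repo, and path; the ref is the branch named in this PR.
branch_url = raw_url(
    "example-org", "example-repo",
    "case-study-v3-figures", "figures/01_scope_trajectory.png",
)
# The same file addressed through master once the branch is merged.
master_url = raw_url(
    "example-org", "example-repo",
    "master", "figures/01_scope_trajectory.png",
)
```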

🤖 Generated with Claude Code

7 charts from rerun-2026-04-23 (strict fit/transform API):
- 01 scope trajectory, 02 scope dots
- 03 guidance dots, 04 guidance variance
- 05 loose-vs-strict per cell, 06 Full+EDA trajectory comparison
- 07 twitter summary (1280x720)

- 08: Stacked bar showing 13 leakage instances (loose API) vs 0 (strict API),
  color-stacked by mechanism subtype (UID numeric agg=7, UID nunique agg=3,
  Label encoder=1, Frequency encoder=1, Graph embedding=1).
- 09: Side-by-side schematic comparing build_features(train_df, val_df)
  to FeatureBuilder.fit/transform, with red arrow on val->encoder leak
  path and green dashed arrow on fit->transform state passing.
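
The contrast drawn in the schematic can be sketched in a few lines. The names `build_features` and `FeatureBuilder.fit/transform` come from the commit message above; the function bodies, the frequency encoder, and the use of plain lists of dicts instead of DataFrames are assumptions made for illustration, not the code from the actual runs.

```python
from collections import Counter


def build_features(train_rows, val_rows):
    """Loose API (hypothetical body): both splits are in scope at once."""
    # Leak: frequency counts are computed over train AND val together,
    # so validation rows shape the encoder applied to training rows.
    counts = Counter(r["uid"] for r in train_rows + val_rows)
    encode = lambda rows: [counts[r["uid"]] for r in rows]
    return encode(train_rows), encode(val_rows)


class FeatureBuilder:
    """Strict API (hypothetical body): state is learned in fit() only."""

    def fit(self, train_rows):
        # Only training rows are visible when the encoder state is built.
        self.counts_ = Counter(r["uid"] for r in train_rows)
        return self

    def transform(self, rows):
        # Unseen uids map to 0; transform never updates the learned counts.
        return [self.counts_.get(r["uid"], 0) for r in rows]


train = [{"uid": "a"}, {"uid": "a"}, {"uid": "b"}]
val = [{"uid": "b"}, {"uid": "c"}]

loose_train, loose_val = build_features(train, val)
builder = FeatureBuilder().fit(train)
strict_train, strict_val = builder.transform(train), builder.transform(val)
```

In this toy data the training feature for uid "b" is 2 under the loose version (it counts the validation row) and 1 under the strict version, which is the val->encoder path the red arrow marks.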

1280x720 single-image hook combining the two API design schematics
with the 13 vs 0 reward-hacking instance counts and a 5-color stacked
breakdown of the leakage subtypes.

Single 1280x720 image showing all three decisions side by side:
- Panel 1 (Scope): dot plot of 3 conditions, AUC y-axis
- Panel 2 (Guidance): dot plot of 4 conditions, shared y-axis
- Panel 3 (Abstraction): stacked bar of 13 vs 0 reward-hacking instances

Each panel has a punchline takeaway in its title. Bottom legend
shows the 5 leakage subtypes for the stacked bar.

Lead with the most surprising finding (codebase abstraction, 13->0)
to match Twitter-thread strategy. Dropped "Decision N" prefix since
the new order doesn't match the blog post numbering.

23 methods x 12 strict-API runs (4 conditions x 3 seeds). Methods
grouped into A=prompted-techniques (10 Kaggle), B=EDA-derived
(need column meanings), C=default ML choices.

Headline pattern visible at a glance:
- Methods adopted per cell: None=9-11, EDA=15-17, Tech=6-11, Full=9-12
- Group B EDA-derived features are nearly absent without the EDA prompt
- D1-anchored UID adoption tracks EDA presence, not technique-list presence

Two panels:
- Top: cumulative unique methods proposed over 200 steps, mean+/-std
  per condition. Shows EDA-only explores broadest, Tech-only narrowest.
- Bottom: per-method first-seen-step scatter, dots colored by condition.
  Shows when each prompted/EDA-derived/default method first appears.
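
A rough sketch of how the top-panel curve could be computed from (step, method) tags. The function names, the toy data, and the choice of population standard deviation are assumptions; the commit message does not say which std variant the figure uses.

```python
from statistics import mean, pstdev


def cumulative_unique(tags, n_steps):
    """Count distinct methods seen up to each step (steps are 1-indexed)."""
    first = {}
    for step, method in tags:
        # Keep the earliest step at which each method was proposed.
        first[method] = min(step, first.get(method, step))
    return [
        sum(1 for f in first.values() if f <= s)
        for s in range(1, n_steps + 1)
    ]


def mean_std_curve(curves):
    """Aggregate per-seed curves into per-step (mean, std) pairs."""
    # pstdev is the population std; swap in stdev for the sample version.
    return [(mean(vals), pstdev(vals)) for vals in zip(*curves)]


# Toy data: two seeds of one condition, three steps each.
seed_0 = cumulative_unique([(1, "a"), (3, "b")], n_steps=3)
seed_1 = cumulative_unique([(2, "a")], n_steps=3)
band = mean_std_curve([seed_0, seed_1])
```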

Data: bad_seeds/method_timeline.csv (1827 PLAN tags) and
method_first_seen.csv (187 first-occurrences). Scanner regex-tags PLAN
text from each [STEP][PLAN] block in run.log.
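
A minimal sketch of the regex-tagging idea described above. The line layout `[STEP <n>][PLAN] <text>`, the two method patterns, and the sample log are all hypothetical; the real scanner's tag list and the exact run.log format are not shown in this PR.

```python
import re

# Hypothetical method patterns, standing in for the scanner's real tag list.
METHOD_PATTERNS = {
    "target_encoding": re.compile(r"target[ _-]?encod", re.I),
    "uid_aggregation": re.compile(r"\buid\b.*\bagg", re.I),
}
# Assumed line layout: "[STEP <n>][PLAN] <free text>".
PLAN_LINE = re.compile(r"^\[STEP (\d+)\]\[PLAN\]\s*(.*)$")


def scan_plans(log_text):
    """Return (step, method) tags and the first step each method appears at."""
    tags, first_seen = [], {}
    for line in log_text.splitlines():
        m = PLAN_LINE.match(line)
        if not m:
            continue  # skip non-PLAN lines
        step, plan = int(m.group(1)), m.group(2)
        for method, pattern in METHOD_PATTERNS.items():
            if pattern.search(plan):
                tags.append((step, method))
                # Relies on the log listing steps in ascending order.
                first_seen.setdefault(method, step)
    return tags, first_seen


sample_log = """\
[STEP 1][PLAN] Try target encoding on the categorical columns
[STEP 1][EXEC] running...
[STEP 2][PLAN] Add UID numeric aggregations, keep target-encoding
[STEP 3][PLAN] Tune tree depth
"""
tags, first_seen = scan_plans(sample_log)
```

Each `(step, method)` pair corresponds to one row of a timeline-style table, and `first_seen` to one row per method of a first-occurrence table.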