- Story-style chain-of-thought (CoT) prompting was tested against direct answering and stepwise CoT on GSM8K and AQuA subsets (illustrative prompt templates below).
- Using a small open model (Qwen2.5-0.5B-Instruct) on 10-example subsets of each dataset, story CoT did not beat stepwise CoT, and self-consistency over stories offered no gain.
- Full report: see `REPORT.md`.
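
To make the three prompting conditions concrete, here is a minimal Python sketch of the templates; these are illustrative framings only, not the exact prompts used in `story_cot_experiment.py`.

```python
# Illustrative prompt templates for the three conditions; the exact wording
# used by the harness may differ.

def direct_prompt(question: str) -> str:
    # Direct answering: no reasoning requested.
    return f"{question}\nGive only the final answer."

def stepwise_cot_prompt(question: str) -> str:
    # Standard stepwise chain of thought.
    return f"{question}\nLet's think step by step, then state the final answer."

def story_cot_prompt(question: str) -> str:
    # Story-style chain of thought: reasoning narrated as a short story.
    return (
        f"{question}\nTell a short story in which a character works through "
        "this problem, and end with the final answer."
    )
```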
- Stepwise CoT achieved the highest accuracy on both datasets (0.3, vs. 0.1–0.2 for story CoT).
- Story framing often hurt answer formatting (e.g., missing option letters on AQuA) and did not improve arithmetic.
- Self-consistency (k=3, temperature 0.7) over stories failed to outperform single-shot generation; a sketch of the extraction and voting logic follows.
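
The voting step can be summarized as below. This is a minimal sketch that assumes regex extraction of an option letter (AQuA) or the last number in the generation (GSM8K); the harness's actual parsing in `story_cot_experiment.py` may differ.

```python
import re
from collections import Counter

def extract_answer(generation: str) -> str | None:
    # Prefer an explicit option letter (AQuA); fall back to the last number (GSM8K).
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-E])\)?\b", generation, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistent_answer(generations: list[str]) -> str | None:
    # Majority vote over extracted answers from k sampled generations
    # (k=3, temperature 0.7 in the reported run).
    votes = [a for g in generations if (a := extract_answer(g)) is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```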
- Activate the environment:
  ```bash
  source .venv/bin/activate
  ```
- Run experiments (HF fallback, small model):
  ```bash
  python -m research_workspace.story_cot_experiment --provider hf --hf_model Qwen/Qwen2.5-0.5B-Instruct --gsm8k_n 10 --aqua_n 10 --save_dir results
  ```
- Set `--model` to an API model (e.g., `gpt-4.1`) and ensure `OPENAI_API_KEY` is set for API runs.
- Analyze and plot:
  ```bash
  python -m research_workspace.analyze_results --raw_path results/raw_outputs.jsonl --save_dir results
  ```
- Outputs: `results/metrics.json`, `results/analysis.json`, and plots in `results/plots/`.
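
To inspect the outputs programmatically, a minimal sketch is below; the JSON schemas are whatever `analyze_results` writes, so check the printed keys rather than relying on any names assumed here.

```python
import json
from pathlib import Path

results = Path("results")
metrics = json.loads((results / "metrics.json").read_text())
analysis = json.loads((results / "analysis.json").read_text())

# Dump both files to see the actual keys produced by analyze_results.
print(json.dumps(metrics, indent=2))
print(json.dumps(analysis, indent=2))
```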
- `planning.md` – experiment design.
- `REPORT.md` – full report with results.
- `src/research_workspace/story_cot_experiment.py` – data loading, prompting, evaluation harness.
- `src/research_workspace/analyze_results.py` – stats + plots.
- `results/` – raw generations, metrics, analysis, plots.
- `datasets/` – local GSM8K and AQuA copies (not in git).
- The current run used Qwen2.5-0.5B-Instruct because no OpenAI API key was available; token usage is therefore not recorded in the metrics.
- GPU detected (`cuda:0`); adjust `--hf_model` or `device_map` if running CPU-only (see the loading sketch below).
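
For reference, a hedged sketch of how the HF model can be loaded with explicit device handling; this is not the harness's exact loading code, and `device_map="auto"` additionally requires the `accelerate` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
use_gpu = torch.cuda.is_available()  # the reported run saw cuda:0

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # "auto" places weights on the GPU when present; None keeps CPU defaults.
    device_map="auto" if use_gpu else None,
    torch_dtype=torch.float16 if use_gpu else torch.float32,
)
```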