EIDOS

Emergent Intelligence via Distributed Organisational Systems

Status: Active (v7.7) — Lab report · Changelog · Research paper (HTML) · TruthfulQA eval · Live pilot · Multi-model eval · Paper eval commands · Releases

EIDOS is a laboratory prototype reasoning agent built from cognitive science primitives — not from transformer architectures or token prediction. Instead of learning statistical text patterns, EIDOS implements mechanisms drawn from neuroscience: hierarchical predictive coding, global workspace broadcasting, Hebbian association learning, attentional gating, and intrinsic curiosity reward. It is a transparent, numpy-only system designed to explore how biological cognition might be computationally reconstructed.

At a glance


Experiments	29 controlled tests (v1 → v7.7)
Unit tests	111+ (pytest)
Benchmark	17 ambiguous QA + EIDOS-Eval + TruthfulQA + mixed N=50
Latest	v7.7 — reflection baseline + 6-model mixed eval
Hybrid	LLM proposes → EIDOS gates (`HybridEidosAgent`)

git clone https://github.com/yesYescaca/eidos.git && cd eidos
pip install -r requirements.txt
pytest tests/ && python run_all_experiments.py
py -m benchmark.ambiguous_qa.runner
py -m eval.eidos_eval.runner
py -m eval.eidos_eval.live_runner --provider groq --truthfulqa
py -m eval.eidos_eval.live_runner --provider groq --mixed
py -m eval.eidos_eval.run_multimodel_eval --provider groq

py run_live_eval.py --provider groq --mixed --modes llm_alone llm_cot llm_reflection eidos_belief


## v7.7 — Reflection Baseline + Extended Models

- **`llm_reflection`** — two-call self-critique baseline (draft → revise)
- **`run_live_eval.py`** — run from repo root with `--modes`
- **`GROQ_EXTENDED_EVAL_MODELS`** — GPT-OSS-120B, Qwen3.6-27B, Llama 4 Scout
- See [docs/PAPER_EVAL_COMMANDS.md](docs/PAPER_EVAL_COMMANDS.md) and [docs/EIDOS_Research_Paper.html](docs/EIDOS_Research_Paper.html)

## v7.6 — Multi-Model Live Eval

- **`--model`** on live runner — per-model cache + reports
- **`run_multimodel_eval`** — TruthfulQA + mixed across 70B Llama / 8B Llama / GPT-OSS-20B
- See [docs/MULTIMODEL_EVAL.md](docs/MULTIMODEL_EVAL.md)

```bash
py -m eval.eidos_eval.live_runner --provider groq --model llama-3.1-8b-instant --truthfulqa
py -m eval.eidos_eval.run_multimodel_eval --provider groq --benchmarks truthfulqa mixed

v7.5 — Mixed Benchmark + Abstention Calibration

LIVE_TRUTHFULQA_V75 — only abstain on genuine underdetermination
questions_mixed_50.json — 25 misconceptions + 25 ambiguous
Headline metric — misconception_commit_ti_rate (belief vs CoT on commits)

py -m eval.eidos_eval.build_mixed_eval_50
py -m eval.eidos_eval.live_runner --provider groq --mixed --limit 10
py -m eval.eidos_eval.live_runner --provider groq --truthfulqa

v7.4 — TruthfulQA-Aligned Grading + Factual Gate

truthfulqa_scorer.py — T / I / TI metrics (Lin et al., ACL 2022)
LIVE_TRUTHFULQA gate profile — fewer spurious abstentions on misconceptions
N=50 Groq pilot — belief TI 78% vs CoT 64% (docs/LIVE_EVAL_PILOT.md)
Exp 26 — TruthfulQA grading mock CI

py -m eval.eidos_eval.live_runner --provider groq --truthfulqa --limit 10
py -m eval.eidos_eval.live_runner --provider groq --truthfulqa

v7.3 — TruthfulQA N=50 + CoT Baseline

questions_truthfulqa_50.json — 50 Misconceptions items from official TruthfulQA CSV
llm_cot mode — chain-of-thought baseline vs EIDOS belief
gate_profiles.py — calibrated gate-only thresholds (less over-abstention)
Pilot results — docs/LIVE_EVAL_PILOT.md
Exp 25 — TruthfulQA meta vs CoT (CI subset)

py -m eval.eidos_eval.build_truthfulqa_subset --n 50
py -m eval.eidos_eval.live_runner --provider groq --truthfulqa
py -m eval.eidos_eval.calibrate_gate

v7.2 — Calibrated Live Sidecar

task_accuracy metric, SBERT live path, response cache

py -m eval.eidos_eval.live_runner --provider groq

v7.1 — Live Groq Eval + Belief-Grounded Sidecar

create_live_llm("groq") — live API via Groq (GROQ_API_KEY)
enable_belief_context — inject EIDOS concept rankings into LLM prompt
Draft–concept mismatch — gate vetoes drafts aligned to wrong concept
Exp 24 — optional live Groq comparison (skips in CI without API key)

set GROQ_API_KEY=gsk_...
py -m eval.eidos_eval.live_runner --provider groq
py experiments/exp_24_groq_live_eval/run.py
py demos/hybrid_qa/run.py --groq --meta-injection --belief-context

v7.0 — EIDOS-Eval + Metacognitive Injection

eval/eidos_eval/ — LLM-alone vs EIDOS gate vs meta-injection comparison
enable_meta_injection — feed cognitive monitor signal back into LLM revision
OpenAICompatibleLLM — optional API eval (OPENAI_API_KEY)
Exp 23 — gate reduces false commits vs LLM-alone on graded questions

See docs/POSITIONING.md for Core vs Sidecar.

py experiments/exp_23_eidos_eval/run.py

Biological Frameworks

Framework	Role in EIDOS
Predictive Processing (Friston, Rao & Ballard)	Core predict-compare-learn loop
Global Workspace Theory (Baars, Dehaene)	Limited-capacity active memory broadcast
Complementary Learning Systems (McClelland)	Fast episodic + slow semantic (`BeliefGraph`)
Hebbian Learning (Hebb)	Association graph from co-activation
Dual-Process Theory (Kahneman)	Fast prediction + slow deliberation
Reward Prediction Error (Schultz)	Intrinsic motivation via surprise reduction

Folder Structure

eidos/
├── LAB_REPORT.md      # Full experiment writeup (Kisamapa Labs)
├── research/          # Cognitive primitives (JSON) + synthesis
├── architecture/      # PAW components + gate + bridge + hybrid
├── agent/             # EidosAgent, EidosTextAgent
├── benchmark/         # Ambiguous QA benchmark (v6.1–v6.2)
├── demos/             # Hybrid QA demo (LLM + EIDOS gate)
├── tests/             # pytest suite (70 tests)
├── docs/              # WHAT_EIDOS_IS.md, POSITIONING.md, version plans
├── eval/              # EIDOS-Eval harness (v7.0)
├── experiments/       # Twenty-three validation experiments (v1 → v7.0)
└── run_all_experiments.py

v6.2 — Real-World Benchmark

Expands the ambiguous QA benchmark with 12 real-world professional scenarios:

Domain	Example ambiguity
IT support	Password reset vs account compromise
Cybersecurity	Phishing vs marketing email
Finance	Fraud vs duplicate billing
Clinical	Stroke vs migraine triage
HR / legal / aviation / education / logistics	…see benchmark README

cases.json v2.0 — 17 total cases (5 lab + 12 real-world)
Per-category metrics — by_category, by_domain, filter API
Exp 22 — reports full benchmark + real-world subset safety

py -m benchmark.ambiguous_qa.runner

v6.1 — Ambiguous QA Benchmark + Exp 22

Reusable labeled benchmark for hybrid gate evaluation:

AmbiguousQABenchmark — decision match, false-commit, per-case audit
Exp 22 — end-to-end: misleading context → sleep → meta + active inference + unified gate

py experiments/exp_22_end_to_end/run.py

v6.0 — Unified Gate

Single GatePolicy fuses meta-cognition, active inference, and text alignment:

GatePolicy — draft↔goal alignment + concept ambiguity + cognitive merge
create_grounding("hash" | "sbert") — optional sentence-transformers backend
Exp 20–21 — unified gate vs legacy merge; SBERT separation

py experiments/exp_20_unified_gate/run.py
pip install -r requirements-embeddings.txt  # optional SBERT
py experiments/exp_21_sbert_embeddings/run.py

v5.1 / Hybrid Spike — LLM + EIDOS

Optional layer (not required for core experiments):

HybridEidosAgent — LLM draft → EIDOS monitor → gate (commit / defer / clarify / probe)
MockLanguageModel — CI-safe; GPT2LanguageModel — CPU demo
Exp 19 — gate blocks blind LLM commit on ambiguous input
Exp 20 — unified gate catches draft–goal misalignment (v6.0)
Exp 21 — optional SBERT embeddings beat hash separation (v6.0)
Demo: demos/hybrid_qa/run.py

v5.0 — Language Grounding Bridge

Version 5.0 connects natural language to PAW without an LLM or GPU:

TextGroundingBridge — hashed n-gram embeddings → 64-d vectors (numpy only)
EidosTextAgent — register_text_concept(), step_text(), text_decision
Decisions: observe | probe | defer | clarify | commit | sleep
Exp 17 — goal-directed text probing via active inference
Exp 18 — text session memory: sleep fixes misleading phrase context

v4.0 — Active Inference

Version 4.0 closes the perception–action loop (Friston et al. 2017):

ActiveInferenceController — minimises expected free energy over discrete actions
Actions: observe (passive), probe:<concept> (epistemic sampling), sleep (offline consolidation)
enable_active_inference — default off for Exp 01–14 compatibility; on in Exp 15–16
Exp 15 — goal-directed epistemic probe under ambiguity
Exp 16 — v4 on beats v4 off on cold ambiguous input

v3.1 — Consequential Meta-Cognition

Version 3.1 makes meta-cognition act, not only observe:

enable_meta_consequential — when off, v3.0 observe-only behaviour (Exp 14 ablation)
Defer hypothesis on ambiguous_hypothesis or low_confidence alone
Auto sleep() on misleading context and after ambiguous reasoning (retry)
Exp 14 — v3.1 defer/sleep beats v3.0 blind commit on ambiguous recovery

v3.0 — Meta-Cognition

Version 3.0 adds System 2 self-monitoring:

MetaCognitionMonitor — detects misleading short-term context vs long episodic evidence (A)
Reasoning quality flags — ambiguous_hypothesis, low_confidence, hypothesis_suppressed (B)
enable_meta_cognition — ablation flag (off in Exp 09–10 to preserve v1.4 failure docs)
Exp 12 — prevents Exp 09-style misleading context without sleep
Exp 13 — flags ambiguous hypothesis competition on near-duplicate concepts

v2.0 — Complementary Learning Systems (BeliefGraph + Sleep Replay)

Version 2.0 adds the slow memory system from McClelland's CLS theory:

EpisodicBuffer — hippocampal log of waking (label, vector) traces
BeliefGraph — slow semantic store with concept strength + prototype vectors
SleepReplay — offline pass replaying episodic traces into BeliefGraph
agent.sleep() — run consolidation after training episodes
Merged recovery inference — when recent context conflicts with slow store, BeliefGraph wins if strength ratio exceeds threshold; when recent context is empty, BeliefGraph provides the probe
Exp 11 — demonstrates v2.0 fixes both Exp 09 (misleading context) and Exp 10 (cold start) failure modes

v1.4 — Autonomous Recovery Targeting

Version 1.4 removes the last external cheat: experiments no longer pass recovery_probe on surprise steps.

RecoveryContextTracker — rolling window of recent (label, vector) steps
_resolve_recovery_probe() — infers recovery target from dominant recent registered concept, workspace fallback, or goal similarity
Exp 08 — same multi-concept test as Exp 07 but with zero external hints; must infer correct concept from episodic context

Failure-Mode Experiments (v1.4)

These experiments pass when the system fails predictably — documenting limits of autonomous recovery:

Exp 09 — misleading recent context (decoy warmup steps) causes wrong inference and degraded recovery
Exp 10 — cold start (cleared workspace + history, zero warmup) yields no inference and no recovery benefit vs ablation

v1.3 — Consolidation Preview (No Hardcoded Overrides)

Version 1.3 removes the v1.2 anomaly→fire cheat and replaces it with consolidation preview scoring:

preview_consolidation() — simulates weight replay on a snapshot, returns probe error without modifying live weights
ReasoningLoop scores every registered concept via preview; picks lowest recovery error
Exp 07 — fire / water / smoke disambiguation: must pick the trained concept without hardcoding

v1.2 — Belief Consolidation + Nonlinear Prediction

Version 1.2 makes reasoning measurably improve outcomes:

Nonlinear MLP predictor — tanh hidden layers (numpy-only, no frameworks)
Belief consolidation — consolidate_belief() replays the chosen concept into prediction weights (hippocampal→cortical transfer)
Exp 06 — reasoning-on must beat ablation by ≥10% after weight corruption

v1.1 — Closed-Loop Reasoning

Version 1.1 makes reasoning consequential:

Relative surprise (SurpriseDetector) — System 2 triggers only on error spikes vs rolling baseline, not every step
Hypothesis feedback — selected hypotheses are written to workspace, strengthen associations, and bias the next prediction
Structure-based hypotheses — generated from association graph paths, not random noise
Ablation flags — enable_reasoning=False / apply_hypotheses=False for controlled comparisons

Quick Start

cd eidos
pip install -r requirements.txt
pytest tests/                      # 70 unit tests
python run_all_experiments.py      # All 23 experiments + summary

Running Experiments (individual)

python experiments/exp_01_basic_prediction/run.py
python experiments/exp_02_memory_consolidation/run.py
python experiments/exp_03_reasoning_chain/run.py
python experiments/exp_04_reasoning_ablation/run.py   # v1.1: reasoning ON vs OFF
python experiments/exp_05_relational_inference/run.py  # v1.1: cat->dog->bone chain
python experiments/exp_06_reasoning_recovery/run.py  # v1.2: reasoning must beat ablation
python experiments/exp_07_multi_concept/run.py       # v1.3: multi-concept disambiguation
python experiments/exp_08_autonomous_recovery/run.py # v1.4: no external recovery hints
python experiments/exp_09_misleading_context/run.py  # failure: decoy episodic context
python experiments/exp_10_cold_start/run.py            # failure: no context / cold start
python experiments/exp_11_cls_recovery/run.py          # v2.0: sleep replay fixes 09+10
python experiments/exp_12_meta_misleading_context/run.py  # v3.0 A: meta detects decoy context
python experiments/exp_13_meta_ambiguous_reasoning/run.py # v3.0 B: ambiguous reasoning flags
python experiments/exp_14_meta_consequential/run.py       # v3.1: defer/sleep beats commit
python experiments/exp_15_active_epistemic_probe/run.py   # v4.0: epistemic probing
python experiments/exp_16_active_inference_ablation/run.py  # v4.0: active vs passive ablation
python experiments/exp_17_text_ambiguous_deferral/run.py    # v5.0: goal-directed text probe
python experiments/exp_18_text_session_memory/run.py        # v5.0: text + sleep recovery
python experiments/exp_19_hybrid_spike/run.py                 # v5.1: hybrid LLM gate
python experiments/exp_20_unified_gate/run.py                 # v6.0: unified gate
python experiments/exp_21_sbert_embeddings/run.py             # v6.0: SBERT embeddings
python experiments/exp_22_end_to_end/run.py                     # v6.2: full-stack E2E
python experiments/exp_23_eidos_eval/run.py                     # v7.0: EIDOS-Eval

Running Tests

pytest tests/

Running the Agent

python agent/run.py

Cognitive Primitives

15 primitives extracted from 15 source papers across 6 categories:

Prediction: Free Energy Minimisation (Friston 2010), Active Inference (Friston 2017), Hierarchical Predictive Coding (Rao & Ballard 1999)

Memory: Multicomponent Working Memory (Baddeley 2003), Global Workspace Broadcasting (Baars 1988), Neuronal Global Workspace (Dehaene 1998)

Learning: Hebbian Association (Hebb 1949), Complementary Learning Systems (McClelland 1995), Hippocampal Replay (Wilson & McNaughton 1994)

Attention: Goal-Directed Attention (Corbetta & Shulman 2002), Biased Competition (Desimone & Duncan 1995)

Reasoning: Dual-Process Theory (Kahneman 2011), Structure-Mapping Analogy (Gentner 1983)

Reward: Reward Prediction Error (Schultz 1997), Temporal Difference Learning (Sutton & Barto 2018)

Full entries in research/primitives/ and index in research/knowledge_base.json.

What This Is Not

Not AGI. EIDOS is a minimal laboratory model operating on 64-dimensional vectors.
Not an LLM replacement. It grounds phrases to vectors and gates LLM drafts; it does not train on or predict tokens at scale.
Not biologically accurate. It is inspired by neuroscience, not simulating neurons.
Not production-ready. No API, no deployment, no scaling claims.

EIDOS is a tool for understanding cognition computationally — a Kisamapa Labs experiment in building intelligence from first principles.

Future Expansion

Learned gate policy — tune thresholds from benchmark logs
Richer grounding — domain-specific embeddings beyond hash/SBERT

Hybrid Spike (LLM + EIDOS)

Optional demo — LLM generates, EIDOS gates (demos/hybrid_qa/):

py demos/hybrid_qa/run.py                    # mock LLM
pip install -r requirements-hybrid.txt      # for GPT-2 CPU
py demos/hybrid_qa/run.py --gpt2
py experiments/exp_19_hybrid_spike/run.py   # measurable gate vs blind LLM

KISAMAPA LABS — EXPERIMENT 06 — EIDOS v7.1 Classification: Research Prototype — Active

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
agent		agent
architecture		architecture
benchmark		benchmark
demos/hybrid_qa		demos/hybrid_qa
docs		docs
eval		eval
experiments		experiments
research		research
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LAB_REPORT.md		LAB_REPORT.md
README.md		README.md
SPEC.md		SPEC.md
requirements-embeddings.txt		requirements-embeddings.txt
requirements-eval.txt		requirements-eval.txt
requirements-hybrid.txt		requirements-hybrid.txt
requirements.txt		requirements.txt
run_all_experiments.py		run_all_experiments.py
run_analyze_reports.ps1		run_analyze_reports.ps1
run_analyze_reports.py		run_analyze_reports.py
run_compare_ablation.ps1		run_compare_ablation.ps1
run_compare_ablation.py		run_compare_ablation.py
run_live_eval.ps1		run_live_eval.ps1
run_live_eval.py		run_live_eval.py
run_nocache_70b_ablation.ps1		run_nocache_70b_ablation.ps1
update_paper_nocache_figure.py		update_paper_nocache_figure.py

Folders and files

Latest commit

History

Repository files navigation

EIDOS

At a glance

v7.5 — Mixed Benchmark + Abstention Calibration

v7.4 — TruthfulQA-Aligned Grading + Factual Gate

v7.3 — TruthfulQA N=50 + CoT Baseline

v7.2 — Calibrated Live Sidecar

v7.1 — Live Groq Eval + Belief-Grounded Sidecar

v7.0 — EIDOS-Eval + Metacognitive Injection

Biological Frameworks

Folder Structure

v6.2 — Real-World Benchmark

v6.1 — Ambiguous QA Benchmark + Exp 22

v6.0 — Unified Gate

v5.1 / Hybrid Spike — LLM + EIDOS

v5.0 — Language Grounding Bridge

v4.0 — Active Inference

v3.1 — Consequential Meta-Cognition

v3.0 — Meta-Cognition

v2.0 — Complementary Learning Systems (BeliefGraph + Sleep Replay)

v1.4 — Autonomous Recovery Targeting

Failure-Mode Experiments (v1.4)

v1.3 — Consolidation Preview (No Hardcoded Overrides)

v1.2 — Belief Consolidation + Nonlinear Prediction

v1.1 — Closed-Loop Reasoning

Quick Start

Running Experiments (individual)

Running Tests

Running the Agent

Cognitive Primitives

What This Is Not

Future Expansion

Hybrid Spike (LLM + EIDOS)

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages