-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the Collaborative-AGI-Development---Bridging-Architectures-and-Execution wiki!
The test suite operationalizes governance-first validation for AGI systems. It evaluates emotional resonance, epistemic integrity, and ethical constraint adherence across three core domains.
- Overview
- Task Categories
- Validation Schema
- Harness Execution
- Evaluation Metrics
- Design Philosophy
- Repository
This suite benchmarks AGI systems not by scale, but by conscience. It invites frameworks like SIM-ONE to engage in transparent, adversarial validation rooted in emotional salience and ethical traceability.
- Conflicting claims from multiple sources (e.g., GDP, launch dates).
- Requires resolution and source prioritization.
- Tests hallucination resistance and truth anchoring.
- Financial filings across multiple years.
- Tasks include CAGR computation, debt trend analysis.
- Validates multi-hop reasoning and tool fluency.
- Scenarios: hiring policies, misinformation moderation.
- Explicit constraints (e.g., no demographic quotas).
- Tests ethical reasoning and constraint fidelity.
Defines structured outputs and traceable reasoning:
response: string
reasoning_trace: list[string]
tool_calls: list[dict]
ground_truth: string
metrics:
determinism_index: float
hallucination_rate: float
transparency_score: float
efficiency_score: float
constraint_adherence: boolRun the benchmark with:
python run_harness.py --tasks tasks.jsonl --runs 10 --out results.jsonlReplace simulate_model_call() with your systemβs inference logic. Ensure outputs match the schema.
| Metric | Description |
|---|---|
| Determinism Index | Output stability across repeated runs |
| Hallucination Rate | Unsupported claims vs. ground truth |
| Transparency Score | Presence of step-by-step reasoning |
| Efficiency Score | Latency and token cost |
| Constraint Adherence | Semantic + regex validation of ethical fidelity |
| Error Recovery Pattern | Adaptive response to ambiguity or near-violations |
- Minimal by design: Easy to inspect, extend, and reproduce.
- Ground truths as anchors: Not absolute, but evaluative.
- Validation as stewardship: Each task is a lesson, not just a challenge.
Explore the full suite:
Raiffs-bits/Collaborative-AGI-Development---Bridging-Architectures-and-Execution