Production-grade data infrastructure for AI safety research, timeline events, and funding information.
For Developers: 5-Minute Integration Guide
For Data Engineers: Complete Integration Guide
For Navigation: Documentation Index
pdoom-data provides a curated, validated, and production-ready data lake for AI safety information. Data flows through a three-zone architecture (raw → transformed → serveable) with automated pipelines, comprehensive validation, and full provenance tracking.
1,028 Timeline Events (28 manual + 1,000 alignment research)
- Hand-curated AI safety events (2016-2025)
- Automated alignment research extraction (2020-2022)
- Schema-validated with complete source attribution
- Organized by year, category, and rarity
Alignment Research Dataset (1,000+ records)
- Research papers, blog posts, forum discussions
- 30+ sources (ArXiv, Alignment Forum, LessWrong, EA Forum)
- Automated weekly extraction with delta detection
- Enriched with metrics and derived fields
Funding Data (In Progress)
- Survival and Flourishing Fund (SFF) grants
- Grant amounts, recipients, project descriptions
- Historical funding patterns
Status: Production Ready | Events: 1,028 | Schema: event_v1.json
Two datasets available:

- Manual Curated Events (28 events, 2016-2025)
  - Organizational crises, technical breakthroughs, funding events
  - Full source attribution and metadata
  - Files: `all_events.json`, `by_year/`, `by_category/`

- Alignment Research Events (1,000 events, 2020-2022)
  - Generated from the StampyAI Alignment Research Dataset
  - Research papers, forum posts, blog articles
  - Files: `alignment_research/alignment_research_events.json`, `by_year/`
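The year partitions make scoped loading cheap. A minimal sketch, assuming the `by_year/` files are named by year (`2024.json` here is an assumed naming scheme, not confirmed by the docs) and mirror the shape of their parent dataset:

```python
import json
from pathlib import Path

# Hypothetical partition filename: per-year naming is an assumption.
year_file = Path('data/serveable/api/timeline_events/by_year/2024.json')

with open(year_file) as f:
    data = json.load(f)

# all_events.json maps event id -> event, so handle both shapes defensively.
events_2024 = list(data.values()) if isinstance(data, dict) else data
print(f"Loaded {len(events_2024)} events for 2024")
```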
Use Cases:
- Game timeline system with event impacts
- Research dashboard and visualization
- Historical analysis of AI safety field
- Training data for AI safety models
Status: Weekly Updates | Records: 1,000+ | Sources: 30+
- Automated extraction from Hugging Face dataset
- Schema validation and quality checks
- Cleaning pipeline (deduplication, ASCII conversion)
- Enrichment pipeline (metrics, topics, safety relevance)
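A rough sketch of what the cleaning stage amounts to; `to_ascii` and `deduplicate` are illustrative helpers, not the repository's actual functions:

```python
import unicodedata

def to_ascii(text: str) -> str:
    # Decompose accented characters, then drop anything outside ASCII.
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('ascii', 'ignore').decode('ascii')

def deduplicate(records: list[dict], key: str = 'id') -> list[dict]:
    # Keep the first record seen for each key value.
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique
```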
Status: In Development | Sources: SFF, Open Philanthropy
- Grant amounts and recipients
- Project descriptions and outcomes
- Funding patterns over time
```
RAW ZONE                   TRANSFORMED ZONE                 SERVEABLE ZONE
(Immutable)                (Validated/Cleaned/Enriched)     (Production-Ready)

data/raw/                  data/transformed/                data/serveable/
├── events/                ├── validated/                   ├── api/
├── alignment_research/    ├── cleaned/                     │   └── timeline_events/
└── funding/               └── enriched/                    └── analytics/
```
Pipeline Stages:
- Raw: Immutable source data with checksums
- Validated: Schema-validated against JSON schemas
- Cleaned: Deduplicated, normalized, ASCII-compliant
- Enriched: Derived fields, metrics, categorization
- Serveable: Optimized for consumption (indexed, formatted)
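The validation stage checks each record against `config/schemas/event_v1.json`. A minimal sketch of that check from the consuming side, assuming the `jsonschema` package is installed:

```python
import json
from jsonschema import validate, ValidationError

with open('config/schemas/event_v1.json') as f:
    schema = json.load(f)

# all_events.json maps event id -> event record.
with open('data/serveable/api/timeline_events/all_events.json') as f:
    events = json.load(f)

for event_id, event in events.items():
    try:
        validate(instance=event, schema=schema)
    except ValidationError as err:
        print(f"{event_id}: {err.message}")
```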
Automation: GitHub Actions runs pipeline on data changes
See DATA_ZONES.md for architecture details.
pdoom1-website (PostgreSQL + FastAPI):

```bash
git submodule add https://github.com/PipFoweraker/pdoom-data.git data/pdoom-data
python scripts/import_events.py  # Import 1,028 events to PostgreSQL
# API endpoint: GET /api/events?year=2024&category=technical_research_breakthrough
```

pdoom (Godot Game):

```bash
cp pdoom-data/data/serveable/api/timeline_events/*.json res://data/events/
# Load events with EventLoader.gd, apply impacts to game variables
```

pdoom-dashboard (React/TypeScript):

```typescript
const { events } = useEvents({ year: 2024, category: 'technical_research_breakthrough' });
// Display interactive timeline with filtering
```

See QUICK_START_INTEGRATION.md for complete setup guides.
- Rigorous sourcing with complete attribution
- JSON Schema validation on all datasets
- ASCII-only encoding for universal compatibility
- Comprehensive extraction and transformation logs
- Idempotent pipelines (safe to re-run)
- Full version control and lineage tracking
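ASCII compliance is easy to re-verify from the consuming side (`str.isascii` needs Python 3.7+):

```python
from pathlib import Path

path = Path('data/serveable/api/timeline_events/all_events.json')
text = path.read_text(encoding='utf-8')

if text.isascii():
    print(f"{path.name}: ASCII-compliant")
else:
    # Report the first offending character and its offset.
    pos = next(i for i, ch in enumerate(text) if not ch.isascii())
    print(f"{path.name}: non-ASCII {text[pos]!r} at offset {pos}")
```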
| Metric | Value |
|---|---|
| Total Events | 1,028 |
| Schema Validation Pass Rate | 100% |
| ASCII Compliance | 100% |
| Source Attribution Complete | 100% |
| Duplicate Records | 0 |
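These figures are straightforward to re-check locally. For example, the zero-duplicates claim, assuming each record carries a unique `id` field (as the SQL quickstart example later in this README suggests):

```python
import json

path = 'data/serveable/api/timeline_events/alignment_research/alignment_research_events.json'
with open(path) as f:
    events = json.load(f)  # a JSON array of event records

ids = [e['id'] for e in events]  # 'id' field assumed from the schema
print(f"{len(ids)} records, {len(ids) - len(set(ids))} duplicate ids")
```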
1. Weekly Data Refresh (weekly-data-refresh.yml)
- Extracts new alignment research every Monday at 2am UTC
- Validates extracted data
- Commits to repository
2. Automated Pipeline (data-pipeline-automation.yml)
- Triggers on raw data changes
- Runs full pipeline: validate -> clean -> enrich -> transform -> manifest
- Auto-commits processed data
3. Documentation CI (documentation-ci.yml)
- Validates documentation quality
- Checks ASCII compliance
- Tests JSON validity
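The weekly refresh only commits records it has not seen before. A simplified sketch of that delta detection, with a hypothetical state file (the pipeline's real bookkeeping may differ):

```python
import json
from pathlib import Path

# Hypothetical state file tracking ids seen in previous runs.
known_path = Path('data/raw/alignment_research/known_ids.json')
known = set(json.loads(known_path.read_text())) if known_path.exists() else set()

def delta(extracted: list[dict]) -> list[dict]:
    # Keeping only unseen ids is what makes re-running the
    # extraction idempotent.
    return [rec for rec in extracted if rec['id'] not in known]
```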
Interactive browser for reviewing and annotating events
Open tools/event_browser.html in your browser to:
- Browse and filter 1,000+ timeline events
- Add custom metadata for game integration
- Tag events by impact level and game relevance
- Export metadata for use in pdoom1 and pdoom1-website
See EVENT_BROWSER_GUIDE.md for complete documentation.
| Document | Purpose | Audience |
|---|---|---|
| QUICK_START_INTEGRATION.md | 5-minute integration | Developers |
| INTEGRATION_GUIDE.md | Complete integration docs | Developers |
| EVENT_SCHEMA.md | Timeline event schema | Developers |
| EVENT_BROWSER_GUIDE.md | Interactive event browser | Curators, Game Designers |
| DATA_ZONES.md | Architecture overview | Engineers |
| RUNBOOK.md | Operations guide | Operators |
| DOCUMENTATION_INDEX.md | All documentation | All |
See CROSS_REPO_INTEGRATION_ISSUES.md for ready-to-use GitHub issues to create in consuming repositories.
```
pdoom-data/
├── data/
│   ├── raw/                      # Immutable source data
│   │   ├── events/               # Manual curated events
│   │   ├── alignment_research/   # Research dataset
│   │   └── funding/              # Funding data
│   ├── transformed/              # Processed data
│   │   ├── validated/            # Schema-validated
│   │   ├── cleaned/              # Normalized, deduplicated
│   │   └── enriched/             # With derived fields
│   └── serveable/                # Production-ready
│       ├── MANIFEST.json         # Complete data catalog
│       └── api/
│           └── timeline_events/  # 1,028 events ready for use
├── tools/
│   └── event_browser.html        # Interactive event browser (open in browser)
├── config/
│   └── schemas/                  # JSON schemas
│       └── event_v1.json         # Timeline event schema
├── scripts/
│   ├── analysis/                 # Event analysis tools
│   ├── transformation/           # Data pipeline scripts
│   ├── validation/               # Schema validation
│   ├── publishing/               # Manifest generation
│   └── logging/                  # (Future) Log consolidation
├── docs/                         # Comprehensive documentation
└── .github/workflows/            # Automation
```
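MANIFEST.json catalogs everything in the serveable zone. Its exact shape is not documented here, so a safe first step is to inspect its top-level keys before relying on any field names:

```python
import json

with open('data/serveable/MANIFEST.json') as f:
    manifest = json.load(f)

# Inspect the catalog's structure before hard-coding field names.
print(list(manifest.keys()))
```

The quickstart below then loads the two event datasets directly.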
```python
import json
from pathlib import Path

BASE = Path('data/serveable/api/timeline_events')

# Load manual events (all_events.json maps event id -> event)
with open(BASE / 'all_events.json') as f:
    manual_events = list(json.load(f).values())

# Load alignment research events (a JSON array)
with open(BASE / 'alignment_research/alignment_research_events.json') as f:
    research_events = json.load(f)

all_events = manual_events + research_events
print(f"Loaded {len(all_events)} events")

# Filter by year
events_2024 = [e for e in all_events if e['year'] == 2024]
print(f" {len(events_2024)} events in 2024")
```

```sql
-- After importing to PostgreSQL
SELECT id, title, year, rarity
FROM events
WHERE category = 'technical_research_breakthrough'
  AND year >= 2020
ORDER BY year DESC, rarity DESC
LIMIT 10;
```

```gdscript
# EventLoader.gd (Godot 3.x File and JSON APIs)
var events = []

func _ready():
    var file = File.new()
    file.open("res://data/events/all_events.json", File.READ)
    var json = file.get_as_text()
    file.close()
    var result = JSON.parse(json)
    if result.error == OK:
        # all_events.json maps event id -> event, so iterate the keys
        for event_id in result.result:
            events.append(result.result[event_id])
    print("Loaded ", events.size(), " events")
```

Completed:
- Three-zone data lake architecture
- Timeline event schema and validation
- Alignment research integration (1,000 events)
- Automated weekly data extraction
- Complete transformation pipeline
- Serveable zone with manifest
- Integration documentation
- GitHub Actions automation
Planned:
- Public communication strategy implementation
- Logs consolidation and public blog
- Additional funding data sources
- Data quality dashboard
- Public data portal (web UI)
- API documentation auto-generation
- Community contribution guide
- Data visualization toolkit
- Machine learning training datasets
See GitHub Issues for detailed tracking.
MIT License - Free for educational, research, and commercial use.
Alignment Research Dataset:
- Source: StampyAI Alignment Research Dataset
- License: Various (see individual records)
- Attribution: Full source URLs included in each record

Manual Curated Events:
- Curated by: pdoom-data team
- Sources: Public announcements, news articles, organizational updates
- Attribution: Complete source lists in each event

Funding Data:
- Sources: Survival and Flourishing Fund, Open Philanthropy (planned)
- Attribution: Links to original grant databases
This repository is currently private during active development. A publishing workflow is configured to sync the serveable zone to a future public repository once data transformation pipelines are complete.
See docs/DATA_PUBLISHING_STRATEGY.md for details on the planned public data release strategy.
This repository is currently in active development. For questions or suggestions:
- Check DOCUMENTATION_INDEX.md
- Review existing GitHub Issues
- Open a new issue with your question or proposal
This repository maintains strict ASCII-only content for agent compatibility. See ASCII_CODING_STANDARDS.md and DEVELOPMENT_WORKFLOW.md.
Documentation: docs/DOCUMENTATION_INDEX.md
Quick Start: docs/QUICK_START_INTEGRATION.md
Issues: GitHub Issues
Integration Help: See issue templates in CROSS_REPO_INTEGRATION_ISSUES.md
Last Updated: 2025-11-24
Maintained by: pdoom-data team
Version: 0.2.0 (In Development)