Transactional workflow engine designed to prevent state corruption in payment and finance systems.
Built to enforce exactly-once state transitions under concurrency.
Companion system: async-rag-ingestion-engine
Payment systems fail when state becomes ambiguous. This engine eliminates that failure mode.
Guarantees:
- States are FSM-validated (no arbitrary transitions)
- State changes are exactly-once (optimistic locking)
- Events are reliably published (transactional outbox)
- History is immutable (append-only audit log)
Built for correctness over convenience.
Handles:
- Strict state transitions via Finite State Machine
- Exactly-once state changes (DB transaction + version check)
- Reliable Kafka publishing (transactional outbox pattern)
- Full audit trail (immutable records)
Does NOT handle:
- Document ingestion (different failure mode)
- Vector storage (wrong bounded context)
- Retry-heavy pipelines (isolated to ingestion layer)
Separation prevents failure modes from bleeding across systems.
```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Service
    participant DB as PostgreSQL
    participant Outbox
    participant Publisher
    participant Kafka
    Client->>API: POST /transactions/{id}/transition
    API->>Service: Validate + Apply FSM
    Service->>DB: BEGIN TRANSACTION
    Service->>DB: UPDATE state (version check)
    Service->>Outbox: INSERT event
    Service->>DB: COMMIT
    DB-->>API: Success
    API-->>Client: 200 OK
    Note over DB,Outbox: Atomic Write
    Publisher->>Outbox: Poll unpublished events
    Publisher->>Kafka: Publish event
    Publisher->>Outbox: Mark published
```
Key invariant: State change and outbox write happen in one atomic transaction.
States cannot be set arbitrarily. Each transition is explicitly allowed or rejected.
```python
ALLOWED_TRANSITIONS = {
    "CREATED": ["PENDING"],
    "PENDING": ["COMPLETED", "FAILED"],
    "COMPLETED": [],        # Terminal state
    "FAILED": ["PENDING"],  # Allow retry
}
```

Examples:

- CREATED → PENDING ✅
- PENDING → COMPLETED ✅
- FAILED → COMPLETED ❌ (illegal, fails loudly)
Invalid transitions fail fast, preventing silent corruption.
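The transition table above can be enforced with a small guard. A minimal sketch (the exception and function names here are illustrative; the project's actual rules live in `src/domain/fsm.py`):

```python
# Allowed transitions, mirroring the table above.
ALLOWED_TRANSITIONS = {
    "CREATED": ["PENDING"],
    "PENDING": ["COMPLETED", "FAILED"],
    "COMPLETED": [],        # Terminal state
    "FAILED": ["PENDING"],  # Allow retry
}

class IllegalTransition(Exception):
    """Raised when a transition is not explicitly allowed."""

def validate_transition(current: str, target: str) -> None:
    # Fail loudly on any transition not present in the table.
    if target not in ALLOWED_TRANSITIONS.get(current, []):
        raise IllegalTransition(f"{current} -> {target} is not allowed")
```

Because the default for an unknown state is the empty list, an unrecognized current state rejects every transition rather than silently permitting one.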
Publishing directly to Kafka from request handlers is unsafe (dual-write problem: DB succeeds, Kafka fails → inconsistency).
Our approach:
- Begin DB transaction
- Apply state change
- Insert event into `outbox` table
- Commit transaction
- Background worker publishes to Kafka
Result: No lost events, no phantom events. DB and Kafka stay consistent.
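The atomic state-change-plus-outbox write can be sketched as follows. This uses `sqlite3` purely for illustration (the real system uses PostgreSQL), and the table schemas and names are assumptions, not the project's actual migrations:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
conn.execute("CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")
conn.execute("INSERT INTO transactions VALUES ('tx1', 'PENDING', 1)")
conn.commit()

def transition(conn, tx_id: str, new_state: str, expected_version: int) -> None:
    # One atomic transaction: the state update and the outbox insert
    # either both commit or both roll back.
    with conn:
        cur = conn.execute(
            "UPDATE transactions SET state = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_state, tx_id, expected_version),
        )
        if cur.rowcount == 0:
            raise RuntimeError("version conflict")  # surfaced as 409 by the API
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), json.dumps({"id": tx_id, "state": new_state})),
        )

transition(conn, "tx1", "COMPLETED", expected_version=1)
```

If the process crashes before the commit, neither write survives; if it crashes after, the background publisher finds the event in `outbox` and delivers it later.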
Concurrent updates use version checks:
```sql
UPDATE transactions
SET state = 'COMPLETED', version = version + 1
WHERE id = '...' AND version = 5;
```

Under contention:
- First request succeeds (version match)
- Second request fails (version mismatch → 409 Conflict)
Enforces single-winner semantics without heavyweight locks.
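Single-winner semantics can be demonstrated with two attempts carrying the same stale version. A sketch using `sqlite3` as a stand-in for PostgreSQL; the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
conn.execute("INSERT INTO transactions VALUES ('tx1', 'PENDING', 5)")
conn.commit()

def try_complete(expected_version: int) -> bool:
    # The WHERE clause only matches if no one else has bumped the version.
    cur = conn.execute(
        "UPDATE transactions SET state = 'COMPLETED', version = version + 1 "
        "WHERE id = 'tx1' AND version = ?",
        (expected_version,),
    )
    conn.commit()
    return cur.rowcount == 1  # False -> the API answers 409 Conflict

first = try_complete(5)   # version matches: this request wins
second = try_complete(5)  # version is now 6: stale request is rejected
```

No row locks are held between read and write; the version predicate alone decides the winner.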
All state-changing operations require the `X-Idempotency-Key` header.
On retry:
- Same key → same outcome (no duplicate state change)
- Same key → same event (no duplicate Kafka message)
- Safe to retry unconditionally
Prevents double-settlement in payment workflows.
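The retry behavior can be sketched as a lookup keyed on the idempotency key. The storage here is an in-memory dict for illustration; the real system's storage and response-caching details are assumptions:

```python
_seen: dict[str, dict] = {}  # idempotency key -> recorded outcome

def handle_transition(idempotency_key: str, apply):
    # A replayed key returns the stored outcome instead of
    # re-applying the state change or re-emitting the event.
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    outcome = apply()
    _seen[idempotency_key] = outcome
    return outcome

calls = []
result1 = handle_transition("key-1", lambda: calls.append(1) or {"state": "COMPLETED"})
result2 = handle_transition("key-1", lambda: calls.append(1) or {"state": "COMPLETED"})
```

The side effect runs exactly once even though the handler was invoked twice, which is what makes unconditional client retries safe.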
State records are immutable:
- No `UPDATE` statements on historical data
- Every change is timestamped
- Full replay capability for forensics
Enables compliance audits and deterministic debugging.
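An append-only log with deterministic replay can be sketched in a few lines; the record shape is an assumption for illustration:

```python
from datetime import datetime, timezone

audit_log: list[dict] = []

def record(tx_id: str, state: str) -> None:
    # Appends only; historical entries are never updated or deleted.
    audit_log.append({
        "tx_id": tx_id,
        "state": state,
        "at": datetime.now(timezone.utc).isoformat(),
    })

def replay(tx_id: str):
    # Reconstructs the latest state purely from history.
    states = [e["state"] for e in audit_log if e["tx_id"] == tx_id]
    return states[-1] if states else None

record("tx1", "CREATED")
record("tx1", "PENDING")
record("tx1", "COMPLETED")
```

Because state is derived from the log rather than mutated in place, an auditor can replay any transaction's full history and arrive at the same answer every time.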
| Scenario | Behavior | Outcome |
|---|---|---|
| API retry | Idempotency key prevents duplication | Safe |
| Service crash | DB transaction rolls back | No partial state |
| Kafka down | Events remain in outbox | Eventual publish |
| Concurrent updates | Optimistic lock rejects one | No corruption |
| Duplicate Kafka messages | Consumers use event_id to deduplicate | Safe |
Zero undefined states.
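The consumer-side deduplication assumed in the table above can be sketched as a check on `event_id`; the processed-ID store here is an in-memory set for illustration (a real consumer would persist it):

```python
processed: set[str] = set()

def handle_event(event: dict) -> bool:
    # Returns True if the event was applied, False if it was a duplicate.
    if event["event_id"] in processed:
        return False
    processed.add(event["event_id"])
    return True

events = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]
applied = [handle_event(e) for e in events]
```

Since the outbox assigns each event a stable `event_id`, a redelivered Kafka message carries the same ID and is dropped on arrival.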
```python
{
    "acks": "all",               # Wait for all replicas
    "enable.idempotence": True,  # Prevent duplicates
    "key": aggregate_id,         # Per-transaction ordering
}
```

Trade-off: duplicate Kafka messages are acceptable; duplicate state changes are impossible.
Simulates 50 concurrent requests trying to modify the same transaction:
```shell
python scripts/benchmark_state_safety.py
```

Result:

```
🚀 Starting Safety Benchmark: 50 concurrent requests...
🎯 Target: Single Wallet (wallet_8f3a...)
📊 Results:
✅ Successful Transitions: 1
🛡️ Blocked Conflicts (409): 49
❌ Other Errors: 0
✅ PASSED: Perfect safety. No double-spends detected.
```
Proves optimistic locking works under contention.
| Concern | Ingestion Engine | Transaction Engine |
|---|---|---|
| Failure mode | Partial/bad data | State corruption |
| Retry strategy | Automatic | Explicit (idempotent) |
| Consistency | At-least-once | Exactly-once (state) |
| Optimization | Throughput | Correctness |
| Scaling constraint | I/O bound | Validation/lock bound |
This split isolates failure domains and simplifies reasoning about edge cases.
- AI inference (wrong layer)
- Vector databases (different bounded context)
- Authentication (orthogonal concern)
- UI/dashboards (API-first design)
- Distributed sagas (not needed for single-DB transactions)
- "Exactly-once delivery" (theoretically impossible)
```shell
# Start infrastructure (Postgres, Redpanda, Redis)
docker compose up -d

# Run migrations
alembic upgrade head

# Start API
uvicorn src.main:app --reload

# Start outbox publisher (separate terminal)
python src/cmd/outbox_publisher.py
```

Start here to understand correctness guarantees:

- FSM rules: `src/domain/fsm.py`
- Optimistic locking: `src/adapters/repository.py`
- Outbox flow: `src/adapters/outbox.py`
- Safety benchmark: `scripts/benchmark_state_safety.py`
State correctness is a design problem, not a retry problem.
This system makes invalid states unrepresentable.