A distributed, fault-tolerant realtime notification delivery platform designed for reliable event delivery, fanout scaling, and failure recovery in modern backend systems.
Modern applications require reliable delivery of realtime events under unstable network conditions, disconnected clients, and backend failures. This platform handles reliable event delivery to both online and offline users, manages connection lifecycles via WebSockets, and uses Redis Streams to guarantee message processing.
graph TD;
Client-->|POST /notifications|API;
API-->|XADD|RedisStreams[Redis Streams Notifications];
API-->|Insert Pending|PostgreSQL;
RedisStreams-->|XREADGROUP|StreamWorker[Stream Consumer Worker];
StreamWorker-->|Check Presence|RedisPresence[Redis Online Users Set];
StreamWorker-->|Publish to WS|RedisPubSub[Redis PubSub];
StreamWorker-->|Update Status|PostgreSQL;
StreamWorker-.->|Failed|RedisRetries[Redis ZSET Retries];
RedisPubSub-->|Listen|API_WS[API WS Gateway];
API_WS-->|Send|WebSockets;
RetryWorker-->|Poll|RedisRetries;
RetryWorker-->|Re-Process|StreamWorker;
RetryWorker-.->|Max Retries|DLQ[PostgreSQL DLQ];
We use Redis Streams (notifications:stream) with Consumer Groups to fan out the processing across horizontal worker pools. Fast ingestion combined with asynchronous, parallel processing ensures the API never blocks.
WebSockets are handled efficiently via FastAPI. Nodes maintain a local registry of connections and subscribe to a central Redis Pub/Sub channel. This allows the API to be completely stateless—if an event is destined for a user connected on Node A, Node B can publish it to Redis, and Node A will deliver it.
The DeliveryEngine utilizes PostgreSQL to maintain DeliveryAttempts.
- Transient failures in worker processing place events into a Redis Sorted Set (
ZSET) based on an exponential backoff timestamp. - A dedicated
RetryWorkerpolls this set and re-processes messages. - Once a message exceeds
MAX_RETRIES, it is permanently moved to theDLQEventtable for manual intervention.
When users are disconnected, messages are routed to an OfflineQueue in PostgreSQL. Clients can fetch their missed notifications on reconnect via GET /notifications/offline/{user_id} and acknowledge them upon receipt to clear the queue.
- Bounded ingestion via Redis Streams prevents the database from being overwhelmed.
- Exponential backoff prevents retry storms from overwhelming downstream APIs.
- Consumer Groups allow scaling consumers out horizontally as load increases.
-
Start the Platform
docker-compose up -d --build
-
Endpoints
- Ingest:
POST http://localhost:8000/notifications - WS Gateway:
ws://localhost:8000/ws/{user_id} - Offline:
GET http://localhost:8000/notifications/offline/{user_id}
- Ingest:
This platform is rigorously tested under load. Ensure you have k6 installed locally, then run the load tests against the running docker services.
# 1. High WebSocket Concurrency (simulates 500 active users pinging the WS gateway)
k6 run scripts/k6/websocket-storm.js
# 2. Large Fanout Delivery (Simulates sending an event to 100s of users at once)
k6 run scripts/k6/notification-fanout.js
# 3. Reconnect Storms (Simulates clients rapidly connecting and disconnecting)
k6 run scripts/k6/reconnect-chaos.js
# 4. Retry Amplification (Simulates failing events triggering retry logic)
k6 run scripts/k6/retry-storm.js