This system allows for monitoring multiple FastAPI applications through WebSocket heartbeats. It consists of a manager service that monitors worker applications and detects if any of them go down or stop sending heartbeats.
- Manager: Central service that listens for heartbeats and monitors worker status
- Workers: FastAPI applications that register with the manager and send periodic heartbeats
- Redis: Used for storing worker registration information and status
- WebSocket-based async heartbeat system
- Automatic worker registration
- Redis for persistent worker information
- Docker setup for easy deployment
- Health status monitoring with configurable timeout
The system follows this architecture:
- Workers register with the manager upon startup
- Each worker establishes a WebSocket connection to the manager
- Workers send periodic heartbeats through the WebSocket connection
- Manager processes heartbeats and updates worker status in Redis
- Manager periodically checks worker status and marks workers as "not_responding" if heartbeats stop
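The worker side of this flow can be sketched as a small asyncio loop. This is a minimal illustration, not the project's actual code: the `send_heartbeats` function, the message fields, and the `fake_send` stand-in for a WebSocket send call are all assumptions made for the example.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def send_heartbeats(
    send: Callable[[dict], Awaitable[None]],
    worker_id: str,
    interval: float,
    count: int,
) -> None:
    """Send `count` heartbeat messages, pausing `interval` seconds after each."""
    for _ in range(count):
        # Hypothetical message shape; the real payload format is not documented here.
        await send({"type": "heartbeat", "worker_id": worker_id, "timestamp": time.time()})
        await asyncio.sleep(interval)


async def demo() -> list:
    received = []

    async def fake_send(message: dict) -> None:
        # Stand-in for sending JSON over the WebSocket connection to the manager.
        received.append(message)

    await send_heartbeats(fake_send, "worker-1", interval=0.01, count=3)
    return received


messages = asyncio.run(demo())
print(len(messages))
```

Passing the send function in as a parameter keeps the loop independent of any particular WebSocket client, so it can be exercised without a running manager.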
Worker status information is stored in Redis using a Hash data structure:
- A single Redis Hash named `workers` contains all worker data
- Each worker's ID is used as a field in the hash
- The value for each field is a JSON string containing worker details, including:
  - Worker ID and name
  - Host and port
  - Current status (`registered`, `connected`, `alive`, `not_responding`, `disconnected`)
  - Last heartbeat timestamp
- Because every worker lives in one hash, a single worker can be looked up by ID with `HGET` and all workers can be read at once with `HGETALL`
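The layout described above might look roughly like the sketch below. The JSON field names (`id`, `name`, `host`, `port`, `status`, `last_heartbeat`) and the `make_record` helper are assumptions for illustration, and a plain dict stands in for the Redis hash so the example runs without a Redis server.

```python
import json
import time

WORKERS_KEY = "workers"  # hash name from the description above


def make_record(worker_id: str, name: str, host: str, port: int,
                status: str = "registered") -> dict:
    """Build a worker record; the field names here are hypothetical."""
    return {
        "id": worker_id,
        "name": name,
        "host": host,
        "port": port,
        "status": status,
        "last_heartbeat": time.time(),
    }


# A dict stands in for the Redis hash: field = worker ID, value = JSON string.
# With redis-py the equivalent calls would be
#   client.hset(WORKERS_KEY, worker_id, json.dumps(record))
#   client.hget(WORKERS_KEY, worker_id)
fake_hash = {}

record = make_record("worker-1", "Worker 1", "localhost", 8001)
fake_hash[record["id"]] = json.dumps(record)

loaded = json.loads(fake_hash["worker-1"])
print(loaded["status"])
```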
The easiest way to run the entire system is with Docker Compose:

    docker-compose up

This will start:
- 1 Redis instance
- 1 Manager service
- 4 Worker instances
If you want to run the components manually:
- Start Redis: `redis-server`
- Start the manager: `uvicorn manager:app --host 0.0.0.0 --port 8000`
- Start one or more workers (on different ports): `WORKER_NAME="Worker 1" WORKER_PORT=8001 uvicorn worker:app --host 0.0.0.0 --port 8001`

The system is configurable through environment variables:
- `REDIS_HOST`: Redis host (default: "localhost")
- `REDIS_PORT`: Redis port (default: 6379)
- `REDIS_DB`: Redis database number (default: 0)
- `HEARTBEAT_TIMEOUT`: Time in seconds after which a worker is considered down (default: 15)

- `WORKER_NAME`: Name of the worker (default: random name)
- `WORKER_PORT`: Port for the worker's FastAPI app (default: 8001)
- `MANAGER_HOST`: Host of the manager service (default: "localhost")
- `MANAGER_PORT`: Port of the manager service (default: 8000)
- `HEARTBEAT_INTERVAL`: Time in seconds between heartbeats (default: 5)
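One way to read these variables with their documented defaults is sketched below. The `HeartbeatSettings` class and `load_settings` function are hypothetical names introduced for this example, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for a subset of the documented variables."""
    redis_host: str
    redis_port: int
    redis_db: int
    heartbeat_timeout: int
    heartbeat_interval: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> HeartbeatSettings:
    """Read settings from `env` (defaults to os.environ), using the documented defaults."""
    env = os.environ if env is None else env
    return HeartbeatSettings(
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_db=int(env.get("REDIS_DB", "0")),
        heartbeat_timeout=int(env.get("HEARTBEAT_TIMEOUT", "15")),
        heartbeat_interval=int(env.get("HEARTBEAT_INTERVAL", "5")),
    )


defaults = load_settings({})
custom = load_settings({"HEARTBEAT_TIMEOUT": "30", "REDIS_HOST": "redis"})
print(defaults.heartbeat_timeout, custom.heartbeat_timeout)
```

With the defaults, the timeout (15 seconds) is three times the heartbeat interval (5 seconds), so a worker has to miss several consecutive heartbeats before it is considered down.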
API endpoints:
- `GET /workers`: List all registered workers and their status
- `POST /register`: Register a new worker
- `WebSocket /ws/{worker_id}`: WebSocket endpoint for worker heartbeats
- `GET /health`: Health check endpoint
To simulate a worker failure, you can stop one of the worker containers:

    docker-compose stop worker2

After the heartbeat timeout period (default 15 seconds), the manager will mark the worker as "not_responding".
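The timeout check behind this behaviour could be implemented along the lines of the sketch below. It is an assumption about the logic, not the manager's actual code: the `mark_not_responding` function and the `last_heartbeat` field name are made up for the example, and worker records are held in a plain dict rather than Redis.

```python
def mark_not_responding(workers: dict, now: float, timeout: float = 15.0) -> list:
    """Mark workers whose last heartbeat is older than `timeout` seconds.

    Returns the IDs of the workers whose status was changed.
    """
    changed = []
    for worker_id, record in workers.items():
        # Skip workers that disconnected cleanly or were already flagged.
        if record["status"] in ("disconnected", "not_responding"):
            continue
        if now - record["last_heartbeat"] > timeout:
            record["status"] = "not_responding"
            changed.append(worker_id)
    return changed


# Fixed timestamps make the example deterministic.
now = 1000.0
workers = {
    "worker1": {"status": "alive", "last_heartbeat": 995.0},  # 5 s ago
    "worker2": {"status": "alive", "last_heartbeat": 980.0},  # 20 s ago
}

stale = mark_not_responding(workers, now)
print(stale)  # ['worker2']
```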
The system uses Python's built-in logging module with the following features:
- Configurable log levels (default: INFO)
- Timestamp and component identification in log entries
- Different loggers for manager and worker components
- Structured logging of important events:
  - Worker registration/disconnection
  - WebSocket connection status
  - Heartbeat processing
  - Worker status changes
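A logging setup with these properties might be configured as in the sketch below. The format string, the `configure_logging` helper, and the logger names `manager` and `worker` are assumptions for illustration, not the project's confirmed configuration.

```python
import logging

# Timestamp, component (logger) name, level, then the message.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def configure_logging(level: str = "INFO") -> None:
    """Configure the root logger; unknown level names fall back to INFO."""
    numeric_level = getattr(logging, level.upper(), logging.INFO)
    # force=True replaces any handlers installed by an earlier call.
    logging.basicConfig(level=numeric_level, format=LOG_FORMAT, force=True)


configure_logging("INFO")

# Separate named loggers identify the component in each log entry.
manager_log = logging.getLogger("manager")
worker_log = logging.getLogger("worker")

manager_log.info("Worker %s registered", "worker-1")
worker_log.info("WebSocket connection established")
manager_log.debug("Suppressed at the default INFO level")
```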
When running in Docker, logs are directed to stdout/stderr and can be viewed with:

    docker-compose logs -f manager  # View manager logs
    docker-compose logs -f worker1  # View worker1 logs

Current monitoring options:
- API Endpoint: The manager provides a `/workers` endpoint that returns real-time status of all registered workers
- Container Health: Docker health checks are configured for Redis
The system is designed to be extended with more advanced monitoring:
- Prometheus Integration: Metrics endpoints could be added to expose:
  - Worker registration counts
  - Heartbeat latency
  - Worker uptime statistics
  - Status change events
- Alerting: Integration with alert managers for notification when workers go down
- Dashboard: A simple web UI could be added to visualize worker status