A Go-based task scheduling system that accepts jobs over HTTP, stores state in Postgres, and fans work out to a Redis-backed, priority-aware worker pool. Prometheus metrics, alert rules, and a Grafana dashboard are included so you can observe queue health, throughput, retries, and handler latency out of the box.
⚠️ Work In ProgressThis project is under active development. Interfaces, environment variable names, metrics, and dashboard queries may change without notice. Known rough edges / TODOs include:
- Retry & DLQ handling: logic is functional but may evolve (recording richer failure metadata, bulk retry tooling, DLQ reprocessing CLI).
- Grafana dashboard: current panels use longer PromQL windows; a “realtime” dashboard variant and short-window panels are planned.
- Logging: plain
log.Printf
(no structured/JSON logger); worker and load-generator logs can be noisy at high throughput.- Backoff configuration: defaults exist, but advanced per-task tuning and jitter strategies are not finalized.
- Operational hardening: no auth, no rate limits, no multi-tenant isolation yet.
- Test coverage: integration & property tests for queue/retry pathways still to be added.
- Priority queues (
critical
,high
,default
,low
) with Lua-assisted fairness. - Worker pools powered by goroutines; scale horizontally by running more processes.
- Visibility timeout plus reaper for at-least-once delivery guarantees.
- Exponential backoff with jitter, retries, and per-priority dead-letter queues.
- Idempotency keys and Postgres persistence for reliable delivery semantics.
- Prometheus metrics, alert rules, and a Grafana dashboard for live observability.
- Stress-test load generator with a configurable workload mix.
- API service (
cmd/api
) – Persists incoming tasks (with optional idempotency keys) and enqueues lightweight messages to Redis. - Redis queue (
internal/queue
) – Maintains four priority lists plus processing, retry, and dead-letter collections. Lua scripts ensure the next ready payload is popped atomically across priorities. - Worker pool (
cmd/worker
,internal/worker
) – Leases work with a visibility timeout, invokes handlers, manages exponential backoff retries, re-enqueues timed-out payloads, and records task status in Postgres. - Postgres (
migrations/
) – Holds the authoritative task record, attempts, idempotency mapping, and per-run history. - Observability – Prometheus metrics, alert rules (
alerts.yml
), and a Grafana dashboard (grafana/dashboards/observability.json
).
cmd/api
– HTTP API for enqueueing tasks and exporting metrics.cmd/worker
– Runnable worker with pool, retry pump, and reaper.cmd/load
– Load generator for stress testing.internal/config
– Environment-driven configuration with sane defaults.internal/queue
– Redis list/ZSET operations and queue metadata helpers.internal/worker
– Pool implementation, backoff math, reaper, retry pump, and metrics sampling.internal/metrics
– Prometheus counters, histograms, and gauges.internal/db
– Postgres connection helpers.migrations/
– SQL schema migrations.docker-compose.yml
– Postgres, Redis, Prometheus, Grafana dev stack.
- Go 1.23+
- Docker and GNU Make
- Bring up infra
make dev
- Apply migrations (run both scripts;
make migrate
streams001_init.sql
)make migrate docker exec -i $(docker ps -qf name=postgres) psql -U app -d app < migrations/002_idempotency.sql
- Run API – HTTP
:8080
, metrics:2112
make api
- Run worker – Metrics exposed on
:2113
make worker
- Enqueue a task
curl -s -X POST localhost:8080/tasks \ -H 'Content-Type: application/json' \ -d '{ "type": "noop", "priority": "high", "payload": {"example": true}, "idempotency_key": "demo-123" }'
- Generate load (optional)
go run ./cmd/load -n 2000 -w 8 -pct-flaky 15 -pct-fail 5
Scale tip: override the environment knobs in the Makefile when you need more throughput or resilience. For example, run more processes or bump WORKER_POOL
, RETRY_*
, and other settings inline (WORKER_POOL=24 RETRY_BATCH_SIZE=250 make worker
). The configuration table below lists every tunable value.
Prometheus scrapes the API (host.docker.internal:2112
) and worker (host.docker.internal:2113
) every second. Grafana runs at http://localhost:3000 with the dashboard pre-provisioned.
All binaries read from the environment via internal/config/config.go
. Defaults aim for local development.
Variable | Default | Description |
---|---|---|
POSTGRES_URL |
postgres://app:app@localhost:5432/app?sslmode=disable |
Postgres DSN. |
REDIS_ADDR |
localhost:6379 |
Redis address. |
REDIS_PASS |
(empty) | Redis password if required. |
METRICS_ADDR |
:2112 |
Prometheus metrics bind address (set to :2113 for the worker via Makefile). |
WORKER_POOL |
12 |
Goroutines per worker process. |
PGPOOL_MAX_CONNS |
20 |
Max Postgres connections per worker. |
REDIS_POOL_SIZE |
50 |
Redis client pool size. |
REDIS_MIN_IDLE |
10 |
Minimum idle Redis connections. |
VIS_TIMEOUT_MS |
60000 |
Visibility timeout in milliseconds. |
REAPER_SCAN_INTERVAL_MS |
2000 |
Reaper cadence in milliseconds. |
REAPER_BATCH_SIZE |
100 |
Reaper batch size. |
METRICS_SAMPLE_MS |
500 |
Gauge sampling interval in milliseconds. |
RETRY_BASE_MS |
500 |
Retry base backoff. |
RETRY_MAX_MS |
30000 |
Retry max backoff. |
RETRY_JITTER_MS |
500 |
Backoff jitter. |
RETRY_SCAN_INTERVAL_MS |
500 |
Retry pump cadence in milliseconds. |
RETRY_BATCH_SIZE |
100 |
Retry pump batch size. |
- Enqueue – API validates input, persists the task, records idempotency metadata, and pushes a lean payload to
q:<priority>
in Redis. - Dispatch – Worker pops the next ready task across priorities via Lua and places it on the processing list with a visibility deadline.
- Execution – Handler runs with the unmarshalled payload (
noop
,flaky
,always_fail
,sleeper
, or custom logic). - Success – Worker ACKs the payload, clears the deadline, marks Postgres
status='succeeded'
, and emits success metrics. - Failure – Attempts and max retries are read from Postgres. Tasks retry with exponential backoff plus jitter until max retries, otherwise they land in the DLQ.
- Reaper & Retry Pump – Background goroutines reclaim expired leases and requeue retry-eligible tasks while keeping gauges current.
Counters – tasks_enqueued_total
, tasks_dequeued_total{priority}
, tasks_succeeded_total{type,priority}
, tasks_failed_total{type,priority}
, tasks_retried_total
, tasks_reaped_total
, acks_total
.
Histograms – task_run_seconds{type}
, enqueue_to_start_seconds{priority}
.
Gauges – queue_depth{priority}
, retry_set_depth{priority}
, dlq_depth{priority}
, processing_inflight
.
Alert examples in alerts.yml
cover queue saturation, elevated failure rate, and slow p95 handler latency. The Grafana dashboard visualizes queue depth, throughput, success vs failure, retry/DLQ size, handler latency, and in-flight processing.
- Ensure the dev stack is running (
make dev
starts Grafana alongside Postgres/Redis/Prometheus). - Visit http://localhost:3000 and log in with the default credentials (
admin
/admin
). - Grafana immediately prompts you to set a new password—follow the prompt.
- A Prometheus data source is pre-provisioned (points to
http://prometheus:9090
inside Docker, which resolves to Prometheus). - Open the Scheduler / Observability dashboard (already imported from
grafana/dashboards/observability.json
) to visualize queue depth, retries, latency, and worker health.
Below are example Grafana dashboards visualizing queue health, throughput, retries, and latency:
Tune the load generator to explore different mixes:
go run ./cmd/load \
-n 10000 -w 16 \
-pct-noop 80 \
-pct-flaky 10 \
-pct-fail 5 \
-pct-sleep 5 \
-sleep-ms 80000 \
-flaky-attempts 2
This simulates:
noop
– Instant success.flaky
– Fails the first N attempts then succeeds.always_fail
– Permanent failure → DLQ.sleeper
– Sleeps longer than the visibility timeout → reaper reclaim.
Scale worker processes and goroutines, then watch queue depth, throughput, retries, and DLQ growth in Grafana.
Example local run with 3 worker processes × 12 goroutines each:
Burst | Workers × Goroutines | In-flight | Drain Time | Throughput | p95 Run | p95 Wait | Fail % | Retries | Reaped |
---|---|---|---|---|---|---|---|---|---|
10k | 3 × 12 | ~36 | 22s | ~450/s | 25 ms | 0.8 s | 2% | 180 | 12 |
- Multi-tenant quotas and rate limiting to prevent noisy neighbors.
- DAG-style workflows with fan-out/fan-in dependencies.
- Autoscaling policies tied to queue depth and handler latency.
- Kubernetes deployment manifests.
- gRPC-based worker communication channel.
This project demonstrates systems-level backend skills (concurrency, retries, DLQs, reapers), infrastructure thinking (Redis patterns, DB pooling, idempotency), and an observability mindset (Prometheus histograms, Grafana dashboards, alerts). It highlights the ability to build distributed, reliable, observable systems end to end.