This plan is intentionally not coverage-first. For a queue/workflow library, trust comes from proving behavioral guarantees across drivers under normal load, contention, and failure.
Coverage remains a useful regression signal, but it is secondary to cross-driver scenario validation.
- Test guarantees, not internal implementations.
- Run the same scenario names across all supported drivers whenever possible.
- Capability-gate only where a backend truly cannot provide a guarantee.
- Document every unsupported or weaker guarantee explicitly.
- Treat flaky tests as product issues until proven infra-only.
Before v1.0.0, users should be able to rely on:
- Clear delivery and retry semantics (including duplicate-delivery expectations)
- Documented ordering guarantees (and non-guarantees) per driver
- Safe worker lifecycle behavior (`StartWorkers`, `Shutdown`) under repeated calls and races
- Predictable workflow behavior (chain/batch state transitions, callbacks, terminal states)
- Capability-consistent behavior for pause/stats/observability
- Recovery behavior across restart/outage scenarios where supported
Primary command:

```sh
GOCACHE=/tmp/queue-gocache go test ./...
```

What this currently gives us:
- Core queue API behavior in the root module
- `bus` workflow runtime/store semantics and callback idempotency tests
- Fake queue behavior used by application tests
- Internal bridge regressions (`internal/driverbridge`)
- API-level behavior and option handling
Script:

```sh
./scripts/test-all-modules.sh
```

Modes:
- default: compile-only (`-run '^$'`)
- `FULL=1`: full tests per module

What this gives us:
- cross-module build integrity (`GOWORK=off` in submodules)
- optional driver modules still compile/test independently
- examples and integration module stay wired correctly
Primary command:

```sh
INTEGRATION_BACKEND=all GOCACHE=/tmp/queue-gocache go test -tags=integration ./integration/... -count=1
```

Current integration coverage is already strong and includes:
- shared queue-runtime scenarios in `integration/all/integration_scenarios_test.go`
- queue/workflow API integration in `integration/all/runtime_integration_test.go`
- bus integration in `integration/bus/integration_test.go`
- observability integration in `integration/root/observability_integration_test.go`
- SQL callback duplicate-suppression integration tests in `integration/bus/callback_sql_integration_test.go`
Related documentation and CI checks:
- `docs/integration-scenarios.md`
- `.github/scripts/check_integration_scenarios_contract.sh`
Script:

```sh
scripts/coverage-codecov.sh
```

Current role:
- merges unit + integration-tagged coverage
- tracks broad regressions
- not used as a substitute for guarantee validation
Commands used during API changes:

```sh
GOCACHE=/tmp/queue-gocache go run ./docs/readme/main.go
GOCACHE=/tmp/queue-gocache go run ./docs/examplegen/main.go
cd examples && GOCACHE=/tmp/queue-gocache go test ./... -run '^TestExamplesBuild$' -count=1
```

What this gives us:
- generated docs/examples remain in sync
- example code compiles
Gap still open:
- manual README snippets are not automatically compile-checked
This section is the core of the plan. Each area should have:
- a documented guarantee (or explicit non-guarantee)
- shared scenario coverage (where possible)
- capability gates only where unavoidable
- Delivery is at-least-once (if that is the intended contract)
- Duplicate delivery is possible and callers must design handlers accordingly
- A job is not considered complete until handler success is acknowledged by the backend/runtime path
- Handler error behavior maps to retry/archive semantics as documented
- poison message and max retry handling
- duplicate-delivery idempotency scenario
- restart recovery scenarios
- broker fault / recovery scenarios (capability-gated)
- Explicit ack-boundary invariants under worker interruption
- example: handler side effect committed, ack path interrupted, duplicate delivery occurs; verify idempotency pattern and state consistency
- “success exactly once” is not promised; test and document “side-effect idempotency required” with reference scenario
- Delayed job + restart + recovery invariants for all backends that claim durable delay/retry behavior
- Retry count increments correctly
- Retry exhaustion leads to documented terminal behavior
- Delay/retry scheduling does not execute before configured time window (allow backend timing tolerance)
- Backoff strategy is respected within documented tolerance, not exact timestamps
- poison-message retry ceiling
- shutdown during delay/retry
- mixed option fuzz scenarios
- backpressure and delayed workloads in baseline suite
- Timing-window assertions for delay/backoff (early execution is a correctness bug)
- Backoff monotonicity under load (retry N+1 should not run before retry N for same job path)
- Deadline/timeout interaction with retry policy (especially per-driver timeout implementations)
- Jitter behavior (if supported/documented) should be tested as range-based, not exact
Per driver, explicitly state one of:
- no ordering guarantee
- best-effort FIFO per queue
- FIFO only under constrained conditions (single worker, no retry, no delay, no priority)
- `scenario_ordering_contract` with capability gating
- Split ordering scenarios by condition, not one broad “ordering”
- single worker / no retry
- multi-worker contention
- retries injected (ordering should be allowed to break if documented)
- delayed + immediate mix
- Assert documented non-guarantees
- example: prove ordering may break under multi-worker concurrency for drivers where only best-effort ordering exists
- Add a driver-facing matrix in docs that names exact ordering conditions
- `StartWorkers` is safe/idempotent
- `Shutdown` is safe/idempotent
- Shutdown behavior for in-flight jobs is documented
- graceful completion, interruption, retry rescheduling, or handoff behavior
- `scenario_startworkers_idempotent`
- `scenario_shutdown_idempotent`
- `scenario_shutdown_during_delay_retry`
- worker restart recovery scenarios
- Shutdown during active handler execution with long-running jobs
- verify resulting job state and retry behavior
- Start/shutdown race stress (rapid worker joins/leaves)
- Repeated startup/shutdown cycles on same queue under light load (resource leak smoke)
- `Dispatch(job)` uses default background context behavior
- `DispatchCtx(ctx, job)` obeys cancellation/deadline for the enqueue operation
- Context cancellation should not enqueue the job if cancellation occurs before acceptance (within documented backend tolerance)
- `scenario_dispatch_context_cancellation`
- dispatch burst and dispatch under broker fault scenarios
- Distinguish pre-canceled vs mid-flight canceled contexts
- Deadline exceeded error shape consistency (or documented backend-specific error wrapping)
- Queue-level defaults vs per-call context interaction (if queue-level timeouts/controls are configurable)
This is a trust-critical area. Users will assume high-level workflow helpers encode strong invariants.
- Chain steps advance only after prior step success
- Chain failure triggers catch/finally exactly once per workflow (under documented duplicate-delivery assumptions)
- Batch terminal state reflects child outcomes correctly
- Batch `then`/`catch`/`finally` callbacks run according to documented rules and are duplicate-safe
- Workflow state lookup/prune behavior is consistent and documented
- root `bus` tests cover chain/batch lifecycle and callback de-duplication
- SQL store contract tests cover callback marker/idempotency behavior
- integration queue and bus tests cover chain/batch end-to-end scenarios
- SQL runtime callback duplicate suppression integration tests exist
- Fault-injected workflow callback delivery duplication across more backends (where callback path traverses queue runtime)
- Callback failure retry semantics (if callbacks retry / fail terminally, document and test)
- Workflow + queue outage interactions
- enqueue succeeds for workflow record but queue dispatch fails (and recovery semantics)
- Multi-worker concurrent workflow progression stress
- ensure no double-advance / double-terminal transitions
- `SupportsPause`/`Pause`/`Resume` behavior is correct and capability-gated
- `SupportsNativeStats`/`Stats` behavior is correct and capability-gated
- Observer event emission for key lifecycle transitions is stable enough for docs/contracts
- observability integration suite
- pause/resume support integration checks
- internal bridge tests now guard stats/pause passthrough regressions
- Event ordering/sequence assertions for critical flows (dispatch -> start -> retry/fail -> success/archive)
- allow tolerance where async delivery makes exact global ordering impossible
- Error-path observability assertions (broker fault, dispatch cancellation, callback failure)
- “no false support claims” checks across wrappers/adapters (this regressed once; keep pressure on it)
- System maintains forward progress under bounded saturation
- Backpressure errors/blocking behavior is documented (if applicable)
- Large payload handling limits and error behavior are documented
- `scenario_backpressure_saturation`
- `scenario_payload_large`
- Explicit size limit boundary tests (just below / at / just above limit if limits are known)
- Recovery after saturation (normal traffic resumes cleanly)
- Queue depth / latency threshold assertions (range-based, backend-tolerant)
- `scenario_config_option_fuzz`
- invalid JSON bind scenarios
- Invalid option combinations with explicit error shape assertions
- Fuzz/property tests for payload decoding and queue-name normalization
- Config defaults invariants (documented defaults should be test-locked)
- Create/expand a driver guarantee matrix and link every guarantee to a proving test/scenario
- Location:
- `docs/backend-guarantees.md` (or a new dedicated matrix section/file)
- `docs/integration-scenarios.md`
- Acceptance:
- every supported backend row includes: ordering, pause/resume, native stats, restart recovery, broker fault injection support, delay/retry durability
- every capability/guarantee entry points to at least one scenario/test name
- all “unknown” cells are resolved to supported / unsupported / best-effort
- Notes:
- capability-gated skips in integration tests must match the matrix
Create or expand a table that maps each driver to guarantee strength for:
- ordering
- pause/resume
- native stats
- restart recovery
- broker fault injection support
- delay/retry durability
Each row must point to scenario names/tests that prove it.
Why:
- This closes the gap between “we think this driver supports X” and “we can prove X in CI”
- Split ordering coverage into condition-specific scenarios and align docs with exact guarantees
- Location:
- `integration/all/integration_scenarios_test.go`
- `docs/integration-scenarios.md`
- `docs/backend-guarantees.md`
- Acceptance:
- add scenarios (names can vary, but should be explicit) for:
- single-worker FIFO
- multi-worker non-FIFO / best-effort behavior
- retry-induced reordering
- delayed + immediate mix ordering behavior
- each scenario is capability-gated only where needed
- docs state exact ordering preconditions per backend (workers/retries/delay constraints)
- Notes:
- include at least one “non-guarantee” assertion to prevent accidental over-promising
Refactor `scenario_ordering_contract` into sub-scenarios with explicit preconditions:
- single worker FIFO
- multi-worker no FIFO guarantee
- retry-induced reorder allowed
- delayed/immediate mix
Why:
- Ordering bugs and over-promises are the biggest trust killers in queue systems
- Add tolerance-based timing-window assertions for delay/retry/backoff behavior
- Location:
- `integration/all/integration_scenarios_test.go`
- `docs/integration-scenarios.md` (timing tolerance documentation)
- Acceptance:
- delayed jobs are asserted not to execute before the configured delay window (with backend tolerance)
- retries are asserted not to execute before the expected backoff window (with backend tolerance)
- late execution is tolerated within documented limits; early execution fails
- thresholds are configurable similar to existing scenario duration guardrails
- Notes:
- use range assertions; do not assert exact timestamps
Add tolerance-based assertions that jobs do not execute earlier than configured delay/backoff windows.
Why:
- “Runs too early” is a correctness bug; “runs a bit late” is usually capacity/timing
- Expand workflow integration coverage for callback duplication/failure/recovery semantics
- Location:
- `integration/all/runtime_integration_test.go`
- `integration/bus/integration_test.go`
- `integration/bus/callback_sql_integration_test.go` (extend or mirror patterns)
- workflow docs (`README.md` / workflow docs if applicable)
- Acceptance:
- callback duplicate delivery is tested under at least one fault/recovery path
- callback failure behavior (retry/terminal/catch/finally semantics) is explicitly asserted and documented
- workflow state remains consistent after partial dispatch failures (workflow record + queue enqueue mismatch paths)
- no double-advance / double-terminal transition under concurrent processing in covered scenarios
- Notes:
- this is a trust-critical P0 item, not optional polish
- Progress: cross-backend callback failure semantics (catch/finally + terminal state) are covered in `integration/bus/integration_test.go`; SQL runtime/store integration now covers chain + batch duplicate callback suppression, callback replay after a callback-dispatch fault (chain final callback), and chain/batch dispatch failure state consistency (including batch partial-dispatch-failure-after-progress)
Extend workflow integration scenarios to cover:
- callback duplicate delivery under queue/runtime faults
- callback failure and retry/terminal semantics
- workflow state consistency after dispatch partial failures
Why:
- Workflow helpers amplify queue semantics; trust here matters more than raw queue enqueue tests
- Add race detection to CI release gate
- Location:
- CI workflow(s) in `.github/workflows/*`
- `test-plan.md` / release gate docs if split out
- Acceptance:
- PR or required CI runs `go test -race` on an agreed scope
- nightly or scheduled CI runs full `go test -race ./...` if PR scope is reduced
- failures are triaged as product bugs unless proven infrastructure/tooling issues
- Notes:
- Current PR CI scope is the root module race run in `.github/workflows/test.yml` (`race` job: `go test -race ./...`)
- Integration-tagged race coverage is not in the PR gate and should be treated as future hardening if needed
At least:

```sh
GOCACHE=/tmp/queue-gocache go test -race ./...
```

If too slow:
- required subset in PR CI
- full repo race in nightly
- Add automated compile-checking for selected manual `README.md` Go snippets
- Location:
- new script/tool (e.g. `scripts/check-readme-snippets.sh` or `tools/readmecheck`)
- CI workflow step
- Acceptance:
- curated manual snippets are extracted or mirrored and compile-checked in CI
- the `Dispatch`/handler-signature drift class is caught automatically
- failures report which snippet section broke
- Notes:
- start curated; full Markdown fence extraction can come later
Automate compile-checking of selected manual snippets.
Why:
- v1 trust includes accurate docs
- Expand scheduled adversarial integration scenarios for backend flaps and worker churn
- Location:
- `integration/all/integration_scenarios_test.go`
- `docs/integration-scenarios.md` (`chaos` section)
- CI scheduled workflow(s)
- Acceptance:
- scenarios cover broker disconnect during handler execution, reconnect/redelivery, and dispatch during backend flap (where supported)
- worker restart churn under load is exercised on restart-capable backends
- scenario results are visible in CI artifacts/logs with backend + scenario naming
- Notes:
- capability-gate unsupported fault injection paths explicitly
- Implemented in the `.github/workflows/soak.yml` `integration-chaos` subset with shared scenario names aligned to the current suite (`scenario_dispatch_during_broker_fault`, `scenario_consume_after_broker_recovery`, `scenario_worker_restart_recovery`, `scenario_worker_restart_delay_recovery`, plus contention/shutdown race probes); results are emitted with backend+scenario duration lines and uploaded per-backend logs
Expand scheduled integration scenarios for:
- broker disconnect during handler execution
- reconnect and redelivery behavior
- dispatch while backend is flapping
- worker restart churn under load
Why:
- Real production incidents happen in these edges, not in happy paths
- Add repeated-run soak gate for timing/concurrency-sensitive scenarios with flake tracking
- Location:
- CI scheduled workflow(s) / release-candidate workflow
- `docs/integration-scenarios.md`
- flake log doc (new)
- Acceptance:
- selected scenarios run repeatedly per backend (or backend subsets)
- flake rate is recorded by backend/scenario
- release candidates require manual review of recent flake results
- Notes:
- focus on contention, retry timing, shutdown races, ordering
- Implemented via `.github/workflows/soak.yml` `integration-flake-repeat` (scheduled + manual) using `scripts/integration-flake-repeat.sh`; the current backend subset is `redis`, `rabbitmq`, and `sqs`, with per-scenario flake-rate summaries/artifacts in the `docs/flake-log.md` review format
Run critical scenarios repeatedly (nightly/RC gate):
- multi-worker contention
- duplicate-delivery idempotency
- shutdown during delay/retry
- ordering sub-scenarios
Track flake rates by backend/scenario.
- Add explicit error contract tests for user-facing error classes
- Location:
- root tests (`*_test.go`)
- `integration/all` for backend/dispatch errors
- `integration/root` / `integration/bus` for workflow/state errors
- Acceptance:
- invalid config errors are asserted (not just non-nil)
- unsupported capability operations have stable/documented error behavior
- `DispatchCtx` cancellation/deadline errors are asserted by class/message contract
- workflow not-found / invalid-state errors are covered
- Notes:
- avoid overfitting exact wrapped error strings unless intentionally part of API
- Implemented across root + integration suites: root-level `error_contract_test.go` covers high-level `Queue.DispatchCtx` cancellation/deadline classes (deterministic saturation), unsupported capability errors for `Queue.Pause`, `Queue.Resume`, and `Queue.Stats`, `ErrWorkflowNotFound` wrappers for `FindChain`/`FindBatch`, constructor guidance errors (unsupported/moved drivers), runtime dispatch input validation (`nil` job / uninferable job type), high-level dispatch validation (`nil` receiver / invalid job), and workflow builder invalid-state errors; shared integration scenarios validate backend `DispatchCtx` cancellation classes (`scenario_dispatch_context_cancellation`), and root/integration-root contract suites assert stable error-shape semantics for missing job type / missing handler and unsupported `Snapshot(...)` fallback behavior
Add assertions for user-visible errors:
- invalid config
- unsupported capability operations
- context deadline/cancel during dispatch
- workflow not found / invalid state transitions
Why:
- “Trusted” also means errors are actionable and stable enough to debug
- Add fuzz/property tests for decoding, naming, and option-validation boundaries
- Location:
- root package tests (
Fuzz...) buspackage where parsing/validation applies
- root package tests (
- Acceptance:
- at least one fuzz target for payload binding/decoding
- at least one fuzz/property target for queue-name normalization/validation
- corpus seeds include malformed/edge payloads observed in integration tests
- Notes:
- Implemented root fuzz/property targets in `fuzz_queue_test.go`: `FuzzJobBindMatchesJSONUnmarshal` (payload decode parity with `json.Unmarshal`, including malformed seeds) and `FuzzNormalizeQueueName` (empty -> default, non-empty unchanged property)
- keep runtime bounded for CI; run deeper fuzzing manually/nightly
Targets:
- payload bind/decode paths
- queue naming normalization/validation
- option parsing and combination validation
- Define and automate lightweight performance smoke thresholds
- Location:
- benchmark/smoke tooling (existing docs/bench if reused)
- CI scheduled or release-candidate workflow
- Acceptance:
- guardrail checks exist for enqueue throughput and worker lifecycle latency
- thresholds are documented as regression alarms, not optimization goals
- release process includes checking these results
- Notes:
- avoid flaky microbench gating in PR CI
- Implemented via `BenchmarkWorkerpoolLifecycle` + existing enqueue benchmarks, checked by `scripts/check-bench-smoke.sh` and scheduled in the `.github/workflows/soak.yml` `benchmark-smoke` job; thresholds and review expectations are documented in `docs/performance-smoke.md`
Add smoke thresholds (not optimization benchmarks):
- enqueue throughput sanity
- worker lifecycle latency sanity
- no catastrophic regressions between releases
- Make tested backend versions explicit per release
- Location:
- `docs/compatibility-policy.md`
- CI config / release docs
- Acceptance:
- supported backend versions are listed
- tested versions used in CI are visible and tied to release notes
- capability differences are documented for supported versions where relevant
- Notes:
- users need to know what combinations are actually exercised
- Implemented with `docs/compatibility-policy.md`; release notes should state tested backends/versions and link CI-backed compatibility evidence for the tagged commit
Track/test supported backend versions per release so users know what combinations are actually exercised.
Minimum gate for a v1 release candidate:
- `GOCACHE=/tmp/queue-gocache go test ./...`
- `GOCACHE=/tmp/queue-gocache ./scripts/test-all-modules.sh`
- `GOCACHE=/tmp/queue-gocache FULL=1 ./scripts/test-all-modules.sh` (or equivalent split full runs)
- `INTEGRATION_BACKEND=all GOCACHE=/tmp/queue-gocache go test -tags=integration ./integration/... -count=1`
- `scripts/coverage-codecov.sh`
- `GOCACHE=/tmp/queue-gocache go run ./docs/readme/main.go`
- `GOCACHE=/tmp/queue-gocache go run ./docs/examplegen/main.go`
- `cd examples && GOCACHE=/tmp/queue-gocache go test ./... -run '^TestExamplesBuild$' -count=1`
- `.github/scripts/check_integration_scenarios_contract.sh`
- `GOCACHE=/tmp/queue-gocache go test -race ./...` (or approved split race jobs)
- One repeated-run integration pass (nightly or RC) on timing-sensitive scenarios with flake review
Use this section for active implementation tracking. Move items from here to completed notes as they land.
- Driver guarantee matrix linked to tests/docs
- Ordering scenarios split + docs alignment
- Retry/delay timing-window assertions
- Workflow fault + duplicate callback integration expansion
- CI race job added (required scope defined)
- README manual snippet verifier in CI
- Chaos-lite failure injection expansion
- Repeat-run/soak flake tracking pipeline
- Error contract tests
- Fuzz/property suites
- Performance regression guardrails
- Versioned compatibility matrix automation
- When adding a backend:
- add it to the guarantee matrix
- add/adjust capability gates
- run/update shared integration scenarios
- When changing semantics:
- update docs + tests in the same PR
- state whether the guarantee strengthened, weakened, or was clarified
- When a regression escapes:
- add a scenario/test entry for the missing guarantee class
- record the incident in the flake/regression log