
Add river bench benchmarking tool producing succinct output + summary #254

Merged: 1 commit into master from brandur-river-bench, Mar 10, 2024

Conversation


@brandur brandur commented Mar 8, 2024

Here, add a new benchmarking tool to the main River CLI. We had an
existing one, but it hasn't been used or updated in ages, and it was
written quite quickly, without much concern for UX.

This `river bench` command is user-runnable and designed to produce
output that's succinct and easily comprehensible. Every two seconds it
produces a new line of output with the number of jobs worked during that
period, the number of jobs inserted during that period, and the rough
number of jobs completed per second. When the program is interrupted via
`SIGINT`, it produces one final log line with similar information, but
calculated across the entire run period.

    $ go run main.go bench --database-url $DATABASE_URL
    bench: jobs worked [          0 ], inserted [      50000 ], job/sec [        0.0 ] [0s]
    bench: jobs worked [      22445 ], inserted [      22000 ], job/sec [    11222.5 ] [2s]
    bench: jobs worked [      26504 ], inserted [      28000 ], job/sec [    13252.0 ] [2s]
    bench: jobs worked [      25919 ], inserted [      24000 ], job/sec [    12959.5 ] [2s]
    bench: jobs worked [      27432 ], inserted [      28000 ], job/sec [    13716.0 ] [2s]
    bench: jobs worked [      26068 ], inserted [      26000 ], job/sec [    13034.0 ] [2s]
    bench: jobs worked [      27068 ], inserted [      28000 ], job/sec [    13534.0 ] [2s]
    bench: jobs worked [      27876 ], inserted [      28000 ], job/sec [    13938.0 ] [2s]
    bench: jobs worked [      25058 ], inserted [      24000 ], job/sec [    12529.0 ] [2s]
    ^Cbench: total jobs worked [     214356 ], total jobs inserted [     264000 ], overall job/sec [    13026.7 ], running 16.455185125s

It can also run for a fixed total duration, which will be useful when
comparing runs across branches without having to time them manually:

    $ go run main.go bench --database-url $DATABASE_URL --duration 30s
    bench: jobs worked [          0 ], inserted [      50000 ], job/sec [        0.0 ] [0s]
    bench: jobs worked [      23875 ], inserted [      24000 ], job/sec [    11937.5 ] [2s]
    bench: jobs worked [      27964 ], inserted [      28000 ], job/sec [    13982.0 ] [2s]
    bench: jobs worked [      25694 ], inserted [      26000 ], job/sec [    12847.0 ] [2s]
    bench: jobs worked [      26649 ], inserted [      26000 ], job/sec [    13324.5 ] [2s]
    bench: jobs worked [      26872 ], inserted [      28000 ], job/sec [    13436.0 ] [2s]
    bench: jobs worked [      26519 ], inserted [      26000 ], job/sec [    13259.5 ] [2s]
    bench: jobs worked [      25077 ], inserted [      24000 ], job/sec [    12538.5 ] [2s]
    bench: jobs worked [      24126 ], inserted [      26000 ], job/sec [    12063.0 ] [2s]
    bench: jobs worked [      23936 ], inserted [      22000 ], job/sec [    11968.0 ] [2s]
    bench: jobs worked [      26044 ], inserted [      28000 ], job/sec [    13022.0 ] [2s]
    bench: jobs worked [      26289 ], inserted [      26000 ], job/sec [    13144.5 ] [2s]
    bench: jobs worked [      23058 ], inserted [      22000 ], job/sec [    11529.0 ] [2s]
    bench: jobs worked [      23474 ], inserted [      24000 ], job/sec [    11737.0 ] [2s]
    bench: jobs worked [      25380 ], inserted [      26000 ], job/sec [    12690.0 ] [2s]
    bench: total jobs worked [     375743 ], total jobs inserted [     426000 ], overall job/sec [    12524.8 ], running 30.000017167s

Unlike the old benchmarking tool, this one does its job accounting using
a client subscribe channel instead of measuring in the worker. Measuring
in the worker doesn't account for the time the job executor spends
blocked waiting for a goroutine to become available in the completer,
which makes it less accurate and possibly prone to memory overruns as a
large backlog of jobs is accounted as completed while actually still
waiting for a completer slot.

I'm not going to call this feature complete, but I think it's a step in
the right direction, and the hope is that it'll give us a reasonable way
to gutcheck new changes and see whether they cause an obvious regression
or improvement in total performance.

@brandur brandur requested a review from bgentry March 8, 2024 02:45
@brandur brandur force-pushed the brandur-river-bench branch from e453cbd to 6d5f40f Compare March 9, 2024 16:10

brandur commented Mar 9, 2024

@bgentry Added a "burndown" mode that does all the insertion up front:

    $ go run main.go bench --database-url $DATABASE_URL -n 1_000_000
    bench: jobs worked [          0 ], inserted [    1000000 ], job/sec [        0.0 ] [0s]
    bench: jobs worked [       2255 ], inserted [          0 ], job/sec [     1127.5 ] [2s]
    bench: jobs worked [       3658 ], inserted [          0 ], job/sec [     1829.0 ] [2s]
    bench: jobs worked [       4450 ], inserted [          0 ], job/sec [     2225.0 ] [2s]
    bench: jobs worked [       4689 ], inserted [          0 ], job/sec [     2344.5 ] [2s]
    bench: jobs worked [       4745 ], inserted [          0 ], job/sec [     2372.5 ] [2s]
    bench: jobs worked [       4675 ], inserted [          0 ], job/sec [     2337.5 ] [2s]
    bench: jobs worked [       4704 ], inserted [          0 ], job/sec [     2352.0 ] [2s]
    bench: jobs worked [       4614 ], inserted [          0 ], job/sec [     2307.0 ] [2s]
    bench: jobs worked [       4611 ], inserted [          0 ], job/sec [     2305.5 ] [2s]
    ^Cbench: total jobs worked [      41704 ], total jobs inserted [    1000000 ], overall job/sec [     2142.0 ], running 19.469791417s

IMO the normal mode is more useful because it better simulates a real queue that's getting new insertions and working existing jobs, but the alternative is available in case it's needed.

@brandur brandur force-pushed the brandur-river-bench branch from 6d5f40f to cbfb9a2 Compare March 9, 2024 16:31
brandur added a commit that referenced this pull request Mar 9, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and uses a debounced channel that
fires periodically (currently, up to every 100 milliseconds) to submit
entire batches of up to 2,000 jobs for completion at once.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be by far the most common case. Jobs
going to other states are fed into a member `AsyncCompleter`, thereby
allowing the `BatchCompleter` to keep its implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by the
River client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
@brandur brandur force-pushed the brandur-river-bench branch from cbfb9a2 to ebced70 Compare March 10, 2024 01:08
@bgentry bgentry left a comment


LGTM aside from comments, good stuff 👏

    client, err := river.NewClient(b.driver, &river.Config{
    	Logger: slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelWarn})),
    	Queues: map[string]river.QueueConfig{
    		river.QueueDefault: {MaxWorkers: river.QueueNumWorkersMax},
bgentry (Contributor)

We'd probably want to make this MaxWorkers param an argument to the bench command. Setting it to 10k is probably not realistic, and it would probably be useful to play around with different input values to find the sweet spot on throughput.

The fetch cooldown is another attr that's probably worth setting up as an input variable to this. A queue that's optimized for throughput is going to have a much shorter & more aggressive fetch cooldown than most users would typically want (because it would have the tradeoff of constantly hitting the DB to fetch more jobs).

I would suggest setting these each to a specific default that doesn't necessarily map to River's own defaults, because those defaults are likely going to be what's better for a large number of smaller users than the small percentage of users chasing max throughput. But if your purpose is actually to gauge throughput, you'll want some different values.

brandur (Contributor Author)

Ah, good call on this one. I chose the max workers after num CPUs performed very poorly, and never reevaluated.

I played around with the numbers quite a bit, and although I wouldn't suggest these are perfect, I ended up with some that perform quite well (much better than what I'd had here before). I'm sure more refinement is possible, but will leave that as a future improvement.

	client, err := river.NewClient(b.driver, &river.Config{
		// When benchmarking to maximize job throughput, these numbers have an
		// outsized effect on results. The ones chosen here could possibly be
		// optimized further, but based on my tests of throwing a lot of random
		// values against the wall, they perform quite well. Much better than
		// the client's default values at any rate.
		FetchCooldown:     2 * time.Millisecond,
		FetchPollInterval: 5 * time.Millisecond,

		Logger: slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelWarn})),
		Queues: map[string]river.QueueConfig{
			// This could probably use more refinement, but in my quick and
			// dirty tests I found that roughly 1k workers was most optimal. 500
			// and 2,000 performed a little more poorly, and jumping up to the
			// 10k performed considerably less well (scheduler contention?).
			// There may be a more optimal number than 1,000, but it seems close
			// enough to target for now.
			river.QueueDefault: {MaxWorkers: 1_000},
		},
		Workers: workers,
	})

@@ -22,9 +23,10 @@ require (
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a // indirect
github.com/jackc/puddle/v2 v2.2.1 // indirect
github.com/riverqueue/river/riverdriver v0.0.17 // indirect
github.com/oklog/ulid/v2 v2.1.0 // indirect
bgentry (Contributor)

Seems like this is an outdated diff?

brandur (Contributor Author)

Ugh, I thought so too, but that's what go mod tidy produces. The ULID dep should be dropped next time we cut a release and can target the River CLI Go submodule against it.


brandur commented Mar 10, 2024

Thanks!

@brandur brandur merged commit fbd907f into master Mar 10, 2024
10 checks passed
@brandur brandur deleted the brandur-river-bench branch March 10, 2024 04:03
brandur added a commit that referenced this pull request Mar 10, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and exercise edge cases that
were previously being ignored completely. For most cases we drop the
heavy mocking that was happening before, which had the effect of
minimizing the surface area under test and producing misleading,
unrealistic timings.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 10, 2024
Some tweaks to the benchmark client in #254 which were meant to be
included there, but which I apparently forgot to push to GitHub before
pushing the "merge" button. See comment [1].

[1] #254 (comment)
brandur added a commit that referenced this pull request Mar 10, 2024
Some tweaks to the benchmark client in #254 which were meant to be
included there, but which I apparently forgot to push to GitHub before
pushing the "merge" button. See comment [1].

[1] #254 (comment)
brandur added a commit that referenced this pull request Mar 10, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 10, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
Here, add a new completer using a completion strategy designed to be
much faster than what we're doing right now. Rather than blindly
throwing completion work into goroutine slots, it accumulates "batches"
of completions to be carried out, and using a debounced channel to fire
periodically (currently, up to every 100 milliseconds) and submit entire
batches for completion at once up to 2,000 jobs.

For the purposes of not grossly expanding the `riverdriver` interface,
the completer only batches jobs being set to `completed`, which under
most normal workloads we expect to be the vast common case. Jobs going
to other states are fed into a member `AsyncCompleter`, thereby allowing
the `BatchCompleter` to keeps implementation quite simple.

According to in-package benchmarking, the new completer is in the range
of 3-5x faster than `AsyncCompleter` (the one currently in use by River
client), and 10-15x faster than `InlineCompleter`.

    $ go test -bench=. ./internal/jobcompleter
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/jobcompleter
    BenchmarkAsyncCompleter_Concurrency10/Completion-8                 10851            112318 ns/op
    BenchmarkAsyncCompleter_Concurrency10/RotatingStates-8             11386            120706 ns/op
    BenchmarkAsyncCompleter_Concurrency100/Completion-8                 9763            116773 ns/op
    BenchmarkAsyncCompleter_Concurrency100/RotatingStates-8            10884            115718 ns/op
    BenchmarkBatchCompleter/Completion-8                               54916             27314 ns/op
    BenchmarkBatchCompleter/RotatingStates-8                           11518            100997 ns/op
    BenchmarkInlineCompleter/Completion-8                               4656            369281 ns/op
    BenchmarkInlineCompleter/RotatingStates-8                           1561            794136 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/jobcompleter       21.123s

Along with the new completer, we also add a vastly more thorough test
suite to help tease out race conditions and test edges that were
previously being ignored completely. For most cases we drop the heavy
mocking that was happening before, which was having the effect of
minimizing the surface area under test, and producing misleading timing
that wasn't realistic.

Similarly, we bring in a new benchmark framework to allow us to easily
vet and compare completer implementations relative to each other. The
expectation is that this will act as a more synthetic proxy, with the
new benchmarking tool in #254 providing a more realistic end-to-end
measurement.
brandur added a commit that referenced this pull request Mar 12, 2024
brandur added a commit that referenced this pull request Mar 12, 2024
brandur added a commit that referenced this pull request Mar 12, 2024
brandur added a commit that referenced this pull request Mar 13, 2024
brandur added a commit that referenced this pull request Mar 13, 2024
brandur added a commit that referenced this pull request Mar 13, 2024
brandur added a commit that referenced this pull request Mar 16, 2024
brandur added a commit that referenced this pull request Mar 16, 2024
brandur added a commit that referenced this pull request Mar 16, 2024
brandur added a commit that referenced this pull request Mar 16, 2024
brandur added a commit that referenced this pull request Mar 17, 2024
brandur added a commit that referenced this pull request Mar 17, 2024
brandur added a commit that referenced this pull request Mar 17, 2024