Optimize job dequeue logic and indexing for massive throughput and CP… #136

rshingleton · 2025-04-22T13:37:05Z

Optimize job dequeue logic and indexing for massive throughput and CPU efficiency

Refactored the job dequeue SQL to use a CTE with FOR UPDATE SKIP LOCKED, reducing lock contention and improving concurrency.
Introduced a composite partial index on (state, queue, priority DESC, delayed, id) for state='inactive', dramatically increasing index scan efficiency for dequeue queries.
Added targeted GIN index on parents[] for state='inactive', accelerating parent dependency checks.
Set a local lock timeout to prevent worker stalls on lock contention.
Result: CPU usage dropped from ~90% to ~35% immediately after deployment, while job throughput increased by 81%. Queue backlog and job latency both fell sharply, with worker utilization now near optimal.
These changes enable the system to process >140K jobs/hour on existing hardware, with headroom for further scaling.
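
The dequeue pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not the SQL from this PR (which is not shown in full in the thread). It assumes a `minion_jobs` table with the columns named in the index definitions plus `task`, `args`, `retries`, `started` and `worker`, a DB-API style cursor that accepts pyformat parameters, and a caller that is already inside a transaction. The function name `dequeue` and the constant names are made up for the example.

```python
# Hypothetical sketch: NOT the statement from this PR.
# The parents dependency check and the optional id / min_priority filters
# are omitted to keep the shape of the query visible.

# SET LOCAL only lasts until the end of the current transaction.
LOCK_TIMEOUT_SQL = "SET LOCAL lock_timeout = '50ms'"

DEQUEUE_SQL = """
WITH candidate AS (
  SELECT id
  FROM minion_jobs
  WHERE state = 'inactive'
    AND queue = ANY (%(queues)s)
    AND task = ANY (%(tasks)s)
    AND delayed <= NOW()
  ORDER BY priority DESC, id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE minion_jobs AS j
SET started = NOW(), state = 'active', worker = %(worker)s
FROM candidate
WHERE j.id = candidate.id
RETURNING j.id, j.args, j.retries, j.task
"""


def dequeue(cur, worker_id, queues, tasks):
    """Claim one inactive job for worker_id; return its row, or None.

    Rows already locked by another worker are skipped rather than waited on,
    so concurrent workers pick different jobs instead of queuing on one row.
    """
    cur.execute(LOCK_TIMEOUT_SQL)
    cur.execute(DEQUEUE_SQL, {"worker": worker_id, "queues": queues, "tasks": tasks})
    return cur.fetchone()
```

The `WHERE state = 'inactive'` predicate and the `ORDER BY priority DESC, id` clause are what a partial index on inactive rows would be expected to serve, which is presumably why the composite index lists those columns.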

Summary

This PR restructures the Minion job queue’s PostgreSQL backend to eliminate critical bottlenecks in high-throughput environments. By migrating from a subquery-based dequeue pattern to a CTE-driven approach with strategic indexing, we achieved more than 4x the resource efficiency (jobs per hour per CPU%) on a 20-core Intel Xeon Gold 6338 server running PostgreSQL 14.15. The optimizations reduced CPU utilization from 88.7% to 37.7% while increasing job processing rates from 78K to 141K jobs/hour. The inactive-job backlog decreased by about 62% and the delayed-job backlog by about 29%, and worker utilization improved from 22% (7/32) to 87.5% (28/32), all without hardware changes.

Motivation

Prior to these changes, the system exhibited severe inefficiencies:

CPU Saturation: 90% utilization on 20 enterprise-grade cores, risking stability during traffic spikes.
Growing Backlogs: 5,356 inactive jobs and 2,252 delayed jobs due to slow dequeue operations.
Lock Contention: High wait times during concurrent job acquisition.
Underutilized Workers: Only 7/32 workers active despite available resources.

These optimizations were critical to avoid costly horizontal scaling and ensure reliable job processing for high-priority workloads. The CTE-based locking and composite indexes directly address the root cause of contention, enabling the system to leverage existing hardware fully.

Results and Evidence

Database Server Architecture

Component	Specification
CPU	Intel Xeon Gold 6338, 20 cores @ 2.00GHz
Memory	32GB RAM (20GB available, 24.8GB buffer/cache)
Database	PostgreSQL 14.15 (x86_64)

Performance Metrics

Metric	Before Optimization	After Optimization	Improvement
CPU Usage	88.7%	37.7%	-57.5%
Job Throughput	78,047 jobs/hour	141,362 jobs/hour	+81.1%
Inactive Jobs	5,356	2,058	-61.6%
Delayed Jobs	2,252	1,601	-28.9%
Active Workers	7/32	28/32	+300%
Resource Efficiency (jobs/hour per CPU%)	0.88K	3.75K	+326%
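
The Improvement column and the Resource Efficiency row can be re-derived from the raw before/after figures. The following is a small sketch that assumes "Resource Efficiency" means finished jobs per hour divided by CPU usage in percent; the helper name `pct_change` is made up for the example.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the Performance Metrics table.
metrics = {
    "cpu_usage_pct": (88.7, 37.7),
    "jobs_per_hour": (78_047, 141_362),
    "inactive_jobs": (5_356, 2_058),
    "delayed_jobs": (2_252, 1_601),
    "active_workers": (7, 28),
}

changes = {name: pct_change(b, a) for name, (b, a) in metrics.items()}

# Assumed definition: jobs per hour divided by CPU usage in percent.
eff_before = metrics["jobs_per_hour"][0] / metrics["cpu_usage_pct"][0]
eff_after = metrics["jobs_per_hour"][1] / metrics["cpu_usage_pct"][1]
eff_change = pct_change(eff_before, eff_after)

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
print(f"efficiency: {eff_before:.0f} -> {eff_after:.0f} ({eff_change:+.0f}%)")
```

Under that definition the efficiency ratio is roughly 4.26, which is the same thing as the +326% shown in the table.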

Hourly Throughput History (Selected 24h Window)

Period	Finished Jobs/hour	Failed Jobs/hour
Pre-optimization (avg)	78,047	193
Post-optimization (avg)	141,362	231
Max (post-optimization)	155,118	248

References

CPU Usage: Dropped from 88.7% (17.7/20 cores) to 37.7% (7.5/20 cores) after optimization.
Job Throughput: Increased from 78,047 jobs/hour to 141,362 jobs/hour.
Queue Backlog: Inactive jobs dropped from 5,356 to 2,058; delayed jobs from 2,252 to 1,601.
Worker Utilization: Increased from 7/32 to 28/32 active workers.
Resource Efficiency: Jobs processed per hour per CPU% rose from 0.88K to 3.75K.
System: Intel Xeon Gold 6338, 20 cores, 32GB RAM, PostgreSQL 14.15.

Conclusion:
This PR delivers a transformative improvement in queue throughput, latency, and efficiency, enabling the system to process over 140K jobs/hour at less than 40% CPU utilization on current hardware, while leaving substantial headroom for further scaling.

kraih · 2025-04-22T14:00:01Z

Please use consistent formatting with the surrounding code.

rshingleton · 2025-04-22T14:21:07Z

Define consistent formatting.

I see at least 3 different sql formatted styles in the existing codebase:

q{sql}
"sql"
'sql'

Do you want me to remove the leading and trailing spaces and/or newlines?

kraih · 2025-04-22T15:38:57Z

> I see at least 3 different sql formatted styles in the existing codebase:
>
> q{sql} "sql" 'sql'

Those are chosen based on characters in the SQL string.

> Do you want me to remove the leading and trailing spaces and/or newlines?

Yes. We want it to look like the same person wrote the whole file.

kraih · 2025-04-22T15:40:14Z

lib/Minion/Backend/Pg.pm

-      RETURNING id, args, retries, task}, $id, $options->{id}, $options->{min_priority},
-    $options->{queues} || ['default'], [keys %{$self->minion->tasks}]
+      q{
+        -- Set lock timeout to prevent long waits (50ms)


The comment doesn't really add any information the code doesn't already contain.

kraih · 2025-04-22T15:42:05Z

lib/Minion/Backend/resources/migrations/pg.sql

+    WHERE state = 'inactive';
+
+CREATE INDEX ON minion_jobs USING GIN (parents)
+    WHERE state = 'inactive';


Pretty sure this can just be one line. And move the newline above back below this line.

kraih · 2025-04-22T15:43:19Z

Looks promising. Once tests pass, I'll do a full review.

rshingleton · 2025-04-23T13:45:54Z

I modified the formatting. I can't tell where the perltidy tests are failing.

As an update in my environment, the modifications continue to deliver excellent results: the queue is healthy, throughput is high, and the system is stable and efficient under sustained load.

Performance Update (Apr 23, 2025):

Metric	Pre-Optimization	Post-Optimization (Apr 22)	Current (Apr 23)	% Change (Pre → Now)
Inactive Jobs	5,356	2,058	1,405	-74%
Delayed Jobs	2,252	1,601	1,302	-42%
Finished Jobs	4,401,675	4,831,366	6,549,892	+49%
Active Workers	7 / 32	28 / 32	27 / 32	+286%
CPU Usage	88.7%	37.7%	35–40%	~-60%
Failure Rate	0.19%	0.16%	0.16%	Stable

Over 2.1 million additional jobs processed since deployment.
Queue backlogs are shrinking, throughput remains high, and worker utilization is strong.
CPU usage remains low and stable, confirming sustained efficiency.
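
As a quick check on the "% Change (Pre → Now)" column and the "2.1 million additional jobs" figure, the sketch below recomputes them from the pre-optimization and Apr 23 values. It assumes the Finished Jobs row holds cumulative totals; `pct_change` is a made-up helper name.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (pre-optimization, Apr 23) pairs copied from the table above.
snapshots = {
    "inactive_jobs": (5_356, 1_405),
    "delayed_jobs": (2_252, 1_302),
    "finished_jobs": (4_401_675, 6_549_892),
    "active_workers": (7, 27),
}

changes = {name: pct_change(pre, now) for name, (pre, now) in snapshots.items()}

# Difference between the two cumulative finished-job totals.
additional_jobs = snapshots["finished_jobs"][1] - snapshots["finished_jobs"][0]

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
print(f"additional finished jobs: {additional_jobs:,}")
```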

I'd also like to note that response times for the other utility queries used by the admin UI for stats and history have improved dramatically. This is likely due to the reduced resource usage on the database.

kraih · 2025-04-23T14:45:46Z

Just click on the failing test, the output shows all the perltidy errors. The perltidyrc we use is included in the repo.

kraih · 2025-04-23T14:46:49Z

And please squash your commits.

kraih · 2025-04-24T11:29:09Z

The test output is very clear about what needs to be fixed:
https://github.com/mojolicious/minion/actions/runs/14602379397/job/40963390560?pr=136
https://github.com/mojolicious/minion/actions/runs/14602379393/job/40963390492?pr=136

rshingleton · 2025-04-25T13:51:17Z

> The test output is very clear about what needs to be fixed: https://github.com/mojolicious/minion/actions/runs/14602379397/job/40963390560?pr=136 https://github.com/mojolicious/minion/actions/runs/14602379393/job/40963390492?pr=136

The first test was very clear, and I fixed that issue.

The second test I don't find very clear.

kraih · 2025-04-25T20:50:49Z

> The second test I don't find very clear.

Just run perltidy with the perltidyrc from this repo.

kraih · 2025-04-25T20:55:26Z

The commands to do it are right in the workflow.

rshingleton · 2025-04-26T01:05:32Z

I don't think these perltidy issues are related to my PR. I checked out a completely untouched version of this repo and those tests still fail:

7QFX4D:testing shingler$ git clone https://github.com/mojolicious/minion.git
Cloning into 'minion'...
remote: Enumerating objects: 8083, done.
remote: Counting objects: 100% (513/513), done.
remote: Compressing objects: 100% (271/271), done.
remote: Total 8083 (delta 224), reused 463 (delta 207), pack-reused 7570 (from 1)
Receiving objects: 100% (8083/8083), 5.70 MiB | 16.25 MiB/s, done.
Resolving deltas: 100% (4266/4266), done.

7QFX4D:testing shingler$ cd minion/

7QFX4D:minion shingler$ export GLOBIGNORE=t/lib/MinionTest/SyntaxErrorTestTask.pm;shopt -s extglob globstar nullglob;perltidy --pro=.../.perltidyrc -b -bext='/' **/*.p[lm] **/*.t && git diff --exit-code
-bash: shopt: globstar: invalid shell option name
diff --git a/t/pg_admin.t b/t/pg_admin.t
index a789aee..c44ac4b 100644
--- a/t/pg_admin.t
+++ b/t/pg_admin.t
@@ -32,23 +32,37 @@ subtest 'Dashboard' => sub {
 };

 subtest 'Stats' => sub {
-  $t->get_ok('/minion/stats')->status_is(200)->json_is('/active_jobs' => 0)->json_is('/active_locks' => 0)
-    ->json_is('/active_workers'   => 0)->json_is('/delayed_jobs'  => 0)->json_is('/enqueued_jobs' => 2)
-    ->json_is('/failed_jobs'      => 0)->json_is('/finished_jobs' => 1)->json_is('/inactive_jobs' => 1)
-    ->json_is('/inactive_workers' => 0)->json_has('/uptime');
+  $t->get_ok('/minion/stats')
+    ->status_is(200)
+    ->json_is('/active_jobs'      => 0)
+    ->json_is('/active_locks'     => 0)
+    ->json_is('/active_workers'   => 0)
+    ->json_is('/delayed_jobs'     => 0)
+    ->json_is('/enqueued_jobs'    => 2)
+    ->json_is('/failed_jobs'      => 0)
+    ->json_is('/finished_jobs'    => 1)
+    ->json_is('/inactive_jobs'    => 1)
+    ->json_is('/inactive_workers' => 0)
+    ->json_has('/uptime');
 };

 subtest 'Jobs' => sub {
-  $t->get_ok('/minion/jobs?state=inactive')->status_is(200)->text_like('tbody td a' => qr/$inactive/)
+  $t->get_ok('/minion/jobs?state=inactive')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/$inactive/)
     ->text_unlike('tbody td a' => qr/$finished/);
-  $t->get_ok('/minion/jobs?state=finished')->status_is(200)->text_like('tbody td a' => qr/$finished/)
+  $t->get_ok('/minion/jobs?state=finished')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/$finished/)
     ->text_unlike('tbody td a' => qr/$inactive/);
 };

 subtest 'Workers' => sub {
   $t->get_ok('/minion/workers')->status_is(200)->element_exists_not('tbody td a');
   my $worker = app->minion->worker->register;
-  $t->get_ok('/minion/workers')->status_is(200)->element_exists('tbody td a')
+  $t->get_ok('/minion/workers')
+    ->status_is(200)
+    ->element_exists('tbody td a')
     ->text_like('tbody td a' => qr/@{[$worker->id]}/);
   $worker->unregister;
   $t->get_ok('/minion/workers')->status_is(200)->element_exists_not('tbody td a');
@@ -64,11 +78,15 @@ subtest 'Locks' => sub {
   $t->get_ok('/minion/locks')->status_is(200)->text_like('tbody td#lock_id' => qr/2/);
   $t->get_ok('/minion/locks?name=foo')->status_is(200)->text_like('tbody td a'       => qr/foo/);
   $t->get_ok('/minion/locks?name=foo')->status_is(200)->text_like('tbody td#lock_id' => qr/1/);
-  $t->post_ok('/minion/locks?_method=DELETE&name=bar')->status_is(200)->text_like('tbody td a' => qr/foo/)
+  $t->post_ok('/minion/locks?_method=DELETE&name=bar')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/foo/)
     ->text_like('.alert-success', qr/All selected named locks released/);
   is $t->tx->previous->res->code, 302, 'right status';
   like $t->tx->previous->res->headers->location, qr/locks/, 'right "Location" value';
-  $t->post_ok('/minion/locks?_method=DELETE&name=foo')->status_is(200)->element_exists_not('tbody td a')
+  $t->post_ok('/minion/locks?_method=DELETE&name=foo')
+    ->status_is(200)
+    ->element_exists_not('tbody td a')
     ->text_like('.alert-success', qr/All selected named locks released/);
   is $t->tx->previous->res->code, 302, 'right status';
   like $t->tx->previous->res->headers->location, qr/locks/, 'right "Location" value';

These tests all appear to be related to the UI.
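
One detail in the session above is worth illustrating: the `globstar: invalid shell option name` error suggests the local shell did not enable recursive `**` matching, so the local run may not have covered the same files as the CI workflow. The sketch below shows the analogous behaviour with Python's `glob` module, where `**` without `recursive=True` acts like a plain `*`. The directory tree and the `match` and `demo` helpers are hypothetical, and the comparison to bash is an assumption, not something verified against this repo.

```python
import glob
import os
import tempfile


def match(root, pattern, recursive):
    """Return sorted matches for pattern under root, relative to root."""
    hits = glob.glob(os.path.join(glob.escape(root), pattern), recursive=recursive)
    return sorted(os.path.relpath(hit, root).replace(os.sep, "/") for hit in hits)


def demo():
    """Build a small hypothetical tree and glob it both ways."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("Top.pm", "lib/Minion.pm", "lib/Minion/Backend/Pg.pm"):
            path = os.path.join(root, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        # Without recursion, ** matches exactly one directory level.
        shallow = match(root, "**/*.pm", recursive=False)
        # With recursion, ** matches zero or more directory levels.
        deep = match(root, "**/*.pm", recursive=True)
    return shallow, deep


if __name__ == "__main__":
    shallow, deep = demo()
    print("without recursion:", shallow)
    print("with recursion:   ", deep)
```

In this toy tree the non-recursive pattern only finds `lib/Minion.pm`, while the recursive one also finds the top-level file and the nested `lib/Minion/Backend/Pg.pm`.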

kraih · 2025-04-26T12:06:00Z

There is really no point arguing about who should fix what. Just make the tests pass. This will take a whole lot more time if you wait for someone else to do it, that PR getting merged, and you having to rebase afterwards...

kraih reviewed Apr 22, 2025

rshingleton force-pushed the main branch from cbf0728 to 90858b4 on April 25, 2025 13:45

kraih added the work in progress label Apr 25, 2025
