Optimize job dequeue logic and indexing for massive throughput and CP… #136

Status: Open (1 commit, base: main)

rshingleton

Optimize job dequeue logic and indexing for massive throughput and CPU efficiency

  • Refactored the job dequeue SQL to use a CTE with FOR UPDATE SKIP LOCKED, reducing lock contention and improving concurrency.
  • Introduced a composite partial index on (state, queue, priority DESC, delayed, id) for state='inactive', dramatically increasing index scan efficiency for dequeue queries.
  • Added targeted GIN index on parents[] for state='inactive', accelerating parent dependency checks.
  • Set a local lock timeout to prevent worker stalls on lock contention.
  • Result: CPU usage dropped from ~90% to ~35% immediately after deployment, while job throughput increased by 81%. Queue backlog and job latency both fell sharply, with worker utilization now near optimal.
  • These changes enable the system to process >140K jobs/hour on existing hardware, with headroom for further scaling.
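
The CTE dequeue and local lock timeout described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SQL from this PR: the `jobs_sketch` table, its columns, and the literal worker id `1` are hypothetical stand-ins for the real `minion_jobs` schema, and the 50ms value is taken from the code comment quoted in the review thread.

```sql
-- Hypothetical, simplified stand-in for the real job table (assumption).
CREATE TABLE IF NOT EXISTS jobs_sketch (
  id       BIGSERIAL   PRIMARY KEY,
  queue    TEXT        NOT NULL DEFAULT 'default',
  state    TEXT        NOT NULL DEFAULT 'inactive',
  priority INT         NOT NULL DEFAULT 0,
  delayed  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  worker   BIGINT,
  started  TIMESTAMPTZ
);

INSERT INTO jobs_sketch (priority) VALUES (0), (5);

BEGIN;

-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL lock_timeout = '50ms';

-- Pick one runnable job; rows locked by other workers are skipped, not waited on.
WITH candidate AS (
  SELECT id
  FROM jobs_sketch
  WHERE state = 'inactive'
    AND queue = ANY (ARRAY['default'])
    AND delayed <= NOW()
  ORDER BY priority DESC, id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs_sketch AS j
SET state = 'active', started = NOW(), worker = 1  -- worker id is illustrative
FROM candidate
WHERE j.id = candidate.id
RETURNING j.id, j.priority;

COMMIT;
```

With `SKIP LOCKED`, a second worker running the same statement concurrently passes over the row the first worker has locked and takes the next candidate instead of blocking on it.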

Summary

This PR restructures the Minion job queue’s PostgreSQL backend to eliminate critical bottlenecks in high-throughput environments. By migrating from a subquery-based dequeue pattern to a CTE-driven approach with strategic indexing, we achieved roughly 4.3x the resource efficiency (jobs per CPU%, +326%) on a 20-core Intel Xeon Gold 6338 server running PostgreSQL 14.15. The optimizations reduced CPU utilization from 88.7% to 37.7% while increasing job processing rates from 78K to 141K jobs/hour. Inactive jobs decreased by about 62% and delayed jobs by about 29%, and worker utilization improved from 22% (7/32) to 87.5% (28/32), all without hardware changes.


Motivation

Prior to these changes, the system exhibited severe inefficiencies:

  • CPU Saturation: 90% utilization on 20 enterprise-grade cores, risking stability during traffic spikes.
  • Growing Backlogs: 5,356 inactive jobs and 2,252 delayed jobs due to slow dequeue operations.
  • Lock Contention: High wait times during concurrent job acquisition.
  • Underutilized Workers: Only 7/32 workers active despite available resources.

These optimizations were critical to avoid costly horizontal scaling and ensure reliable job processing for high-priority workloads. The CTE-based locking and composite indexes directly address the root cause of contention, enabling the system to leverage existing hardware fully.
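
As a rough sketch of the indexing side, the two partial indexes named in the description might look like the following. The `jobs_index_sketch` table and the index names are hypothetical (the PR's own DDL is only partially visible in the review thread); the column order and the `state = 'inactive'` predicate come from the description.

```sql
-- Hypothetical, simplified stand-in for the real job table (assumption).
CREATE TABLE IF NOT EXISTS jobs_index_sketch (
  id       BIGSERIAL   PRIMARY KEY,
  queue    TEXT        NOT NULL DEFAULT 'default',
  state    TEXT        NOT NULL DEFAULT 'inactive',
  priority INT         NOT NULL DEFAULT 0,
  delayed  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  parents  BIGINT[]    NOT NULL DEFAULT '{}'
);

-- Composite partial index: only rows waiting to be dequeued are indexed.
CREATE INDEX IF NOT EXISTS jobs_index_sketch_dequeue_idx
  ON jobs_index_sketch (state, queue, priority DESC, delayed, id)
  WHERE state = 'inactive';

-- Partial GIN index over the parents array. GIN array indexes serve
-- operators such as @> and &&, so the query has to use one of them to benefit.
CREATE INDEX IF NOT EXISTS jobs_index_sketch_parents_idx
  ON jobs_index_sketch USING GIN (parents)
  WHERE state = 'inactive';

-- Inspect which plan the dequeue-style lookup gets. On a tiny table the
-- planner may still prefer a sequential scan.
EXPLAIN
SELECT id
FROM jobs_index_sketch
WHERE state = 'inactive'
  AND queue = 'default'
  AND delayed <= NOW()
ORDER BY priority DESC, id
LIMIT 1;
```

Because both indexes are partial, rows leave them as soon as a job moves out of the `inactive` state, which keeps them small relative to a table dominated by finished jobs.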


Results and Evidence

Database Server Architecture

| Component | Specification |
|-----------|---------------|
| CPU | Intel Xeon Gold 6338, 20 cores @ 2.00GHz |
| Memory | 32GB RAM (20GB available, 24.8GB buffer/cache) |
| Database | PostgreSQL 14.15 (x86_64) |

Performance Metrics

| Metric | Before Optimization | After Optimization | Improvement |
|--------|---------------------|--------------------|-------------|
| CPU Usage | 88.7% | 37.7% | -57.5% |
| Job Throughput | 78,047 jobs/hour | 141,362 jobs/hour | +81.1% |
| Inactive Jobs | 5,356 | 2,058 | -61.6% |
| Delayed Jobs | 2,252 | 1,601 | -28.9% |
| Active Workers | 7/32 | 28/32 | +300% |
| Resource Efficiency (jobs/CPU%) | 0.88K | 3.75K | +326% |

Hourly Throughput History (Selected 24h Window)

| Hour | Finished Jobs | Failed Jobs |
|------|---------------|-------------|
| Pre-optimization (avg) | 78,047 | 193 |
| Post-optimization (avg) | 141,362 | 231 |
| Max (post-optimization) | 155,118 | 248 |

References

  • CPU Usage: Dropped from 88.7% (17.7/20 cores) to 37.7% (7.5/20 cores) after optimization.
  • Job Throughput: Increased from 78,047 jobs/hour to 141,362 jobs/hour.
  • Queue Backlog: Inactive jobs dropped from 5,356 to 2,058; delayed jobs from 2,252 to 1,601.
  • Worker Utilization: Increased from 7/32 to 28/32 active workers.
  • Resource Efficiency: Jobs processed per CPU% rose from 0.88K to 3.75K.
  • System: Intel Xeon Gold 6338, 20 cores, 32GB RAM, PostgreSQL 14.15.
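
The percentage and efficiency figures above can be re-derived from the raw before/after values. The query below is only a convenience check; it uses no tables, and the column aliases are made up for illustration.

```sql
SELECT
  round(100.0 * (37.7 - 88.7) / 88.7, 1)     AS cpu_change_pct,          -- -57.5
  round(100.0 * (141362 - 78047) / 78047, 1) AS throughput_change_pct,   -- 81.1
  round(100.0 * (2058 - 5356) / 5356, 1)     AS inactive_change_pct,     -- -61.6
  round(100.0 * (1601 - 2252) / 2252, 1)     AS delayed_change_pct,      -- -28.9
  round(78047 / 88.7)                        AS jobs_per_cpu_pct_before, -- 880
  round(141362 / 37.7)                       AS jobs_per_cpu_pct_after;  -- 3750
```

The last two columns correspond to the 0.88K and 3.75K jobs per CPU% entries, a ratio of about 4.26, which is where the +326% figure comes from.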

Conclusion:
This PR delivers a transformative improvement in queue throughput, latency, and efficiency, enabling the system to process over 140K jobs/hour at less than 40% CPU utilization on current hardware, while leaving substantial headroom for further scaling.

@kraih
Member

kraih commented Apr 22, 2025

Please use consistent formatting with the surrounding code.

@rshingleton
Author

rshingleton commented Apr 22, 2025

Define consistent formatting.

I see at least 3 different sql formatted styles in the existing codebase:

q{sql}
"sql"
'sql'

Do you want me to remove the leading and trailing spaces and/or newlines?

@kraih
Member

kraih commented Apr 22, 2025

> I see at least 3 different sql formatted styles in the existing codebase:
>
> q{sql} "sql" 'sql'

Those are chosen based on characters in the SQL string.

> Do you want me to remove the leading and trailing spaces and/or newlines?

Yes. We want it to look like the same person wrote the whole file.

RETURNING id, args, retries, task}, $id, $options->{id}, $options->{min_priority},
$options->{queues} || ['default'], [keys %{$self->minion->tasks}]
q{
-- Set lock timeout to prevent long waits (50ms)
Member

The comment doesn't really add any information the code doesn't already contain.

WHERE state = 'inactive';

CREATE INDEX ON minion_jobs USING GIN (parents)
WHERE state = 'inactive';
Member

Pretty sure this can just be one line. And move the newline above back below this line.

@kraih
Member

kraih commented Apr 22, 2025

Looks promising, once tests pass I'll do a full review.

@rshingleton
Author

rshingleton commented Apr 23, 2025

I modified the formatting. I can't tell where the perl tidy tests are failing.

As an update in my environment, the modifications continue to deliver excellent results: the queue is healthy, throughput is high, and the system is stable and efficient under sustained load.

Performance Update (Apr 23, 2025):

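| Metric | Pre-Optimization | Post-Optimization (Apr 22) | Current (Apr 23) | % Change (Pre → Now) |
|--------|------------------|----------------------------|------------------|----------------------|
| Inactive Jobs | 5,356 | 2,058 | 1,405 | -74% |
| Delayed Jobs | 2,252 | 1,601 | 1,302 | -42% |
| Finished Jobs | 4,401,675 | 4,831,366 | 6,549,892 | +49% |
| Active Workers | 7 / 32 | 28 / 32 | 27 / 32 | +285% |
| CPU Usage | 88.7% | 37.7% | 35–40% | ~-60% |
| Failure Rate | 0.19% | 0.16% | 0.16% | Stable |
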
  • Over 2.1 million additional jobs processed since deployment.
  • Queue backlogs are shrinking, throughput remains high, and worker utilization is strong.
  • CPU usage remains low and stable, confirming sustained efficiency.

I'd also like to note that the response times for other utility queries used in the admin ui for stats and history have dramatically improved. This is likely due to the reduced resource usage on the database.

@kraih
Member

kraih commented Apr 23, 2025

Just click on the failing test, the output shows all the perltidy errors. The perltidyrc we use is included in the repo.

@kraih
Member

kraih commented Apr 23, 2025

And please squash your commits.

@kraih
Member

kraih commented Apr 24, 2025

@rshingleton
Author

> The test output is very clear about what needs to be fixed: https://github.com/mojolicious/minion/actions/runs/14602379397/job/40963390560?pr=136 https://github.com/mojolicious/minion/actions/runs/14602379393/job/40963390492?pr=136

The first test was very clear, I fixed that test issue there.

The second test I don't find very clear.

@kraih
Member

kraih commented Apr 25, 2025

> The second test I don't find very clear.

Just run perltidy with the perltidyrc from this repo.

@kraih
Member

kraih commented Apr 25, 2025

The commands to do it are right in the workflow.

@rshingleton
Author

I don't think these perl tidy issues are related to my PR. I checked out a completely untouched version from this repo and those tests still fail:

7QFX4D:testing shingler$ git clone https://github.com/mojolicious/minion.git
Cloning into 'minion'...
remote: Enumerating objects: 8083, done.
remote: Counting objects: 100% (513/513), done.
remote: Compressing objects: 100% (271/271), done.
remote: Total 8083 (delta 224), reused 463 (delta 207), pack-reused 7570 (from 1)
Receiving objects: 100% (8083/8083), 5.70 MiB | 16.25 MiB/s, done.
Resolving deltas: 100% (4266/4266), done.

7QFX4D:testing shingler$ cd minion/

7QFX4D:minion shingler$ export GLOBIGNORE=t/lib/MinionTest/SyntaxErrorTestTask.pm;shopt -s extglob globstar nullglob;perltidy --pro=.../.perltidyrc -b -bext='/' **/*.p[lm] **/*.t && git diff --exit-code
-bash: shopt: globstar: invalid shell option name
diff --git a/t/pg_admin.t b/t/pg_admin.t
index a789aee..c44ac4b 100644
--- a/t/pg_admin.t
+++ b/t/pg_admin.t
@@ -32,23 +32,37 @@ subtest 'Dashboard' => sub {
 };

 subtest 'Stats' => sub {
-  $t->get_ok('/minion/stats')->status_is(200)->json_is('/active_jobs' => 0)->json_is('/active_locks' => 0)
-    ->json_is('/active_workers'   => 0)->json_is('/delayed_jobs'  => 0)->json_is('/enqueued_jobs' => 2)
-    ->json_is('/failed_jobs'      => 0)->json_is('/finished_jobs' => 1)->json_is('/inactive_jobs' => 1)
-    ->json_is('/inactive_workers' => 0)->json_has('/uptime');
+  $t->get_ok('/minion/stats')
+    ->status_is(200)
+    ->json_is('/active_jobs'      => 0)
+    ->json_is('/active_locks'     => 0)
+    ->json_is('/active_workers'   => 0)
+    ->json_is('/delayed_jobs'     => 0)
+    ->json_is('/enqueued_jobs'    => 2)
+    ->json_is('/failed_jobs'      => 0)
+    ->json_is('/finished_jobs'    => 1)
+    ->json_is('/inactive_jobs'    => 1)
+    ->json_is('/inactive_workers' => 0)
+    ->json_has('/uptime');
 };

 subtest 'Jobs' => sub {
-  $t->get_ok('/minion/jobs?state=inactive')->status_is(200)->text_like('tbody td a' => qr/$inactive/)
+  $t->get_ok('/minion/jobs?state=inactive')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/$inactive/)
     ->text_unlike('tbody td a' => qr/$finished/);
-  $t->get_ok('/minion/jobs?state=finished')->status_is(200)->text_like('tbody td a' => qr/$finished/)
+  $t->get_ok('/minion/jobs?state=finished')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/$finished/)
     ->text_unlike('tbody td a' => qr/$inactive/);
 };

 subtest 'Workers' => sub {
   $t->get_ok('/minion/workers')->status_is(200)->element_exists_not('tbody td a');
   my $worker = app->minion->worker->register;
-  $t->get_ok('/minion/workers')->status_is(200)->element_exists('tbody td a')
+  $t->get_ok('/minion/workers')
+    ->status_is(200)
+    ->element_exists('tbody td a')
     ->text_like('tbody td a' => qr/@{[$worker->id]}/);
   $worker->unregister;
   $t->get_ok('/minion/workers')->status_is(200)->element_exists_not('tbody td a');
@@ -64,11 +78,15 @@ subtest 'Locks' => sub {
   $t->get_ok('/minion/locks')->status_is(200)->text_like('tbody td#lock_id' => qr/2/);
   $t->get_ok('/minion/locks?name=foo')->status_is(200)->text_like('tbody td a'       => qr/foo/);
   $t->get_ok('/minion/locks?name=foo')->status_is(200)->text_like('tbody td#lock_id' => qr/1/);
-  $t->post_ok('/minion/locks?_method=DELETE&name=bar')->status_is(200)->text_like('tbody td a' => qr/foo/)
+  $t->post_ok('/minion/locks?_method=DELETE&name=bar')
+    ->status_is(200)
+    ->text_like('tbody td a' => qr/foo/)
     ->text_like('.alert-success', qr/All selected named locks released/);
   is $t->tx->previous->res->code, 302, 'right status';
   like $t->tx->previous->res->headers->location, qr/locks/, 'right "Location" value';
-  $t->post_ok('/minion/locks?_method=DELETE&name=foo')->status_is(200)->element_exists_not('tbody td a')
+  $t->post_ok('/minion/locks?_method=DELETE&name=foo')
+    ->status_is(200)
+    ->element_exists_not('tbody td a')
     ->text_like('.alert-success', qr/All selected named locks released/);
   is $t->tx->previous->res->code, 302, 'right status';
   like $t->tx->previous->res->headers->location, qr/locks/, 'right "Location" value';

These tests all appear to be related to the UI.

@kraih
Member

kraih commented Apr 26, 2025

There is really no point arguing about who should fix what. Just make the tests pass. This will take a whole lot more time if you wait for someone else to do it, that PR getting merged, and you having to rebase afterwards...
