Supervisor constantly crashing and cycle rebooting with high throughput  #1540

Open
@sts-ryan-holton

Horizon Version

5.30.3

Laravel Version

11.43.2

PHP Version

8.3

Redis Driver

PhpRedis

Redis Version

7

Database Driver & Version

MySQL 8

Description

Hi, I'm not sure whether this is a Horizon problem, but I think it's something that could be improved. I'm aware Redis can only use one CPU core, so I've split my app across 4 Redis instances, where each instance does its own thing.

One instance is responsible for processing 500,000 jobs every 5 minutes. They're pretty quick jobs, taking less than a second each.

I'm running two instances of Horizon on two different VMs configured via Forge, connected to my "statistics" Redis database.

But with this number of jobs, the supervisor keeps dying despite there being more than enough RAM and CPU available on each VM (32 GB of RAM and 32 cores per instance). There's not much data shown when it dies. How can I keep the supervisor alive for longer, and how can I get it to start quicker?
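For a rough sense of scale, the figures above can be turned into a required throughput. This is only a back-of-the-envelope sketch: it assumes the jobs arrive evenly across the 5-minute window and are split evenly between the two Horizon hosts, neither of which is guaranteed in practice.

```php
<?php

// Figures taken from the description above.
$jobsPerWindow = 500_000;  // jobs dispatched per window
$windowSeconds = 5 * 60;   // 5-minute window
$horizonHosts  = 2;        // two VMs running Horizon

// Assumption: even arrival over the window and an even split across hosts.
$jobsPerSecond        = $jobsPerWindow / $windowSeconds;
$jobsPerSecondPerHost = $jobsPerSecond / $horizonHosts;

printf(
    "total: %.0f jobs/s, per host: %.0f jobs/s\n",
    $jobsPerSecond,
    $jobsPerSecondPerHost
);
// prints "total: 1667 jobs/s, per host: 833 jobs/s"
```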

Given that Redis clustering isn't really an option here, what changes in Horizon or to my config should I make to reduce crashing or speed up rebooting?

Maybe there's a way for Horizon to remember the "last known" number of processes that worked and set it to that internally in the event of successive crashes?
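To make the idea concrete, here is a minimal sketch of what such "last known good" tracking might look like. Everything in it is hypothetical: `LastKnownGoodProcesses`, its methods, and the crash threshold are illustrative names and are not part of Horizon, and the state is kept in memory only, whereas a real implementation would need to persist it across restarts.

```php
<?php

// Hypothetical helper, NOT part of Horizon: remembers the last process count
// at which a supervisor was running without crashing.
final class LastKnownGoodProcesses
{
    private ?int $lastGood = null;
    private int $consecutiveCrashes = 0;

    public function __construct(
        private int $minProcesses,
        private int $crashThreshold = 2, // assumed definition of "successive crashes"
    ) {}

    // Call when the supervisor has been running stably with $processes workers.
    public function recordStable(int $processes): void
    {
        $this->lastGood = $processes;
        $this->consecutiveCrashes = 0;
    }

    // Call each time the supervisor dies.
    public function recordCrash(): void
    {
        $this->consecutiveCrashes++;
    }

    // Process count to boot with on the next restart.
    public function startingProcesses(): int
    {
        if ($this->consecutiveCrashes >= $this->crashThreshold && $this->lastGood !== null) {
            return $this->lastGood;
        }

        return $this->minProcesses;
    }
}

$tracker = new LastKnownGoodProcesses(minProcesses: 1);
$tracker->recordStable(12);
$tracker->recordCrash();
$tracker->recordCrash();

echo $tracker->startingProcesses(), "\n"; // prints 12
```

After a single crash the sketch still returns `minProcesses`. It only falls back to the remembered count once the crash threshold is reached, which matches the "successive crashes" condition described above.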

Steps To Reproduce

In this screenshot, the supervisor-statistics would show the red cross

  1. No more detail is shown next to the red cross about why it died
  2. When it dies it seems to take forever to restart
  3. It then dies a few moments later

How can I optimise this?

Image

My Horizon config for this part:

<?php

use Illuminate\Support\Str;

return [

    /*
    |--------------------------------------------------------------------------
    | Horizon Domain
    |--------------------------------------------------------------------------
    |
    | This is the subdomain where Horizon will be accessible from. If this
    | setting is null, Horizon will reside under the same domain as the
    | application. Otherwise, this value will serve as the subdomain.
    |
    */

    'domain' => env('HORIZON_DOMAIN'),

    /*
    |--------------------------------------------------------------------------
    | Horizon Path
    |--------------------------------------------------------------------------
    |
    | This is the URI path where Horizon will be accessible from. Feel free
    | to change this path to anything you like. Note that the URI will not
    | affect the paths of its internal API that aren't exposed to users.
    |
    */

    'path' => env('HORIZON_PATH', 'horizon'),

    /*
    |--------------------------------------------------------------------------
    | Horizon Redis Connection
    |--------------------------------------------------------------------------
    |
    | This is the name of the Redis connection where Horizon will store the
    | meta information required for it to function. It includes the list
    | of supervisors, failed jobs, job metrics, and other information.
    |
    */

    'use' => 'default',

    /*
    |--------------------------------------------------------------------------
    | Horizon Redis Prefix
    |--------------------------------------------------------------------------
    |
    | This prefix will be used when storing all Horizon data in Redis. You
    | may modify the prefix when you are running multiple installations
    | of Horizon on the same server so that they don't have problems.
    |
    */

    'prefix' => env(
        'HORIZON_PREFIX',
        Str::slug(env('APP_URL', 'url'), '_').Str::slug(env('APP_NAME', 'laravel'), '_').'_horizon:'
    ),

    /*
    |--------------------------------------------------------------------------
    | Horizon Route Middleware
    |--------------------------------------------------------------------------
    |
    | These middleware will get attached onto each Horizon route, giving you
    | the chance to add your own middleware to this list or change any of
    | the existing middleware. Or, you can simply stick with this list.
    |
    */

    'middleware' => ['web'],

    /*
    |--------------------------------------------------------------------------
    | Queue Wait Time Thresholds
    |--------------------------------------------------------------------------
    |
    | This option allows you to configure when the LongWaitDetected event
    | will be fired. Every connection / queue combination may have its
    | own, unique threshold (in seconds) before this event is fired.
    |
    */

    'waits' => [
        'redis:default' => 60,
    ],

    /*
    |--------------------------------------------------------------------------
    | Job Trimming Times
    |--------------------------------------------------------------------------
    |
    | Here you can configure for how long (in minutes) you desire Horizon to
    | persist the recent and failed jobs. Typically, recent jobs are kept
    | for one hour while all failed jobs are stored for an entire week.
    |
    */

    'trim' => [
        'recent' => 1,
        'pending' => 1,
        'completed' => 1,
        'recent_failed' => 5,
        'failed' => 5,
        'monitored' => 1,
    ],

    /*
    |--------------------------------------------------------------------------
    | Silenced Jobs
    |--------------------------------------------------------------------------
    |
    | Silencing a job will instruct Horizon to not place the job in the list
    | of completed jobs within the Horizon dashboard. This setting may be
    | used to fully remove any noisy jobs from the completed jobs list.
    |
    */

    'silenced' => [
        App\Jobs\ProcessAnalytic::class,
        App\Jobs\StoreApiRequestLog::class,
        App\Jobs\ProcessModelObserver::class,
        Laravel\Telescope\Jobs\ProcessPendingUpdates::class,
    ],

    /*
    |--------------------------------------------------------------------------
    | Metrics
    |--------------------------------------------------------------------------
    |
    | Here you can configure how many snapshots should be kept to display in
    | the metrics graph. This will get used in combination with Horizon's
    | `horizon:snapshot` schedule to define how long to retain metrics.
    |
    */

    'metrics' => [
        'trim_snapshots' => [
            'job' => 24,
            'queue' => 24,
        ],
    ],

    /*
    |--------------------------------------------------------------------------
    | Fast Termination
    |--------------------------------------------------------------------------
    |
    | When this option is enabled, Horizon's "terminate" command will not
    | wait on all of the workers to terminate unless the --wait option
    | is provided. Fast termination can shorten deployment delay by
    | allowing a new instance of Horizon to start while the last
    | instance will continue to terminate each of its workers.
    |
    */

    'fast_termination' => true,

    /*
    |--------------------------------------------------------------------------
    | Memory Limit (MB)
    |--------------------------------------------------------------------------
    |
    | This value describes the maximum amount of memory the Horizon master
    | supervisor may consume before it is terminated and restarted. For
    | configuring these limits on your workers, see the next section.
    |
    */

    'memory_limit' => 2048,

    /*
    |--------------------------------------------------------------------------
    | Queue Worker Configuration
    |--------------------------------------------------------------------------
    |
    | Here you may define the queue worker settings used by your application
    | in all environments. These supervisors and settings handle all your
    | queued jobs and will be provisioned by Horizon during deployment.
    |
    */

    'defaults' => [

        /**
         * Archive
         *
         * Archive
         */
        'supervisor-archive' => [
            'connection' => 'archive',
            'queue' => ['archive-prepare', 'archive-store'],
            'balance' => 'auto',
            'autoScalingStrategy' => 'time',
            'maxProcesses' => 1,
            'maxTime' => 60,
            'maxJobs' => 1000,
            'memory' => 512,
            'tries' => 1,
            'timeout' => 180,
            'nice' => 0,
            'sleep' => 3,
        ],

        /**
         * CSV exports
         *
         * CSV exports
         */
        'supervisor-csv-exports' => [
            'connection' => 'csv-exports',
            'queue' => ['csv-exports'],
            'balance' => false,
            'autoScalingStrategy' => 'time',
            'maxProcesses' => 1,
            'maxTime' => 60,
            'maxJobs' => 1000,
            'memory' => 2048,
            'tries' => 1,
            'timeout' => 700,
            'nice' => 0,
            'sleep' => 3,
        ],

        /**
         * Statistical jobs
         *
         * Jobs on this queue are optimised to run as quickly as possible.
         */
        'supervisor-statistics' => [
            'connection' => 'statistics',
            'queue' => ['statistics'],
            'balance' => 'auto',
            'autoScalingStrategy' => 'time',
            'maxProcesses' => 1,
            'maxTime' => 60,
            'maxJobs' => 1000,
            'memory' => 512,
            'tries' => 1,
            'timeout' => 40,
            'nice' => 0,
            'sleep' => 3,
        ],

        /**
         * Fast jobs
         *
         * Jobs on this queue are optimised to run as quickly as possible.
         */
        'supervisor-fast-jobs' => [
            'connection' => 'redis-short-running',
            'queue' => ['on-demand-runs-now', 'redirects', 'listeners', 'observers', 'notifications', 'redis-short-running', 'default'],
            'balance' => 'auto',
            'autoScalingStrategy' => 'time',
            'maxProcesses' => 1,
            'maxTime' => 60,
            'maxJobs' => 1000,
            'memory' => 512,
            'tries' => 1,
            'timeout' => 40,
            'nice' => 0,
            'sleep' => 0,
        ],

        /**
         * Slow jobs
         *
         * Jobs on this queue are allowed to run for a longer period of time.
         */
        'supervisor-slow-jobs' => [
            'connection' => 'redis-long-running',
            'queue' => ['on-demand-runs-now', 'applications', 'pingtrees', 'icicle', 'automations', 'redis-long-running', 'default'],
            'balance' => 'auto',
            'autoScalingStrategy' => 'time',
            'maxProcesses' => 1,
            'maxTime' => 60,
            'maxJobs' => 1000,
            'memory' => 512,
            'tries' => 1,
            'timeout' => 700,
            'nice' => 0,
            'sleep' => 0,
        ],
    ],

    'environments' => [
        'production' => [
            'supervisor-archive' => [
                'minProcesses' => 1,
                'maxProcesses' => 25,
                'balanceMaxShift' => 1,
                'balanceCooldown' => 1,
            ],
            'supervisor-csv-exports' => [
                'minProcesses' => 1,
                'maxProcesses' => 1,
                'balanceMaxShift' => 1,
                'balanceCooldown' => 1,
            ],
            'supervisor-statistics' => [
                'minProcesses' => 1,
                'maxProcesses' => 25,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 2,
            ],
            'supervisor-fast-jobs' => [
                'minProcesses' => 1,
                'maxProcesses' => 100,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 2,
            ],
            'supervisor-slow-jobs' => [
                'minProcesses' => 3,
                'maxProcesses' => 100,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 2,
            ],
        ],

        'local' => [
            'supervisor-archive' => [
                'minProcesses' => 1,
                'maxProcesses' => 5,
                'balanceMaxShift' => 1,
                'balanceCooldown' => 1,
            ],
            'supervisor-csv-exports' => [
                'minProcesses' => 1,
                'maxProcesses' => 1,
                'balanceMaxShift' => 1,
                'balanceCooldown' => 1,
            ],
            'supervisor-statistics' => [
                'minProcesses' => 1,
                'maxProcesses' => 25,
                'balanceMaxShift' => 5,
                'balanceCooldown' => 1,
            ],
            'supervisor-fast-jobs' => [
                'minProcesses' => 1,
                'maxProcesses' => 40,
                'balanceMaxShift' => 5,
                'balanceCooldown' => 1,
            ],
            'supervisor-slow-jobs' => [
                'minProcesses' => 3,
                'maxProcesses' => 40,
                'balanceMaxShift' => 5,
                'balanceCooldown' => 1,
            ],
        ],
    ],
];
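For reference, the effective production settings for `supervisor-statistics` can be sketched by combining its `defaults` entry with its `environments.production` entry. This assumes environment values override defaults key by key, modelled here with a plain `array_merge`. The load estimate at the end further assumes that the 500,000 jobs per 5 minutes all land on the `statistics` queue and are spread evenly over two hosts that have both scaled up to `maxProcesses`.

```php
<?php

// Values copied from the 'defaults' section of the config above.
$defaults = [
    'connection' => 'statistics',
    'queue' => ['statistics'],
    'balance' => 'auto',
    'autoScalingStrategy' => 'time',
    'maxProcesses' => 1,
    'maxTime' => 60,
    'maxJobs' => 1000,
    'memory' => 512,
    'tries' => 1,
    'timeout' => 40,
    'nice' => 0,
    'sleep' => 3,
];

// Values copied from 'environments.production' above.
$production = [
    'minProcesses' => 1,
    'maxProcesses' => 25,
    'balanceMaxShift' => 3,
    'balanceCooldown' => 2,
];

// Assumption: environment values win over defaults, key by key.
$effective = array_merge($defaults, $production);

// Rough per-worker load under the even-spread assumptions stated above.
$workers                = 2 * $effective['maxProcesses'];
$jobsPerWorkerPerSecond = (500_000 / 300) / $workers;
$secondsUntilMaxJobs    = $effective['maxJobs'] / $jobsPerWorkerPerSecond;

printf(
    "%d workers, %.1f jobs/s each, maxJobs reached after ~%.0f s\n",
    $workers,
    $jobsPerWorkerPerSecond,
    $secondsUntilMaxJobs
);
// prints "50 workers, 33.3 jobs/s each, maxJobs reached after ~30 s"
```

Under these assumptions each worker would need to average roughly 30 ms per job to keep up, and would reach its `maxJobs` limit of 1000 after about 30 seconds, well before the `maxTime` of 60 seconds.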
