
Jobs lost randomly #857

Closed
danijelk opened this issue Jun 29, 2020 · 7 comments

Comments

@danijelk
Contributor

  • Horizon Version: 4.3.3
  • Laravel Version: 7.17.2
  • PHP Version: 7.4.7
  • Redis Driver & Version: phpredis 5.2.2

Description:

We're seeing random 'holes' in the jobs being processed. Some of our customers noticed it, and once we set up monitoring we could see that they were indeed right.
Screenshot 2020-06-29 at 15 18 37

Screenshot 2020-06-29 at 15 37 36

We've spent quite a few days on this but have reached a point where we need assistance. We cannot replicate the issue, and digging into the Lua scripts for the procedures has turned up nothing usable.

We can only see that the key horizon:JOBID exists, but it's nowhere to be found in the expected keys like horizon:queues:QUEUE, horizon:pending_jobs or horizon:recent_jobs.

We cannot replicate this locally, and can't see how it could even happen, as the first thing that gets added is the horizon:queues push, and only later the JobId key.
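
For reference, a minimal sketch of the kind of check described above, assuming phpredis, the default Horizon/queue key layout (job metadata in a hash under the horizon: prefix, the queue itself as a list, the dashboard indexes as sorted sets) and no extra Laravel key prefix; adjust names for your setup:

```php
<?php
// Hypothetical diagnostic snippet: see where a given job ID still exists.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$jobId = '529205628';     // example ID from the payload below
$queue = 'metric-data';

// Job metadata hash written by Horizon (the horizon:JOBID key)
var_dump($redis->hGetAll("horizon:{$jobId}"));

// Is the raw payload still sitting in the queue list?
foreach ($redis->lRange("queues:{$queue}", 0, -1) as $raw) {
    if (strpos($raw, $jobId) !== false) {
        echo "found in queues:{$queue}\n";
    }
}

// Index sets Horizon uses for the dashboard
var_dump($redis->zScore('horizon:recent_jobs', $jobId));
var_dump($redis->zScore('horizon:pending_jobs', $jobId));
```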

Our production environment is a single Redis 5.0.4 server on Ubuntu 20.04, plus a Kubernetes cluster running the Horizon workers, and we're chewing through quite a few jobs. Please disregard the failed jobs in the screenshot below; those are the result of testing various low timeouts suggested in other issues. That isn't the case here, since the job is stuck in the Pending state and there is never any object to move over to queues:reserved.
Screenshot 2020-06-29 at 15 28 11

So it could be some kind of race condition that we cannot replicate locally. But we see this happening 2-4 times daily on monitored tags.

Redis value for JobId:
{"id":"529205628","connection":"redis","created_at":"1593435603.7258","name":"Vemcogroup\\SensorParser\\Jobs\\SensorStreamParser","status":"pending","payload":"...s:3:\\\"job\\\";N;s:10:\\\"connection\\\";N;s:5:\\\"queue\\\";s:11:\\\"metric-data\\\";s:15:\\\"chainConnection\\\";N;s:10:\\\"chainQueue\\\";N;s:5:\\\"delay\\\";N;s:10:\\\"middleware\\\";a:0:{}s:7:\\\"chained\\\";a:0:{}}\"},\"id\":\"529205628\",\"attempts\":0,\"type\":\"job\",\"tags\":[\"xovis\",\"xovis-54:10:ec:b6:e5:94\",\"54:10:ec:b6:e5:94\"],\"pushedAt\":\"1593435603.7241\"}","queue":"metric-data","updated_at":"1593435603.7258"}

@taylorotwell
Member

Please let us know if you are able to reliably replicate it. Otherwise it will be difficult to proceed.

@enzolarosa

enzolarosa commented Jun 29, 2020

We had this issue in the past and solved it by scaling up the Redis instances on AWS.
Can you share the piece of code where you think the issue was, @mfbonfigli?

@mfbonfigli

Our investigation led us to think that jobs are lost when Redis runs out of memory: Laravel Horizon does not check the Redis response, so if there is an error it simply ignores it and carries on as if the job had been pushed onto the queue correctly.

Try checking whether Redis runs out of memory, as we saw that this can result in jobs randomly not being pushed onto the queue.
"Randomly" means that jobs are silently dropped until memory is freed. Once there is space available, new jobs will be pushed again, but the old ones that were never pushed are lost.
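
A rough sketch of that check, assuming a direct phpredis connection (with Laravel you could reach the underlying client via Redis::connection()->client() instead):

```php
<?php
// Compare used memory against maxmemory to see whether Redis is close to
// (or at) the point where writes start failing with -OOM.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$info = $redis->info('memory');

$used = (int) $info['used_memory'];
$max  = (int) $info['maxmemory'];   // 0 means no maxmemory configured

if ($max > 0) {
    printf("Redis memory: %.1f%% of maxmemory in use\n", 100 * $used / $max);
} else {
    printf("Redis memory: %s in use, no maxmemory set\n", $info['used_memory_human']);
}
```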

Going into the details of what we think the issue is: everything starts with phpredis, since Horizon indirectly communicates with Redis through phpredis. According to the phpredis docs:
"phpredis throws a RedisException object if it can't reach the Redis server (...). In any other problematic case that does not involve an unreachable server (such as a key not existing, an invalid command, etc.), phpredis will return FALSE."

Now, the command function in PhpRedisConnection.php only intercepts the RedisException class of errors; all other errors (the FALSE return value) have to be caught manually by checking the value returned by phpredis.

However, the pushRaw function in RedisQueue.php does not check the return value of the eval command (which goes through the aforementioned command function), so when Redis fails due to an OOM condition, Horizon behaves as if the job had been correctly pushed onto the queue when it was not.
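
To illustrate (this is not Horizon's actual code, just a minimal sketch of the missing check using phpredis directly, with a trivial stand-in for Horizon's push script):

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Stand-in for Horizon's push script: just RPUSH the payload onto the queue.
$script = "return redis.call('rpush', KEYS[1], ARGV[1])";

$result = $redis->eval($script, ['queues:metric-data', '{"id":"529205628"}'], 1);

if ($result === false) {
    // Without this check the failure is silent; getLastError() holds the
    // server-side reason, e.g. "OOM command not allowed when used memory > 'maxmemory'".
    throw new RuntimeException('Queue push failed: ' . $redis->getLastError());
}
```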

You still see these jobs in the Horizon dashboard because the data shown in the UI is not the same data as the queue itself, so seeing a job in the dashboard does not guarantee that the job is actually in the Redis processing queue.

Please correct me if something I stated is wrong.

@danijelk
Contributor Author

Thanks @enzolarosa & @mfbonfigli. I've upgraded our instance, which at least rules out the memory issue. Logically I can see it being possible, even if it seems like a small chance that the job key survives each time while the queue entry doesn't, but I agree that the return value should be handled in some way.

We are planning to build a new Redis driver with working options for serializer + compression (one can dig into laravel/framework#31182 to see why it doesn't work with the current raw commands), so adding the check on the return value would be a nice addition as well, I believe.

Let's hope the issue was memory only so we can focus on creating instead of debugging :)

@mfbonfigli

mfbonfigli commented Jun 29, 2020

Hope that helps, we know how hard it is to debug these issues :)

Anyway, we noticed that when the push was failing (returning FALSE), calling the getLastError function on the phpredis client returned this message: -OOM command not allowed when used memory > 'maxmemory'

Using this function may help you verify whether (and what) the problem is on the Redis side while pushing to the queue, regardless of the underlying cause (OOM or something else).
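
For example, a hypothetical helper along those lines (the key name and logging call are placeholders): clear the last error before the push, then inspect it afterwards to log exactly why the command failed.

```php
<?php
function pushWithErrorCheck(Redis $redis, string $queueKey, string $payload): bool
{
    // Reset any stale error so a failure here is unambiguous.
    $redis->clearLastError();

    $result = $redis->rPush($queueKey, $payload);

    if ($result === false) {
        error_log(sprintf(
            'Redis push to %s failed: %s',
            $queueKey,
            $redis->getLastError() ?? 'unknown error'
        ));

        return false;
    }

    return true;
}
```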

@danijelk
Contributor Author

danijelk commented Jul 1, 2020

So far so good, so I'm closing the ticket; hopefully we can look forward to a proper driver in the future with sanity checks of sorts.

Thanks for the community replies. I wouldn't have thought much about memory, since we weren't getting any errors from Redis and never hit the limit at 100%, but I'm guessing we were close at times and volatile-lru randomly chose to kill off some important bits and pieces :)
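
For anyone hitting the same thing: with volatile-lru (or any *-lru / *-random policy) Redis may silently evict keys under memory pressure, which is a bad fit for queue data, whereas noeviction makes writes fail loudly instead. A quick way to check the configured policy from PHP, assuming phpredis:

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Returns something like ['maxmemory-policy' => 'volatile-lru']
var_dump($redis->config('GET', 'maxmemory-policy'));
```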

@danijelk danijelk closed this as completed Jul 1, 2020
@driesvints
Member

Thanks all! :)
