
Jobs lost randomly #857

Closed
danijelk opened this issue Jun 29, 2020 · 7 comments

Comments

@danijelk
Contributor

  • Horizon Version: 4.3.3
  • Laravel Version: 7.17.2
  • PHP Version: 7.4.7
  • Redis Driver & Version: phpredis 5.2.2

Description:

We're seeing random 'holes' in the jobs being processed. Some of our customers noticed it, and once we set up monitoring we could see that they were indeed right.
Screenshot 2020-06-29 at 15 18 37

Screenshot 2020-06-29 at 15 37 36

We've spent quite a few days on this but have reached a point where we need assistance. We cannot replicate the issue, and digging into the Lua scripts for the procedures has turned up nothing usable.

We can only see that the key horizon:JOBID exists, but it's nowhere to be found in the expected keys like horizon:queues:QUEUE, horizon:pending_jobs or horizon:recent_jobs.

We cannot replicate this locally, and can't see how it could even happen, as the first thing that gets added is the horizon:queues push, and only later the JobId key.
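
For reference, a minimal sketch of the kind of check described above, assuming phpredis, the default Horizon/queue key layout (job metadata in a hash under the horizon: prefix, the queue itself as a list, the dashboard indexes as sorted sets) and no extra Laravel key prefix; adjust names for your setup:

```php
<?php
// Hypothetical diagnostic snippet: see where a given job ID still exists.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$jobId = '529205628';     // example ID from the payload below
$queue = 'metric-data';

// Job metadata hash written by Horizon (the horizon:JOBID key)
var_dump($redis->hGetAll("horizon:{$jobId}"));

// Is the raw payload still sitting in the queue list?
foreach ($redis->lRange("queues:{$queue}", 0, -1) as $raw) {
    if (strpos($raw, $jobId) !== false) {
        echo "found in queues:{$queue}\n";
    }
}

// Index sets Horizon uses for the dashboard
var_dump($redis->zScore('horizon:recent_jobs', $jobId));
var_dump($redis->zScore('horizon:pending_jobs', $jobId));
```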

Our production environment is a single Redis 5.0.4 server on Ubuntu 20.04, plus a Kubernetes cluster running the Horizon workers, and we're chewing through quite a few jobs. Please disregard the failed jobs in the screenshot below; those are the result of testing various low timeouts suggested in other issues. That isn't the case here, since the job is stuck in the Pending state and there is never any object to move over to queues:reserved.
Screenshot 2020-06-29 at 15 28 11

So it could be some kind of race condition that we cannot replicate locally. But we see this happening 2-4 times daily on monitored tags.

Redis value for JobId:
{"id":"529205628","connection":"redis","created_at":"1593435603.7258","name":"Vemcogroup\\SensorParser\\Jobs\\SensorStreamParser","status":"pending","payload":"...s:3:\\\"job\\\";N;s:10:\\\"connection\\\";N;s:5:\\\"queue\\\";s:11:\\\"metric-data\\\";s:15:\\\"chainConnection\\\";N;s:10:\\\"chainQueue\\\";N;s:5:\\\"delay\\\";N;s:10:\\\"middleware\\\";a:0:{}s:7:\\\"chained\\\";a:0:{}}\"},\"id\":\"529205628\",\"attempts\":0,\"type\":\"job\",\"tags\":[\"xovis\",\"xovis-54:10:ec:b6:e5:94\",\"54:10:ec:b6:e5:94\"],\"pushedAt\":\"1593435603.7241\"}","queue":"metric-data","updated_at":"1593435603.7258"}

@taylorotwell
Member

Please let us know if you are able to reliably replicate it. Otherwise it will be difficult to proceed.

@enzolarosa

enzolarosa commented Jun 29, 2020

We had this issue in the past and solved it by scaling up the Redis instances on AWS.
Can you share the piece of code where you think the issue was, @mfbonfigli?

@mfbonfigli

Our investigation led us to think that jobs are lost when Redis runs out of memory: Laravel Horizon does not check the Redis response, so if there is an error it simply ignores it and carries on as if the job had been pushed onto the queue correctly.

Try checking whether Redis runs out of memory, as we saw that this can result in jobs randomly not being pushed onto the queue.
"Randomly" means that jobs are silently dropped until memory is freed. Once there is space available, new jobs will be pushed again, but the old ones that were never pushed are lost.
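
A rough sketch of that check, assuming a direct phpredis connection (with Laravel you could reach the underlying client via Redis::connection()->client() instead):

```php
<?php
// Compare used memory against maxmemory to see whether Redis is close to
// (or at) the point where writes start failing with -OOM.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$info = $redis->info('memory');

$used = (int) $info['used_memory'];
$max  = (int) $info['maxmemory'];   // 0 means no maxmemory configured

if ($max > 0) {
    printf("Redis memory: %.1f%% of maxmemory in use\n", 100 * $used / $max);
} else {
    printf("Redis memory: %s in use, no maxmemory set\n", $info['used_memory_human']);
}
```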

Going into the details of what we think the issue is: everything starts with phpredis, since Horizon indirectly communicates with Redis through phpredis. According to the phpredis docs:
"phpredis throws a RedisException object if it can't reach the Redis server (...). In any other problematic case that does not involve an unreachable server (such as a key not existing, an invalid command, etc.), phpredis will return FALSE."

Now, the command function in PhpRedisConnection.php only intercepts the RedisException class of errors; all other errors (the FALSE return value) have to be caught manually by checking the value returned by phpredis.

However, the pushRaw function in RedisQueue.php does not check the return value of the eval command (which goes through the aforementioned command function), so when Redis fails due to an OOM condition, Horizon behaves as if the job had been correctly pushed onto the queue when it was not.
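
To illustrate (this is not Horizon's actual code, just a minimal sketch of the missing check using phpredis directly, with a trivial stand-in for Horizon's push script):

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Stand-in for Horizon's push script: just RPUSH the payload onto the queue.
$script = "return redis.call('rpush', KEYS[1], ARGV[1])";

$result = $redis->eval($script, ['queues:metric-data', '{"id":"529205628"}'], 1);

if ($result === false) {
    // Without this check the failure is silent; getLastError() holds the
    // server-side reason, e.g. "OOM command not allowed when used memory > 'maxmemory'".
    throw new RuntimeException('Queue push failed: ' . $redis->getLastError());
}
```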

You still see these jobs in the Horizon dashboard because the data shown in the UI is not the same data as the queue itself, so seeing a job in the dashboard does not guarantee that the job is actually in the Redis processing queue.

Please correct me if something I stated is wrong.

@danijelk
Contributor Author

Thanks @enzolarosa & @mfbonfigli. I've upgraded our instance, which at least rules out the memory issue. Logically I can see it being possible, even if it seems like a small chance that the job key survives each time while the queue entry doesn't, but I agree that the return value should be handled in some way.

We are planning to build a new Redis driver with working options for serializer + compression (one can dig into laravel/framework#31182 to see why it doesn't work with the current raw commands), so adding the check on the return value would be a nice addition as well, I believe.

Let's hope the issue was memory only so we can focus on creating instead of debugging :)

@mfbonfigli

mfbonfigli commented Jun 29, 2020

Hope that helps, we know how hard it is to debug these issues :)

Anyway, we noticed that when the push was failing (returning FALSE), calling the getLastError function on the phpredis client returned this message: -OOM command not allowed when used memory > 'maxmemory'

Using this function may help you verify whether (and what) the problem is on the Redis side while pushing to the queue, regardless of the underlying cause (OOM or something else).
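
For example, a hypothetical helper along those lines (the key name and logging call are placeholders): clear the last error before the push, then inspect it afterwards to log exactly why the command failed.

```php
<?php
function pushWithErrorCheck(Redis $redis, string $queueKey, string $payload): bool
{
    // Reset any stale error so a failure here is unambiguous.
    $redis->clearLastError();

    $result = $redis->rPush($queueKey, $payload);

    if ($result === false) {
        error_log(sprintf(
            'Redis push to %s failed: %s',
            $queueKey,
            $redis->getLastError() ?? 'unknown error'
        ));

        return false;
    }

    return true;
}
```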

@danijelk
Contributor Author

danijelk commented Jul 1, 2020

So far so good, so I'm closing the ticket; hopefully we can look forward to a proper driver in the future with sanity checks of sorts.

Thanks for the community replies. I wouldn't have thought much about memory, since we weren't getting any errors from Redis and never hit the limit at 100%, but I'm guessing we were close at times and volatile-lru randomly chose to kill off some important bits and pieces :)
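
For anyone hitting the same thing: with volatile-lru (or any *-lru / *-random policy) Redis may silently evict keys under memory pressure, which is a bad fit for queue data, whereas noeviction makes writes fail loudly instead. A quick way to check the configured policy from PHP, assuming phpredis:

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Returns something like ['maxmemory-policy' => 'volatile-lru']
var_dump($redis->config('GET', 'maxmemory-policy'));
```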

@danijelk danijelk closed this as completed Jul 1, 2020
@driesvints
Member

Thanks all! :)
