Jobs lost randomly #857
Comments
Please let us know if you are able to reliably replicate it. Otherwise it will be difficult to proceed.
We had this issue in the past, and we solved it by increasing the size of our Redis instances on AWS.
Our investigation led us to think that jobs are being lost when Redis runs out of memory: Laravel Horizon does not check the Redis response, so if there is an error it just ignores it and goes on as if the job had been pushed onto the queue correctly. Try to check whether Redis runs out of memory, as we saw that this can result in jobs randomly not being pushed onto the queue. Going into the details of what we think the issue was: Horizon pushes the job payload to Redis without inspecting the reply, so a push that Redis rejects (for example with an OOM error) fails silently. You still see the jobs in the Horizon dashboard because the data shown in the GUI is not the same data as the queue itself, so seeing a job in the Horizon dashboard does not guarantee that the job is actually in the Redis processing queue. Please correct me if something I stated is wrong.
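To make the kind of check being suggested concrete, here is a minimal sketch using the phpredis extension directly. This is not Horizon's actual push path; the queue key and payload are placeholders chosen for illustration.

```php
<?php
// Minimal sketch, assuming the phpredis extension; the queue key and payload
// are placeholders, not Horizon's real internals.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$payload = json_encode(['id' => '529205628', 'displayName' => 'SensorStreamParser']);

try {
    // rPush returns the new list length on success; on a Redis-side error
    // (e.g. "OOM command not allowed when used memory > 'maxmemory'")
    // phpredis typically returns false and records the error message.
    $result = $redis->rPush('queues:metric-data', $payload);

    if ($result === false) {
        // Surface the failure instead of silently continuing.
        throw new RuntimeException('Redis refused the push: ' . $redis->getLastError());
    }
} catch (RedisException $e) {
    // Connection-level failures surface as exceptions instead.
    throw new RuntimeException('Could not reach Redis: ' . $e->getMessage(), 0, $e);
}
```

The point is simply that the reply (or last error) is inspected rather than discarded, so an out-of-memory rejection cannot pass unnoticed.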
Thanks @enzolarosa & @mfbonfigli, I've upgraded our instance; at least that will rule out the memory issue. Logically I can see it being possible, though it seems like a small chance that the job key survives each time while the queue entry does not. I agree that the return value should be checked in some sense. We are planning to make a new Redis driver with working options for serializer + compression (one can dig into laravel/framework#31182 to see why they don't work with the current raw commands), so adding a check on the return value would be a nice addition as well. Let's hope the issue was memory only so we can focus on creating instead of debugging :)
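For context on the serializer/compression point, here is a hedged sketch of the phpredis options involved (this is not the planned driver, and it assumes phpredis built with igbinary and LZF support). As I understand the linked issue, transparent serialization/compression applies to typed commands but not to arguments of raw commands such as eval, which is what Horizon's Lua-based pushes use.

```php
<?php
// Sketch only: shows the phpredis options being discussed, not the new driver.
// Assumes phpredis compiled with igbinary and LZF support.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Transparent serialization/compression applies to typed commands...
$redis->setOption(Redis::OPT_SERIALIZER, Redis::SERIALIZER_IGBINARY);
$redis->setOption(Redis::OPT_COMPRESSION, Redis::COMPRESSION_LZF);
$redis->set('typed-key', ['gets' => 'serialized + compressed']);

// ...but raw/scripted commands receive their arguments untouched, so a value
// pushed via eval() (as Horizon's Lua scripts do) bypasses these options.
$redis->eval("return redis.call('set', KEYS[1], ARGV[1])", ['raw-key', 'plain string'], 1);
```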
Hope that helps, we know how hard it is to debug these issues :) Anyway, checking the raw reply returned by Redis may be useful for you to verify if (and what) the issue is on the Redis side while pushing to the queue, regardless of what the underlying cause is (OOM or something else).
So far so good, so I'm closing the ticket, and hopefully we can look forward to a proper driver in the future with sanity checks of sorts. Thanks for the community replies; I wouldn't have thought of memory much, as we weren't getting any errors from Redis, nor did we ever see the limit hit at 100%, but I'm guessing we were close at times and volatile-lru randomly chose to kill off some important bits and pieces :)
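Since the suspicion here is eviction rather than a hard, visible failure, a hedged sketch of how the memory picture can be checked from the application side (again with phpredis; the fields read are standard INFO/CONFIG output):

```php
<?php
// Sketch: inspect memory usage, eviction policy and eviction counter,
// assuming phpredis.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$memory = $redis->info('memory');
$stats  = $redis->info('stats');
$policy = $redis->config('GET', 'maxmemory-policy');

$used    = (int) $memory['used_memory'];
$max     = (int) ($memory['maxmemory'] ?? 0);
$evicted = (int) ($stats['evicted_keys'] ?? 0);

// Eviction only kicks in once maxmemory is exceeded, but monitoring can easily
// miss short spikes, so a non-zero evicted_keys counter is the more reliable
// signal that volatile-lru has been throwing keys away.
echo "policy={$policy['maxmemory-policy']} used={$used} max={$max} evicted_keys={$evicted}\n";
```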
Thanks all! :)
Description:
We're getting random 'holes' in the jobs being processed. Some customers noticed it, and once we set up monitoring we could see that they were indeed right.
We've spent quite a few days on this but have reached a point where we need assistance. We cannot replicate the issue, and digging into the Lua scripts for the procedures has resulted in nothing usable.
We can only see that the key horizon:JOBID exists, but it's nowhere to be found in the necessary keys like horizon:queues:QUEUE, horizon_pending_jobs or horizon:recent_jobs.
We cannot replicate this locally, and can't see how it could happen, as the first thing that gets added is the horizon:queues push, and only later the JobId key.
Our production environment is one Redis 5.0.4 server on Ubuntu 20.04; we have a k8s cluster with Horizon servers set up, and they are chewing through quite a few jobs. Please disregard the failed jobs, as they are the result of testing various low timeouts per other issues; that isn't the case here, since the job is stuck in the Pending state and never has any object to move over to queues:reserved.
So it could be some kind of race condition that we cannot replicate locally. But we see this happening 2-4 times daily on monitored tags.
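For anyone hitting the same pattern, a hedged sketch of the consistency check we ran by hand (phpredis; the key names follow the ones mentioned in this report and the job id below, and they may differ depending on your prefix configuration):

```php
<?php
// Sketch: verify whether a job's metadata key exists while the job itself is
// missing from the queue list. Assumes phpredis; adjust key names to the ones
// your setup actually uses.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$jobId = '529205628';
$queue = 'metric-data';

$jobKey   = "horizon:{$jobId}";          // metadata key mentioned in the report
$queueKey = "horizon:queues:{$queue}";   // queue list key as described above

$hasMetadata = (bool) $redis->exists($jobKey);

// Scan the queue list for a payload containing this job id.
$inQueue = false;
foreach ($redis->lRange($queueKey, 0, -1) as $payload) {
    if (strpos($payload, "\"id\":\"{$jobId}\"") !== false) {
        $inQueue = true;
        break;
    }
}

echo $hasMetadata && !$inQueue
    ? "orphaned: metadata exists but the job is not on the queue\n"
    : "consistent\n";
```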
Redis value for JobId:
{"id":"529205628","connection":"redis","created_at":"1593435603.7258","name":"Vemcogroup\\SensorParser\\Jobs\\SensorStreamParser","status":"pending","payload":"...s:3:\\\"job\\\";N;s:10:\\\"connection\\\";N;s:5:\\\"queue\\\";s:11:\\\"metric-data\\\";s:15:\\\"chainConnection\\\";N;s:10:\\\"chainQueue\\\";N;s:5:\\\"delay\\\";N;s:10:\\\"middleware\\\";a:0:{}s:7:\\\"chained\\\";a:0:{}}\"},\"id\":\"529205628\",\"attempts\":0,\"type\":\"job\",\"tags\":[\"xovis\",\"xovis-54:10:ec:b6:e5:94\",\"54:10:ec:b6:e5:94\"],\"pushedAt\":\"1593435603.7241\"}","queue":"metric-data","updated_at":"1593435603.7258"}