BUG: Jobs don't start, seemingly randomly #126
Also, interestingly damning, the request logs from around that time:
You can see when I left work (stopped polling some job page I had open), and then NO request logs until after dagobah restarted, despite the core logs showing that someone loaded a job page and tried to manually start it.
Core should have logged a line.
I don't see that anywhere in the core log at all.
Crap, the timer thread must not be able to write to the core logs, then.
Oh, actually, are you not running with loglevel debug? Can you try with that? You should get a lot more logging output. It's a new option in the config yml.
Yeah, we're logging at info level. Changing that now.
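For context on what that change buys, here is a minimal, generic Python logging sketch (not dagobah's actual configuration handling) showing that debug-level messages are silently dropped while the level sits at INFO and only appear once it is raised to DEBUG:

```python
import logging

# At INFO, debug() calls are silently discarded, so lower-priority
# diagnostic output never makes it into the log.
logging.basicConfig(level=logging.INFO)
logging.debug("checking schedule")   # dropped
logging.info("job kicked off")       # written

# Raising the level to DEBUG makes those messages visible.
logging.getLogger().setLevel(logging.DEBUG)
logging.debug("next run time computed")  # now written
```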
For documentation purposes (most of this was discussed offline): the current theory is that… The current solution is twofold:
So there were two issues here:
PR #128 fixed the second issue. @rclough please open a separate issue for the logging errors.
So this elusive bug cropped up, which we originally didn't have the logging in place to figure out, but now that the new logging is in place, there are some more hints.
This morning, our first job, mysql_load, kicked off at 00:30 EST and ran to completion. The second job of the day, wh-log-transfer, which started while the first was running, never actually kicked off. All other scheduled jobs from that point on didn't start and didn't log. Here are the core logs from that time. You can see at the end, at around 7:30am, someone got up and tried to rerun the job, which wasn't working. They then restarted dagobah and manually ran all the appropriate jobs.
I've got good news and bad news.
Bad news: the relevant stack trace is NOT included in either the request log OR the core logs. (This should change.)
Good news: the stack trace IS printed to stdout, which we capture.
Bad news: we overwrite the stdout logs every time dagobah restarts (we're working on fixing this; a sketch of an append-mode fix follows below).
Good news: my coworker grabbed the stack trace before they restarted it. Here it is, unfortunately without context:
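As a side note on the overwrite problem above, here is a minimal sketch (hypothetical file name, assuming the daemon redirects its own stdout/stderr to a file at startup) of why opening that file in append mode would preserve tracebacks from earlier runs instead of truncating them on every restart:

```python
import sys

# Hypothetical startup code. Mode "w" truncates the capture file on every
# restart, which is how a traceback from the previous run gets lost;
# mode "a" keeps appending across restarts.
stdout_log = open("dagobah_stdout.log", "a", buffering=1)  # line-buffered append
sys.stdout = stdout_log
sys.stderr = stdout_log  # uncaught tracebacks go to stderr, so capture that too
```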
I haven't investigated yet, but it seems like a DAG snapshot for some job was somehow never destroyed when the job completed, which then caused an exception that was never recovered from.
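To make that theory concrete, here is a minimal sketch (hypothetical names, not dagobah's actual classes) of a single timer thread whose scheduling loop dies on one unhandled exception, after which no later job is ever started and nothing more reaches the core log, matching the symptoms above; wrapping each tick in a try/except keeps the loop alive and records the traceback:

```python
import logging
import time
import traceback

logger = logging.getLogger("scheduler")

def timer_loop(jobs, interval=60):
    """Hypothetical scheduler tick loop: without the try/except below,
    a single unhandled error would end this thread for good."""
    while True:
        for job in jobs:
            try:
                if job.should_run(time.time()):
                    job.start()
            except Exception:
                # If an exception escaped here -- e.g. from a stale DAG
                # snapshot that was never cleared after the previous run --
                # the thread would die silently and no later job would start.
                # Catching it keeps the loop alive and records the details.
                logger.error("scheduler tick failed for %s:\n%s",
                             job, traceback.format_exc())
        time.sleep(interval)

# The daemon would start this once from its main thread, e.g.:
# threading.Thread(target=timer_loop, args=(all_jobs,), daemon=True).start()
```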