BUG: Jobs don't start, seemingly randomly #126

Closed
rclough opened this issue Nov 21, 2014 · 8 comments

rclough (Collaborator) commented Nov 21, 2014

So this elusive bug cropped up. We originally didn't have the logging in place to figure it out, but now that the new logging is in place, there are some more hints.

This morning, our first job, mysql_load, kicked off at 00:30 EST and ran to completion. The second job of the day, wh-log-transfer, which started while the first was running, never actually kicked off. All other scheduled jobs from that point on didn't start and didn't log.

Here are the core logs from that time. You can see at the end that at around 7:30am someone got up and tried to rerun the job, which wasn't working. They then restarted dagobah and manually ran all the appropriate jobs.

2014-11-21 00:30:00,779 INFO [dagobah.core.core]: Starting job: mysql_load
2014-11-21 00:30:00,781 INFO [dagobah.core.core]: Creating subtask for task: START
2014-11-21 00:30:00,788 INFO [dagobah.core.core]: Starting job: fetch_emails
2014-11-21 00:30:00,790 INFO [dagobah.core.core]: Creating subtask for task: fetch_todays_configured_mail
2014-11-21 00:30:03,288 INFO [dagobah.core.core]: Creating subtask for task: Clean up old stuff
2014-11-21 00:30:05,796 INFO [dagobah.core.core]: Creating subtask for task: Run mysql_load Job
2014-11-21 01:00:00,578 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 01:04:31,179 INFO [dagobah.core.core]: Creating subtask for task: Summarize Job
2014-11-21 01:04:33,767 INFO [dagobah.core.core]: Creating subtask for task: Clean up this job run
2014-11-21 01:04:36,275 INFO [dagobah.core.core]: Creating subtask for task: END
2014-11-21 07:38:19,791 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:39:04,537 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:39:32,964 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:43:44,960 INFO [dagobah.core.core]: Initializing dagobah......

I have good news and bad news.

Bad news: the relevant stack trace is NOT included in either the request log OR the core logs. (This should change.)

Good news: the stack trace IS printed to stdout, which we capture.

Bad news: we overwrite the stdout logs every time dagobah restarts (we're working on fixing this).

Good news: my coworker grabbed the stack trace before they restarted it. Here it is, unfortunately without context:

Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/site-packages/dagobah/core/components.py", line 114, in run
    job.start()
  File "/usr/lib/python2.6/site-packages/dagobah/core/core.py", line 388, in start
    self.initialize_snapshot()
  File "/usr/lib/python2.6/site-packages/dagobah/core/core.py", line 665, in initialize_snapshot
    + "first destroying old snapshot.")
DagobahError: Attempting to initialize DAG snapshot without first destroying old snapshot.

I haven't investigated yet, but it seems like somehow a DAG snapshot for some job was never destroyed when the job completed, so the next attempt to start it threw this exception, and dagobah never recovered from that exception.
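
For context, here is a minimal sketch of the kind of guard the traceback points at. This is not the actual dagobah source; the attribute and method names below are assumptions based on the traceback and the error message. The idea is that a snapshot left over from a previous run makes every later start() raise.

# Illustrative sketch only -- not the real dagobah code.
class DagobahError(Exception):
    pass

class Job(object):
    def __init__(self):
        self.snapshot = None  # frozen copy of the DAG used during a run

    def initialize_snapshot(self):
        # Called from start(); if the previous run never cleaned up its
        # snapshot, this raises and the job can never start again.
        if self.snapshot is not None:
            raise DagobahError("Attempting to initialize DAG snapshot "
                               "without first destroying old snapshot.")
        self.snapshot = {"tasks": [], "dependencies": []}  # placeholder

    def destroy_snapshot(self):
        # Expected to run when a job run completes; if an exception skips
        # this, the guard above trips on the next run.
        self.snapshot = None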

rclough (Collaborator, Author) commented Nov 21, 2014

Also, interestingly damning, here are the request logs from around that time:

2014-11-20 21:08:34,588 INFO [werkzeug]: 192.168.30.203 - - [20/Nov/2014 21:08:34] "GET /api/job?job_name=User+Agents+(olap.ua_dim) HTTP/1.1" 200 -
2014-11-21 07:31:56,412 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:56] "GET / HTTP/1.1" 302 -
2014-11-21 07:31:56,896 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:56] "GET /jobs HTTP/1.1" 200 -
2014-11-21 07:31:57,071 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:57] "GET /static/css/kickstrap.css HTTP/1.1" 200 -

You can see when I left work (stopped polling some job page I had open), and then there are NO request logs until after dagobah restarted (despite the core logs showing that someone loaded a job page and tried to manually start the job).

thieman (Owner) commented Nov 21, 2014

Core should have logged a line Initializing DAG snapshot for job {} right before this exception. Are you seeing that anywhere?

rclough (Collaborator, Author) commented Nov 21, 2014

I don't see that anywhere in the core log at all.

thieman (Owner) commented Nov 21, 2014

Crap, the timer thread must not be able to write to the core logs, then.

thieman (Owner) commented Nov 21, 2014

Oh, actually, are you not running with loglevel debug? Can you try with that? You should get a lot more logging output. It's a new option in the config yml.

rclough (Collaborator, Author) commented Nov 21, 2014

Yeah, we're logging at info level. Changing that now.

rclough (Collaborator, Author) commented Nov 24, 2014

For documentation purposes (most of this was discussed offline):

The current theory is that _on_completion doesn't reach the end of the function due to a backend exception (in this case, caused by log output that is too large), leaving both the thread lock unreleased and the DAG snapshot undestroyed.
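
A minimal sketch of that suspected failure mode, with hypothetical names (this is not the actual dagobah code; _commit_run_log in particular is made up for illustration):

import threading

completion_lock = threading.Lock()

def _on_completion(job):
    completion_lock.acquire()
    # If the backend write below raises (e.g. because the run's log output
    # is too large for the backend), the two cleanup lines never execute:
    # the lock stays held and the snapshot stays alive, blocking all
    # future runs.
    job._commit_run_log()      # hypothetical backend write
    job.destroy_snapshot()
    completion_lock.release()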

The current solution is twofold (a sketch of both changes follows the list):

  1. Change the lock acquisition to use a context manager, so that even if the code errors out, the lock is still released.
  2. Stop raising an exception when the snapshot already exists, since that check only matters for a future feature, and overwriting the snapshot should not impact dagobah as it stands today. Instead, a warning will be issued in the logs.
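
A rough sketch of what those two changes could look like, again with illustrative names rather than the actual patch:

import logging
import threading

logger = logging.getLogger("dagobah.core.core")
completion_lock = threading.Lock()

def _on_completion(job):
    # 1. Acquire the lock with a context manager so it is released even
    #    if the backend write raises.
    with completion_lock:
        job._commit_run_log()      # hypothetical backend write
        job.destroy_snapshot()

def initialize_snapshot(job):
    # 2. Warn and overwrite instead of raising when an old snapshot is
    #    still hanging around.
    if job.snapshot is not None:
        logger.warning("Initializing DAG snapshot without first "
                       "destroying old snapshot; overwriting it.")
    job.snapshot = {"tasks": [], "dependencies": []}  # placeholder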

thieman (Owner) commented Nov 24, 2014

So there were two issues here:

  1. Giant logs were causing an exception in _on_completion, which meant a given Job could not complete successfully.
  2. That exception was causing the completion lock to never get released, so future runs could not proceed.

PR #128 fixed the second issue. @rclough, please open a separate issue for the logging errors.
