BUG: Jobs don't start, seemingly randomly #126

Closed
rclough opened this issue Nov 21, 2014 · 8 comments

rclough (Collaborator) commented Nov 21, 2014

So this elusive bug cropped up. We originally didn't have the logging in place to figure it out, but now that the new logging is in place, there are some more hints.

This morning, our first job, mysql_load, kicked off at 00:30 EST and ran to completion. The second job of the day, wh-log-transfer, which started while the first was running, never actually kicked off. All other scheduled jobs from that point on didn't start and didn't log.

Here are the core logs from that time. You can see at the end that at around 7:30am someone got up and tried to rerun the job, which wasn't working. They then restarted dagobah and manually ran all the appropriate jobs.

2014-11-21 00:30:00,779 INFO [dagobah.core.core]: Starting job: mysql_load
2014-11-21 00:30:00,781 INFO [dagobah.core.core]: Creating subtask for task: START
2014-11-21 00:30:00,788 INFO [dagobah.core.core]: Starting job: fetch_emails
2014-11-21 00:30:00,790 INFO [dagobah.core.core]: Creating subtask for task: fetch_todays_configured_mail
2014-11-21 00:30:03,288 INFO [dagobah.core.core]: Creating subtask for task: Clean up old stuff
2014-11-21 00:30:05,796 INFO [dagobah.core.core]: Creating subtask for task: Run mysql_load Job
2014-11-21 01:00:00,578 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 01:04:31,179 INFO [dagobah.core.core]: Creating subtask for task: Summarize Job
2014-11-21 01:04:33,767 INFO [dagobah.core.core]: Creating subtask for task: Clean up this job run
2014-11-21 01:04:36,275 INFO [dagobah.core.core]: Creating subtask for task: END
2014-11-21 07:38:19,791 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:39:04,537 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:39:32,964 INFO [dagobah.core.core]: Starting job: wh-log-transfer
2014-11-21 07:43:44,960 INFO [dagobah.core.core]: Initializing dagobah......

I have good news and bad news.

Bad news: the relevant stack trace is NOT included in either the request log OR the core logs. (This should change.)

Good news: the stack trace IS printed to stdout, which we capture.

Bad news: we overwrite the stdout logs every time dagobah restarts (we're working on fixing this).

Good news: my coworker grabbed the stack trace before they restarted it. Here it is, unfortunately without context:

Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/site-packages/dagobah/core/components.py", line 114, in run
    job.start()
  File "/usr/lib/python2.6/site-packages/dagobah/core/core.py", line 388, in start
    self.initialize_snapshot()
  File "/usr/lib/python2.6/site-packages/dagobah/core/core.py", line 665, in initialize_snapshot
    + "first destroying old snapshot.")
DagobahError: Attempting to initialize DAG snapshot without first destroying old snapshot.

I haven't investigated yet, but it seems like somehow a DAG snapshot for some job was never destroyed when the job completed, so the next attempt to start it threw this exception, and dagobah never recovered from that exception.
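
For context, here is a minimal sketch of the kind of guard the traceback points at. This is not the actual dagobah source; the attribute and method names below are assumptions based on the traceback and the error message. The idea is that a snapshot left over from a previous run makes every later start() raise.

# Illustrative sketch only -- not the real dagobah code.
class DagobahError(Exception):
    pass

class Job(object):
    def __init__(self):
        self.snapshot = None  # frozen copy of the DAG used during a run

    def initialize_snapshot(self):
        # Called from start(); if the previous run never cleaned up its
        # snapshot, this raises and the job can never start again.
        if self.snapshot is not None:
            raise DagobahError("Attempting to initialize DAG snapshot "
                               "without first destroying old snapshot.")
        self.snapshot = {"tasks": [], "dependencies": []}  # placeholder

    def destroy_snapshot(self):
        # Expected to run when a job run completes; if an exception skips
        # this, the guard above trips on the next run.
        self.snapshot = None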

rclough (Collaborator, Author) commented Nov 21, 2014

Also, interestingly damning, here are the request logs from around that time:

2014-11-20 21:08:34,588 INFO [werkzeug]: 192.168.30.203 - - [20/Nov/2014 21:08:34] "GET /api/job?job_name=User+Agents+(olap.ua_dim) HTTP/1.1" 200 -
2014-11-21 07:31:56,412 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:56] "GET / HTTP/1.1" 302 -
2014-11-21 07:31:56,896 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:56] "GET /jobs HTTP/1.1" 200 -
2014-11-21 07:31:57,071 INFO [werkzeug]: 192.168.30.220 - - [21/Nov/2014 07:31:57] "GET /static/css/kickstrap.css HTTP/1.1" 200 -

You can see when I left work (stopped polling some job page I had open), and then there are NO request logs until after dagobah restarted (despite the core logs showing that someone loaded a job page and tried to manually start the job).

thieman (Owner) commented Nov 21, 2014

Core should have logged a line Initializing DAG snapshot for job {} right before this exception. Are you seeing that anywhere?

rclough (Collaborator, Author) commented Nov 21, 2014

I don't see that anywhere in the core log at all.

thieman (Owner) commented Nov 21, 2014

Crap, the timer thread must not be able to write to the core logs, then.

thieman (Owner) commented Nov 21, 2014

Oh, actually, are you not running with loglevel debug? Can you try with that? You should get a lot more logging output. It's a new option in the config yml.

rclough (Collaborator, Author) commented Nov 21, 2014

Yeah, we're logging at info level. Changing that now.

rclough (Collaborator, Author) commented Nov 24, 2014

For documentation purposes (most of this was discussed offline):

The current theory is that _on_completion doesn't reach the end of the function due to a backend exception (in this case, caused by log output that is too large), leaving both the thread lock unreleased and the DAG snapshot undestroyed.
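
A minimal sketch of that suspected failure mode, with hypothetical names (this is not the actual dagobah code; _commit_run_log in particular is made up for illustration):

import threading

completion_lock = threading.Lock()

def _on_completion(job):
    completion_lock.acquire()
    # If the backend write below raises (e.g. because the run's log output
    # is too large for the backend), the two cleanup lines never execute:
    # the lock stays held and the snapshot stays alive, blocking all
    # future runs.
    job._commit_run_log()      # hypothetical backend write
    job.destroy_snapshot()
    completion_lock.release()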

The current solution is twofold (a sketch of both changes follows the list):

  1. Change the lock acquisition to use a context manager, so that even if the code errors out, the lock is still released.
  2. Stop raising an exception when the snapshot already exists, since that check only matters for a future feature, and overwriting the snapshot should not impact dagobah as it stands today. Instead, a warning will be issued in the logs.
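
A rough sketch of what those two changes could look like, again with illustrative names rather than the actual patch:

import logging
import threading

logger = logging.getLogger("dagobah.core.core")
completion_lock = threading.Lock()

def _on_completion(job):
    # 1. Acquire the lock with a context manager so it is released even
    #    if the backend write raises.
    with completion_lock:
        job._commit_run_log()      # hypothetical backend write
        job.destroy_snapshot()

def initialize_snapshot(job):
    # 2. Warn and overwrite instead of raising when an old snapshot is
    #    still hanging around.
    if job.snapshot is not None:
        logger.warning("Initializing DAG snapshot without first "
                       "destroying old snapshot; overwriting it.")
    job.snapshot = {"tasks": [], "dependencies": []}  # placeholder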

thieman (Owner) commented Nov 24, 2014

So there were two issues here:

  1. Giant logs were causing an exception in _on_completion, which meant a given Job could not complete successfully.
  2. That exception was causing the completion lock to never get released, so future runs could not proceed.

PR #128 fixed the second issue. @rclough, please open a separate issue for the logging errors.
