Dagobah stuck in "running" state after task failure #130

Open
rclough opened this issue Dec 1, 2014 · 2 comments

Comments

@rclough
Collaborator

rclough commented Dec 1, 2014

I'm about 90% sure this is related to #129 and #126, but for posterity's sake I'm documenting the issue, and I will be investigating today.

Details

We have this one job, wh-log-transfer, which has not been sending emails due to #126. To mitigate this, we tried redirecting the logging output to a file instead of to stdout.

In doing this we actually caused a subsequent step to fail, since it depended on that output (or something like that). Instead of being marked as failed, though, the task was marked as "Running" indefinitely. Because of this, the job didn't run for a couple of days (a job won't rerun if it is already marked as "Running"), and nobody noticed because of Thanksgiving.

@rclough
Collaborator Author

rclough commented Dec 1, 2014

Suspicions

We are not running the new code from #128 in prod yet; we are running a patched version which just releases the thread lock if the dag still exists.

What I think is happening: the task fails and goes to send an email with the failure and log output. This hits that backend error, causing it to error out before the task can be marked as "Failed".
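To illustrate the suspected ordering problem, here is a minimal sketch of one way to avoid it: record the failed state first, then treat the email as best-effort. This is not Dagobah's actual code; `Task`, `handle_failure`, and `broken_notifier` are hypothetical names made up for the example, and the real fix may look quite different.

```python
import logging

logger = logging.getLogger(__name__)


class Task:
    """Hypothetical stand-in for a scheduler task (not Dagobah's real class)."""

    def __init__(self, name, notifier):
        self.name = name
        self.state = "running"
        self.notifier = notifier  # any callable taking (task_name, log_output)

    def handle_failure(self, log_output):
        # Record the terminal state first, so a broken notifier cannot
        # leave the task looking like it is still running.
        self.state = "failed"
        try:
            self.notifier(self.name, log_output)
        except Exception:
            # Notification is best-effort: log the problem and carry on.
            logger.exception("failure email for task %r could not be sent", self.name)


def broken_notifier(task_name, log_output):
    """Simulates the email step blowing up (an assumed failure mode)."""
    raise RuntimeError("simulated email backend error")


task = Task("wh-log-transfer", broken_notifier)
task.handle_failure("some log output")
print(task.state)
```

With this ordering, an exception in the email step gets logged, but the task still ends up in a state that allows the job to be rescheduled.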

I'm going to be digging into the code today to verify this and check if #128 addresses it. If not, I'll submit a patch.

@rclough
Collaborator Author

rclough commented Dec 1, 2014

Also, it would be a good idea to send an email (or at least have the option to) when a cron-scheduled job isn't being run because it's stuck in a running state, since that seems like an exceptional state one should know about.
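A rough sketch of what that option could look like, assuming the scheduler calls some hook each time a job comes due. Everything here (`on_schedule_tick`, `alert_on_skip`, the dict-based job) is hypothetical and only meant to show the idea, not how Dagobah's scheduler is actually structured.

```python
def on_schedule_tick(job, send_alert, alert_on_skip=True):
    """Hypothetical hook run when cron says the job is due.

    Returns True if the job was started, False if the run was skipped.
    """
    if job["state"] == "running":
        # The job is still marked as running, so this run is skipped.
        # Optionally tell someone instead of skipping silently.
        if alert_on_skip:
            send_alert(
                "Scheduled run of %r skipped: job is still marked as running"
                % job["name"]
            )
        return False
    job["state"] = "running"
    return True


alerts = []  # stands in for an email backend in this sketch
job = {"name": "wh-log-transfer", "state": "running"}
started = on_schedule_tick(job, alerts.append)
print(started, alerts)
```

Making the alert opt-in or opt-out per job would keep long-running jobs that legitimately overlap their schedule from generating noise.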
