Issue by rclough Monday Dec 01, 2014 at 17:33 GMT Originally opened as thieman#130
I'm like 90% sure this is related to thieman#129 and thieman#126, but for posterity's sake I'm documenting the issue, and will be investigating today.
Details
We have this one job, wh-log-transfer, which has not been sending emails due to thieman#126. To mitigate this, we tried redirecting the logging output to a file instead of to stdout.
When doing this we actually caused a subsequent step to fail which depended on that output (or something), but instead of failing, the task was marked as "Running" indefinitely. Because of this, the job didn't run for a couple days (because the job won't rerun if marked as "Running" already) and nobody noticed because Thanksgiving and such.
Comment by rclough Monday Dec 01, 2014 at 17:39 GMT
Suspicions
We are not running the new code from thieman#128 in prod yet; we are running a patched version which just releases the thread lock if the dag still exists.
What I think is happening is, the task fails, and goes to send an email with the failure and log output. This hits that backend error code, causing it to error out before the task can be marked as "Failed".
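To make the suspected ordering problem concrete, here is a minimal sketch, not the project's actual code. The `Task` class, its `state` and `notify_error` attributes, and the `broken_notifier` function are all hypothetical names made up for illustration. It assumes the fix would be to record the "Failed" state before attempting the email, and to keep a notification error from propagating.

```python
class Task:
    """Hypothetical stand-in for a scheduler task; not the real class."""

    def __init__(self, name, notifier):
        self.name = name
        self.notifier = notifier  # callable(subject, body); may raise
        self.state = "Running"
        self.notify_error = None

    def handle_failure(self, log_output):
        # Record the terminal state first, so an error while sending the
        # email cannot leave the task stuck in "Running".
        self.state = "Failed"
        try:
            self.notifier("Task %s failed" % self.name, log_output)
        except Exception as exc:
            # Keep the error for later inspection instead of propagating it.
            self.notify_error = exc


def broken_notifier(subject, body):
    # Simulates the email backend blowing up (assumption for this sketch).
    raise RuntimeError("simulated email backend error")


task = Task("wh-log-transfer", broken_notifier)
task.handle_failure("step exited non-zero")
print(task.state)  # prints "Failed"
```

If the state were instead updated after the notifier call, the raised error would skip that line and the task would stay "Running", which matches the behaviour described above.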
I'm going to be digging into the code today to verify this and check if thieman#128 addresses it. If not, I'll submit a patch.
Comment by rclough Monday Dec 01, 2014 at 18:21 GMT
Also, it would be a good idea to send an email if a cron-scheduled job isn't being run because it's always in a running state (or at least have this as an option), since that seems like an exceptional state one should know about.
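A rough sketch of what such a check could look like follows. The function name `should_alert_skipped_run` and the `max_runtime` threshold are assumptions made up for this example; nothing like them is confirmed to exist in the project.

```python
from datetime import datetime, timedelta


def should_alert_skipped_run(state, started_at, now, max_runtime):
    """Return True when a scheduled run is about to be skipped because the
    previous run has been in the "Running" state longer than max_runtime.

    All parameters are hypothetical: state is a string, started_at and now
    are datetimes, and max_runtime is a timedelta chosen per job.
    """
    if state != "Running":
        return False
    return now - started_at > max_runtime


# Example: a job that started two days ago and is still marked "Running".
started = datetime(2014, 11, 29, 3, 0)
now = datetime(2014, 12, 1, 3, 0)
stuck = should_alert_skipped_run("Running", started, now, timedelta(hours=6))
```

The scheduler would call this at each cron tick before skipping the run, and send the email when it returns True. Making `max_runtime` a per-job setting would cover the "have this option" part.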