BUGFIX: Remove from InProgress if we don't addToDead #92

Merged
merged 3 commits into gocraft:master on May 3, 2018

Conversation

@davars commented Apr 2, 2018

Tracked down why locks were being retained by jobs that weren't running; this branch is the result. The first commit is a simple fix. The second commit is a bit more involved and pulls removeJobFromInProgress out of any branches to ensure it always runs (sketched below, after the commit messages).

Commit Messages:

  • Remove from InProgress if we don't addToDead

  • removeJobFromInProgress critically needs to run when a job is dispatched regardless of the outcome; otherwise locks are retained for the lifetime of the worker. The easiest way to ensure it's called in every branch is to remove it from branches entirely. Its behavior can be modified selectively by passing in a closure, which can perform redis operations inside the same transaction as the common housekeeping ops.

  • Consistent error messages

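For readers who don't want to open the diff, here is a minimal, self-contained sketch of the closure pattern the second commit describes, using redigo directly. The names echo terminateOp, terminateOnly, and terminateAndRetry from the diff, but the redis key names and function bodies are illustrative assumptions, not the actual gocraft/work implementation.

package main

import (
	"fmt"

	"github.com/gomodule/redigo/redis"
)

// terminateOp contributes outcome-specific commands to the open MULTI.
type terminateOp func(conn redis.Conn)

// terminateOnly adds nothing extra; the shared cleanup is enough
// (e.g. a successful job, or a dead job with SkipDead set).
func terminateOnly(_ redis.Conn) {}

// terminateAndRetry schedules the raw job JSON onto a retry sorted set.
func terminateAndRetry(rawJSON []byte, retryAt int64) terminateOp {
	return func(conn redis.Conn) {
		conn.Send("ZADD", "jobs:retry", retryAt, rawJSON)
	}
}

// removeJobFromInProgress always runs, whatever the job's outcome, so the
// in-progress entry and the lock can never be leaked. The terminateOp runs
// inside the same MULTI/EXEC transaction as the common housekeeping.
func removeJobFromInProgress(conn redis.Conn, jobName string, rawJSON []byte, term terminateOp) error {
	conn.Send("MULTI")
	conn.Send("LREM", "jobs:"+jobName+":inprogress", 1, rawJSON)
	conn.Send("DECR", "jobs:"+jobName+":lock")
	term(conn)
	_, err := conn.Do("EXEC")
	return err
}

func main() {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	raw := []byte(`{"name":"example","args":{}}`)
	// Success (or SkipDead) path: cleanup only.
	_ = removeJobFromInProgress(conn, "example", raw, terminateOnly)
	// Failure path with retries remaining: cleanup plus a retry enqueue.
	_ = removeJobFromInProgress(conn, "example", raw, terminateAndRetry(raw, 1525305600))
}
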
@davars commented Apr 3, 2018

I just realized I never ran before / after benchmarks, so caveat emptor. I plan to do that this morning.

@davars commented Apr 3, 2018

Not exactly statistically rigorous, but there doesn't seem to be much difference (mine -> built with this PR, theirs -> built with 85f9368): https://gist.github.com/davars/f70ed4f9f2b2c90429d0b94054c55afa

@shdunning commented:

I plan to spend some time reviewing this today

@shdunning left a comment

Other than my minor nitpick about changing the text of the error keys, this looks great.

worker.go Outdated

func (w *worker) addToRetry(job *Job, runErr error) {
func terminateOnly(_ redis.Conn) { return }
func terminateAndRetry(w *worker, jt *jobType, job *Job) terminateOp {
rawJSON, err := job.serialize()
if err != nil {
logError("worker.add_to_retry", err)

Plz change the error key to worker.terminate_and_retry.serialize.

worker.go Outdated
rawJSON, err := job.serialize()

if err != nil {
logError("worker.add_to_dead.serialize", err)

Plz change the error key to worker.terminate_and_dead.serialize.

@@ -39,6 +40,13 @@ type jobType struct {
DynamicHandler reflect.Value
}

func (jt *jobType) calcBackoff(j *Job) int64 {

Nice! Thanks for adding this.

@shdunning commented:

So @davars, was the issue that when we exhausted retries for a job and the job was marked SkipDead = true, then it was never removed from the in-progress queue?

It looks like the redis commands used by removeJobFromInProgress were also in addToDead and addToRetry, and the path above is the only one through the code where I can see the job might not be removed from the in-progress queue.

Aside from this being a nice refactor, I'm guessing you tracked down a specific path that led to dead jobs not being cleaned up properly.
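
To make that path concrete, the pre-fix shape is roughly the following. This is a reconstruction for illustration only, with made-up names (jobOptions, finishFailedJob); the actual worker code differs.

package main

import "fmt"

// Made-up options struct for illustration; mirrors MaxFails/SkipDead.
type jobOptions struct {
	MaxFails uint
	SkipDead bool
}

// Pre-fix shape: the in-progress/lock cleanup is duplicated inside each
// terminal branch, so a branch that returns early (SkipDead) leaks it.
func finishFailedJob(fails uint, opts jobOptions) {
	if fails >= opts.MaxFails {
		if opts.SkipDead {
			return // BUG: never reaches any cleanup
		}
		fmt.Println("add to dead set; remove from in-progress; release lock")
		return
	}
	fmt.Println("add to retry set; remove from in-progress; release lock")
}

// Post-fix shape: branches only decide the extra work; the cleanup is
// unconditional (compare the terminateOp sketch earlier in this thread).
func finishFailedJobFixed(fails uint, opts jobOptions) {
	if fails >= opts.MaxFails {
		if !opts.SkipDead {
			fmt.Println("add to dead set")
		}
	} else {
		fmt.Println("add to retry set")
	}
	fmt.Println("remove from in-progress; release lock") // always runs
}

func main() {
	opts := jobOptions{MaxFails: 1, SkipDead: true}
	finishFailedJob(1, opts)      // prints nothing: the leak
	finishFailedJobFixed(1, opts) // still cleans up
}
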

@davars commented May 3, 2018

You're correct. In my setup I have my cron jobs enqueued with MaxFails: 1 and SkipDead: true. I was observing those queues getting stuck after an error until that pool was shut down (when the reaper would clean up the lock, presumably).

Working on the error messages now.

Side note, there's a rare case where locks get stuck and aren't cleaned up by the reaper. I haven't tracked that one down yet. I suspect it may be due to concurrent reapers, since it looks like each pool runs reapers without coordination. Maybe the reaper task should be an internal job that's run by the scheduler. Then it would get a job id, lock, be run by one pool per period, etc.
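
For readers who haven't used these options, a worker-pool configuration along the lines davars describes looks roughly like this. The namespace, job name, schedule, and job body are placeholders, not his actual code; older setups import redigo from github.com/garyburd/redigo/redis instead. MaxFails: 1 and SkipDead: true match the setup described above.

package main

import (
	"os"
	"os/signal"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

// Context is the per-job context type gocraft/work handlers hang off of.
type Context struct{}

// NightlyCleanup is a placeholder cron job body.
func (c *Context) NightlyCleanup(job *work.Job) error {
	// ... real work here ...
	return nil
}

func main() {
	redisPool := &redis.Pool{
		MaxActive: 5,
		MaxIdle:   5,
		Wait:      true,
		Dial: func() (redis.Conn, error) {
			return redis.Dial("tcp", "localhost:6379")
		},
	}

	pool := work.NewWorkerPool(Context{}, 10, "example_namespace", redisPool)

	// MaxFails: 1 means a single failure exhausts the job's retries;
	// SkipDead: true means it is discarded rather than kept in the dead
	// set -- the combination that exposed the lock leak in this PR.
	pool.JobWithOptions("nightly_cleanup",
		work.JobOptions{MaxFails: 1, SkipDead: true},
		(*Context).NightlyCleanup)

	// Enqueue it on a cron schedule (6-field spec, here 02:00 daily).
	pool.PeriodicallyEnqueue("0 0 2 * * *", "nightly_cleanup")

	pool.Start()

	// Block until interrupted, then stop the pool cleanly.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
	pool.Stop()
}
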

@shdunning left a comment

👍

@shdunning merged commit 72c8f57 into gocraft:master on May 3, 2018