Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce unknown job status to handle communication problem with resource manager #179

Closed
wants to merge 8 commits into from

Conversation

cmd-ntrf
Copy link
Contributor

This is a draft PR to handle the case where the resource manager does not return a status that is valid for BatchSpawner, but the error message returned indicate a problem with the resource manager and not the job.

This happens quite often with Slurm where squeue would return slurm_load_jobs error: Socket timed out on send/recv. The notebook job is still running fine, squeue was just not able to query its state. Currently, batchspawner will clear the state if it cannot query it, and the job will be subsequently cancelled, which is inconvenient. This problem has been exposed in #171 and #178.

The proposed solution is to refactor the job status querying to return a JobStatus object instead of the job_status string. We introduce four states: NOTQUEUED, RUNNING, PENDING, and UNKNOWN. Because the function deal with job status and not state, the method read_job_state is renamed query_job_status. This also makes the method names more consistent with the documentation.

Instead of calling the job_is* function, query_status_job returns a JobStatus object which can be used to determine what should be done next. The unknown status is determined by state_isunknown function, and for now only SlurmSpawner implements a regex to handle the cases where the query command exit code is 1.

Copy link
Contributor

@rkdarst rkdarst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea. Even if it wasn't needed, I think it simplifies some logic I had wanted to simplify anyway. There are some miscellaneous suggestions which I added line-by-line.

Anyway, I would recommend accepting, we can remove the TMPFAIL state later if there turns out to be a better way.

Other suggestions:

  • allow some number of TMPFAILs before it actually returns NOTPRESENT.
  • We could ask Min if a "allow n failures" could be added to jupyterhub itself. I think that spawner.py:start_polling is the place to look to do it.
  • It would be nice to add some generic caching layer, but do you think anyone would do it in a quick timeframe? I doubt I would, this seems like it works well enough for now for what I need...

I'll work to make tests pass if @cmd-ntrf accepts or rejects the suggestions below.

Thanks for the good work!

batchspawner/batchspawner.py Outdated Show resolved Hide resolved
batchspawner/batchspawner.py Show resolved Hide resolved
self.job_status = out
self.job_status = await self.run_command(cmd)
except RuntimeError as e:
self.job_status = e.args[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.job_status = e.args[0]
# e.args[0] is stderr from the process
self.job_status = e.args[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this change doesn't appear, or am I misinterperting something?

batchspawner/batchspawner.py Outdated Show resolved Hide resolved
@@ -326,22 +340,20 @@ def state_isrunning(self):
"Return boolean indicating if job is running, likely by parsing self.job_status"
raise NotImplementedError("Subclass must provide implementation")

def state_isunknown(self):
"Return boolean indicating if job state retrieval failed because of the resource manager"
raise False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise False
return False

I'm not sure if you meant this to be return False or raise NotImplementedError(). I think it makes sense that this defaults to False, it doesn't need to be implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems to be not applied. And I got another idea, it should be None if not implemented. Equivalent to False for practical purposes in a boolean context, but if someone really wanted to know if it was implemented or not, then they can.

Suggested change
raise False
return None

return self.job_status and re.search(self.state_running_re, self.job_status)

def state_isunknown(self):
assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
if not self.state_unknown_re:
return False

I guess this doesn't need to be implemented for all spawners... it goes back to default behavior if it's not here.

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
If this variable is set, the match object will be expanded using this string
to obtain the notebook IP.
See Python docs: re.match.expand""").tag(config=True)
state_unknown_re = Unicode('^$',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
state_unknown_re = Unicode('^$',
state_unknown_re = Unicode('',

As below, I guess this should be a false-like value if it's not set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the regex is an empty string, it will match any string and state_isunknown will always return True. The regex currently implemented will only match empty string, which I think is a good case to conclude the state of the job is unknown when querying the resource manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what I was missing. If it was an empty string, in my mind it should be considered unset, and then not checked and it is never "unknown". I think "unknown" should be an optional state, and if it's not set, it will simply never return true.

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
If this variable is set, the match object will be expanded using this string
to obtain the notebook IP.
See Python docs: re.match.expand""").tag(config=True)
state_unknown_re = Unicode('^$',
help="Regex that matches job_status if the resource manager is not answering").tag(config=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="Regex that matches job_status if the resource manager is not answering").tag(config=True)
help="Regex that matches job_status if the resource manager is not answering. An empty string means 'not in use'.").tag(config=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^-- this is what empty string means

Copy link
Contributor

@rkdarst rkdarst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more ideas below. The semantics of the empty state_unknown_re are up for debate, but I think we should somehow allow it to be undefined. It could be None as empty value instead, but we'd need to see how to do that with traitlets.

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
If this variable is set, the match object will be expanded using this string
to obtain the notebook IP.
See Python docs: re.match.expand""").tag(config=True)
state_unknown_re = Unicode('^$',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what I was missing. If it was an empty string, in my mind it should be considered unset, and then not checked and it is never "unknown". I think "unknown" should be an optional state, and if it's not set, it will simply never return true.

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
If this variable is set, the match object will be expanded using this string
to obtain the notebook IP.
See Python docs: re.match.expand""").tag(config=True)
state_unknown_re = Unicode('^$',
help="Regex that matches job_status if the resource manager is not answering").tag(config=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^-- this is what empty string means

Comment on lines +495 to +497
def state_isunknown(self):
assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
return self.job_status and re.search(self.state_unknown_re, self.job_status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def state_isunknown(self):
assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
return self.job_status and re.search(self.state_unknown_re, self.job_status)
def state_isunknown(self):
if self.state_unknown_re:
return self.job_status and re.search(self.state_unknown_re, self.job_status)

Changed so that if state_unknown_re is an empty string, it will always be false.

Does it need to self.job_status and here? Let's see, this would only matter if it's an empty string. In that case, I guess the regex hopefully wouldn't match anyway. BUT- this saves us from an exception if self.job_status is None. So I guess it's correct as-is. Anyone disagree?

self.job_status = out
self.job_status = await self.run_command(cmd)
except RuntimeError as e:
self.job_status = e.args[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this change doesn't appear, or am I misinterperting something?

@@ -326,22 +340,20 @@ def state_isrunning(self):
"Return boolean indicating if job is running, likely by parsing self.job_status"
raise NotImplementedError("Subclass must provide implementation")

def state_isunknown(self):
"Return boolean indicating if job state retrieval failed because of the resource manager"
raise False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems to be not applied. And I got another idea, it should be None if not implemented. Equivalent to False for practical purposes in a boolean context, but if someone really wanted to know if it was implemented or not, then they can.

Suggested change
raise False
return None

@rkdarst
Copy link
Contributor

rkdarst commented Jul 30, 2020

Since I wasn't sure if I should edit directly here, I'm working on this in another branch and will push as another PR. Overall works well, taking some care to make the tests pass (including understand what the expected JH behavior is)

rkdarst added a commit to rkdarst/batchspawner that referenced this pull request Jul 30, 2020
@mbmilligan
Copy link
Member

Since it looks like #187 supersedes this PR, I am closing this one. Someone please correct me if that's incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants