-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce unknown job status to handle communication problem with resource manager #179
Conversation
Avoid confusion with the *_state methods
Calling poll meant we would be running state_isrunning and state_ispending twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea. Even if it wasn't needed, I think it simplifies some logic I had wanted to simplify anyway. There are some miscellaneous suggestions which I added line-by-line.
Anyway, I would recommend accepting, we can remove the TMPFAIL state later if there turns out to be a better way.
Other suggestions:
- allow some number of TMPFAILs before it actually returns NOTPRESENT.
- We could ask Min if a "allow n failures" could be added to jupyterhub itself. I think that spawner.py:start_polling is the place to look to do it.
- It would be nice to add some generic caching layer, but do you think anyone would do it in a quick timeframe? I doubt I would, this seems like it works well enough for now for what I need...
I'll work to make tests pass if @cmd-ntrf accepts or rejects the suggestions below.
Thanks for the good work!
self.job_status = out | ||
self.job_status = await self.run_command(cmd) | ||
except RuntimeError as e: | ||
self.job_status = e.args[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.job_status = e.args[0] | |
# e.args[0] is stderr from the process | |
self.job_status = e.args[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this change doesn't appear, or am I misinterperting something?
@@ -326,22 +340,20 @@ def state_isrunning(self): | |||
"Return boolean indicating if job is running, likely by parsing self.job_status" | |||
raise NotImplementedError("Subclass must provide implementation") | |||
|
|||
def state_isunknown(self): | |||
"Return boolean indicating if job state retrieval failed because of the resource manager" | |||
raise False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise False | |
return False |
I'm not sure if you meant this to be return False or raise NotImplementedError(). I think it makes sense that this defaults to False, it doesn't need to be implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems to be not applied. And I got another idea, it should be None
if not implemented. Equivalent to False for practical purposes in a boolean context, but if someone really wanted to know if it was implemented or not, then they can.
raise False | |
return None |
return self.job_status and re.search(self.state_running_re, self.job_status) | ||
|
||
def state_isunknown(self): | ||
assert self.state_unknown_re, "Misconfigured: define state_unknown_re" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert self.state_unknown_re, "Misconfigured: define state_unknown_re" | |
if not self.state_unknown_re: | |
return False |
I guess this doesn't need to be implemented for all spawners... it goes back to default behavior if it's not here.
@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase): | |||
If this variable is set, the match object will be expanded using this string | |||
to obtain the notebook IP. | |||
See Python docs: re.match.expand""").tag(config=True) | |||
state_unknown_re = Unicode('^$', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
state_unknown_re = Unicode('^$', | |
state_unknown_re = Unicode('', |
As below, I guess this should be a false-like value if it's not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the regex is an empty string, it will match any string and state_isunknown
will always return True
. The regex currently implemented will only match empty string, which I think is a good case to conclude the state of the job is unknown when querying the resource manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what I was missing. If it was an empty string, in my mind it should be considered unset, and then not checked and it is never "unknown". I think "unknown" should be an optional state, and if it's not set, it will simply never return true.
@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase): | |||
If this variable is set, the match object will be expanded using this string | |||
to obtain the notebook IP. | |||
See Python docs: re.match.expand""").tag(config=True) | |||
state_unknown_re = Unicode('^$', | |||
help="Regex that matches job_status if the resource manager is not answering").tag(config=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help="Regex that matches job_status if the resource manager is not answering").tag(config=True) | |
help="Regex that matches job_status if the resource manager is not answering. An empty string means 'not in use'.").tag(config=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^-- this is what empty string means
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more ideas below. The semantics of the empty state_unknown_re are up for debate, but I think we should somehow allow it to be undefined. It could be None
as empty value instead, but we'd need to see how to do that with traitlets.
@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase): | |||
If this variable is set, the match object will be expanded using this string | |||
to obtain the notebook IP. | |||
See Python docs: re.match.expand""").tag(config=True) | |||
state_unknown_re = Unicode('^$', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what I was missing. If it was an empty string, in my mind it should be considered unset, and then not checked and it is never "unknown". I think "unknown" should be an optional state, and if it's not set, it will simply never return true.
@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase): | |||
If this variable is set, the match object will be expanded using this string | |||
to obtain the notebook IP. | |||
See Python docs: re.match.expand""").tag(config=True) | |||
state_unknown_re = Unicode('^$', | |||
help="Regex that matches job_status if the resource manager is not answering").tag(config=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^-- this is what empty string means
def state_isunknown(self): | ||
assert self.state_unknown_re, "Misconfigured: define state_unknown_re" | ||
return self.job_status and re.search(self.state_unknown_re, self.job_status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def state_isunknown(self): | |
assert self.state_unknown_re, "Misconfigured: define state_unknown_re" | |
return self.job_status and re.search(self.state_unknown_re, self.job_status) | |
def state_isunknown(self): | |
if self.state_unknown_re: | |
return self.job_status and re.search(self.state_unknown_re, self.job_status) |
Changed so that if state_unknown_re
is an empty string, it will always be false.
Does it need to self.job_status and
here? Let's see, this would only matter if it's an empty string. In that case, I guess the regex hopefully wouldn't match anyway. BUT- this saves us from an exception if self.job_status is None
. So I guess it's correct as-is. Anyone disagree?
self.job_status = out | ||
self.job_status = await self.run_command(cmd) | ||
except RuntimeError as e: | ||
self.job_status = e.args[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this change doesn't appear, or am I misinterperting something?
@@ -326,22 +340,20 @@ def state_isrunning(self): | |||
"Return boolean indicating if job is running, likely by parsing self.job_status" | |||
raise NotImplementedError("Subclass must provide implementation") | |||
|
|||
def state_isunknown(self): | |||
"Return boolean indicating if job state retrieval failed because of the resource manager" | |||
raise False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems to be not applied. And I got another idea, it should be None
if not implemented. Equivalent to False for practical purposes in a boolean context, but if someone really wanted to know if it was implemented or not, then they can.
raise False | |
return None |
Since I wasn't sure if I should edit directly here, I'm working on this in another branch and will push as another PR. Overall works well, taking some care to make the tests pass (including understand what the expected JH behavior is) |
Since it looks like #187 supersedes this PR, I am closing this one. Someone please correct me if that's incorrect. |
This is a draft PR to handle the case where the resource manager does not return a status that is valid for BatchSpawner, but the error message returned indicate a problem with the resource manager and not the job.
This happens quite often with Slurm where squeue would return
slurm_load_jobs error: Socket timed out on send/recv
. The notebook job is still running fine, squeue was just not able to query its state. Currently, batchspawner will clear the state if it cannot query it, and the job will be subsequently cancelled, which is inconvenient. This problem has been exposed in #171 and #178.The proposed solution is to refactor the job status querying to return a JobStatus object instead of the job_status string. We introduce four states: NOTQUEUED, RUNNING, PENDING, and UNKNOWN. Because the function deal with job status and not state, the method
read_job_state
is renamedquery_job_status
. This also makes the method names more consistent with the documentation.Instead of calling the job_is* function, query_status_job returns a JobStatus object which can be used to determine what should be done next. The unknown status is determined by state_isunknown function, and for now only SlurmSpawner implements a regex to handle the cases where the query command exit code is 1.