Introduce unknown job status to handle communication problem with resource manager #179

cmd-ntrf · 2020-06-25T15:37:59Z

This is a draft PR to handle the case where the resource manager does not return a status that is valid for BatchSpawner, but the error message returned indicates a problem with the resource manager and not the job.

This happens quite often with Slurm where squeue would return slurm_load_jobs error: Socket timed out on send/recv. The notebook job is still running fine, squeue was just not able to query its state. Currently, batchspawner will clear the state if it cannot query it, and the job will be subsequently cancelled, which is inconvenient. This problem has been exposed in #171 and #178.

The proposed solution is to refactor the job status querying to return a JobStatus object instead of the job_status string. We introduce four states: NOTQUEUED, RUNNING, PENDING, and UNKNOWN. Because the function deals with job status and not state, the method read_job_state is renamed query_job_status. This also makes the method names more consistent with the documentation.

Instead of calling the state_is* functions, query_job_status returns a JobStatus object which can be used to determine what should be done next. The unknown status is determined by the state_isunknown function, and for now only SlurmSpawner implements a regex to handle the cases where the query command's exit code is 1.
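A minimal sketch of the idea described above, not the code in this PR. The enum members follow the four states named in the description; `classify_status` and the three regexes are hypothetical names and patterns chosen for illustration, with the "unknown" pattern based on the squeue error quoted earlier.

```python
import re
from enum import Enum


class JobStatus(Enum):
    # The four states named in the PR description.
    NOTQUEUED = 0
    RUNNING = 1
    PENDING = 2
    UNKNOWN = 3


# Hypothetical patterns for illustration only; real spawners configure their own.
RUNNING_RE = re.compile(r"^RUNNING\b")
PENDING_RE = re.compile(r"^PENDING\b")
UNKNOWN_RE = re.compile(r"slurm_load_jobs error: Socket timed out")


def classify_status(raw):
    """Map raw query output (stdout, or stderr on failure) to a JobStatus."""
    if raw and UNKNOWN_RE.search(raw):
        # The resource manager did not answer, so nothing is known about the job.
        return JobStatus.UNKNOWN
    if raw and RUNNING_RE.search(raw):
        return JobStatus.RUNNING
    if raw and PENDING_RE.search(raw):
        return JobStatus.PENDING
    # No recognizable output: treat the job as not queued.
    return JobStatus.NOTQUEUED
```

The point of checking the "unknown" pattern first is that a caller can keep the stored job state when it sees UNKNOWN instead of clearing it and cancelling a job that may still be running.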

rkdarst

I think this is a good idea. Even if it wasn't needed, I think it simplifies some logic I had wanted to simplify anyway. There are some miscellaneous suggestions which I added line-by-line.

Anyway, I would recommend accepting, we can remove the TMPFAIL state later if there turns out to be a better way.

Other suggestions:

Allow some number of TMPFAILs before it actually returns NOTPRESENT.
We could ask Min if an "allow n failures" option could be added to jupyterhub itself. I think that spawner.py:start_polling is the place to look to do it.
It would be nice to add some generic caching layer, but do you think anyone would do it in a quick timeframe? I doubt I would, this seems like it works well enough for now for what I need...
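The first suggestion could look roughly like the sketch below. `FailureTolerance`, `max_failures`, and the use of plain strings for the states are all assumptions made for illustration; nothing like this exists in batchspawner or jupyterhub as far as this thread shows.

```python
class FailureTolerance:
    """Hypothetical helper: tolerate up to `max_failures` consecutive TMPFAIL polls."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.consecutive = 0  # TMPFAILs seen since the last definite answer

    def observe(self, status):
        """Return the status to act on, given the latest polled status."""
        if status != "TMPFAIL":
            # Any definite answer resets the counter.
            self.consecutive = 0
            return status
        self.consecutive += 1
        if self.consecutive > self.max_failures:
            # Too many failures in a row: give up and report the job as gone.
            return "NOTPRESENT"
        return "TMPFAIL"
```

With `max_failures=2`, the first two failed polls are passed through as TMPFAIL and only the third consecutive one is reported as NOTPRESENT.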

I'll work to make tests pass if @cmd-ntrf accepts or rejects the suggestions below.

Thanks for the good work!

batchspawner/batchspawner.py

rkdarst · 2020-07-01T21:30:21Z

batchspawner/batchspawner.py

-            self.job_status = out
+            self.job_status = await self.run_command(cmd)
+        except RuntimeError as e:
+            self.job_status = e.args[0]


Suggested change

-            self.job_status = e.args[0]
+            # e.args[0] is stderr from the process
+            self.job_status = e.args[0]

It seems this change doesn't appear, or am I misinterpreting something?

batchspawner/batchspawner.py

rkdarst · 2020-07-01T21:33:51Z

batchspawner/batchspawner.py

@@ -326,22 +340,20 @@ def state_isrunning(self):
        "Return boolean indicating if job is running, likely by parsing self.job_status"
        raise NotImplementedError("Subclass must provide implementation")

+    def state_isunknown(self):
+        "Return boolean indicating if job state retrieval failed because of the resource manager"
+        raise False


Suggested change

-        raise False
+        return False

I'm not sure if you meant this to be return False or raise NotImplementedError(). I think it makes sense for this to default to False; it doesn't need to be implemented.

This also seems not to have been applied. And I got another idea: it should be None if not implemented. That is equivalent to False for practical purposes in a boolean context, but if someone really wanted to know whether it was implemented or not, they could.

Suggested change

-        raise False
+        return None

rkdarst · 2020-07-01T22:02:16Z

batchspawner/batchspawner.py

+        return self.job_status and re.search(self.state_running_re, self.job_status)
+
+    def state_isunknown(self):
+        assert self.state_unknown_re, "Misconfigured: define state_unknown_re"


Suggested change

-        assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
+        if not self.state_unknown_re:
+            return False

I guess this doesn't need to be implemented for all spawners... it goes back to default behavior if it's not here.

rkdarst · 2020-07-01T22:03:34Z

batchspawner/batchspawner.py

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
        If this variable is set, the match object will be expanded using this string
        to obtain the notebook IP.
        See Python docs: re.match.expand""").tag(config=True)
+    state_unknown_re = Unicode('^$',


Suggested change

-    state_unknown_re = Unicode('^$',
+    state_unknown_re = Unicode('',

As below, I guess this should be a false-like value if it's not set.

If the regex is an empty string, it will match any string and state_isunknown will always return True. The regex currently implemented matches only an empty string, which I think is a reasonable case in which to conclude that the state of the job is unknown when querying the resource manager.

Ah, I see what I was missing. If it was an empty string, in my mind it should be considered unset, and then not checked and it is never "unknown". I think "unknown" should be an optional state, and if it's not set, it will simply never return true.
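The two positions in this exchange can be checked directly with Python's `re` module. The sketch below is an illustration only: `is_unknown` is a hypothetical free function standing in for state_isunknown, written with the "empty pattern means unset" semantics proposed here.

```python
import re


def is_unknown(state_unknown_re, job_status):
    """Hypothetical stand-in for state_isunknown with 'empty means unset' semantics."""
    if not state_unknown_re:
        # Unset pattern: the "unknown" state is disabled and never reported.
        return False
    if job_status is None:
        return False
    return re.search(state_unknown_re, job_status) is not None


# An empty pattern matches any string, so it cannot be searched as-is.
empty_matches_anything = re.search("", "RUNNING") is not None

# '^$' matches the empty string but not ordinary output.
caret_dollar_matches_empty = re.search("^$", "") is not None
caret_dollar_matches_text = re.search("^$", "RUNNING") is not None
```

Under these semantics a site that never configures the pattern simply never sees the unknown state, while '^$' remains available for treating empty output as unknown.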

rkdarst · 2020-07-01T22:03:49Z

batchspawner/batchspawner.py

@@ -467,20 +480,20 @@ class BatchSpawnerRegexStates(BatchSpawnerBase):
        If this variable is set, the match object will be expanded using this string
        to obtain the notebook IP.
        See Python docs: re.match.expand""").tag(config=True)
+    state_unknown_re = Unicode('^$',
+        help="Regex that matches job_status if the resource manager is not answering").tag(config=True)


Suggested change

-        help="Regex that matches job_status if the resource manager is not answering").tag(config=True)
+        help="Regex that matches job_status if the resource manager is not answering. An empty string means 'not in use'.").tag(config=True)

^-- this is what empty string means

rkdarst

Some more ideas below. The semantics of the empty state_unknown_re are up for debate, but I think we should somehow allow it to be undefined. It could be None as empty value instead, but we'd need to see how to do that with traitlets.

rkdarst · 2020-07-22T19:31:11Z

batchspawner/batchspawner.py

+    def state_isunknown(self):
+        assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
+        return self.job_status and re.search(self.state_unknown_re, self.job_status)


Suggested change

-    def state_isunknown(self):
-        assert self.state_unknown_re, "Misconfigured: define state_unknown_re"
-        return self.job_status and re.search(self.state_unknown_re, self.job_status)
+    def state_isunknown(self):
+        if self.state_unknown_re:
+            return self.job_status and re.search(self.state_unknown_re, self.job_status)

Changed so that if state_unknown_re is an empty string, it will always be false.

Does it need the "self.job_status and" here? Let's see, this would only matter if it's an empty string. In that case, I guess the regex hopefully wouldn't match anyway. But this saves us from an exception if self.job_status is None. So I guess it's correct as-is. Anyone disagree?
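The exception mentioned here is easy to reproduce in isolation. This is a standalone sketch, with `pattern` and `job_status` as illustrative local variables rather than the spawner's attributes.

```python
import re

pattern = "Socket timed out"

# Passing None as the string to re.search raises TypeError.
try:
    re.search(pattern, None)
    raised = False
except TypeError:
    raised = True

# With the guard, `and` short-circuits on the falsy left operand,
# so re.search is never called and the expression evaluates to None.
job_status = None
result = job_status and re.search(pattern, job_status)
```

So the guard is what keeps a poll that produced no status at all from turning into a TypeError.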

rkdarst · 2020-07-30T14:20:39Z

Since I wasn't sure if I should edit directly here, I'm working on this in another branch and will push it as another PR. Overall it works well; I'm taking some care to make the tests pass (including understanding what the expected JH behavior is).

mbmilligan · 2020-08-07T22:55:47Z

Since it looks like #187 supersedes this PR, I am closing this one. Someone please correct me if that's incorrect.

cmd-ntrf added 6 commits June 23, 2020 12:41

Add logic to handle cases where the rm is not answering

b107e47

Extend slurm unknown regex

6fbb5b6

Add check to state_isunknown in stop

dd5ac3f

Rename read_job_state as query_job_status

afa666b

Avoid confusion with the *_state methods

Replace poll by query_job_status

ff9ab2a

Calling poll meant we would be running state_isrunning and state_ispending twice.

Define JobStatus and make query_job_status return a JobStatus

68be363

cmd-ntrf mentioned this pull request Jun 25, 2020

When the Scheduler/RM fails? #171

Closed

rkdarst approved these changes Jul 1, 2020

cmd-ntrf added 2 commits July 8, 2020 09:43

Rename NOTQUEUED as NOTFOUND

8aa71b4

Add docstring to query_job_status

fc522b6

rkdarst suggested changes Jul 22, 2020

This was referenced Jul 23, 2020

Batchspawner spawning / keep-alive is instable #174

Closed

PR status #182

Open

PR status #183

Closed

rkdarst added a commit to rkdarst/batchspawner that referenced this pull request Jul 30, 2020

batchspawner/batchspawner: Fixups of jupyterhub#179

e6c0a3e

rkdarst mentioned this pull request Jul 30, 2020

unknown job state: Improvements of #179 to make tests pass #187

Merged

mbmilligan closed this Aug 7, 2020

rcthomas mentioned this pull request Sep 13, 2020

Could Spawner.poll() have an "unknown" status? jupyterhub/jupyterhub#3171

Open
