SGE: do not consider u(nknown) nodes as busy in jobwatcher
* Remove u(nknown) from SGE_BUSY_STATES so that jobwatcher no longer counts a node in the u state as busy. This avoids overscaling when the sge process is temporarily unresponsive, for example when it hits a network bottleneck in a large-scale setting.
* nodewatcher should already replace a node that stays in the u state.
* Update the unit test to match the change above.

Signed-off-by: Rex <shuningc@amazon.com>
rexcsn committed Apr 3, 2020
1 parent bf66971 commit 96ceb05
Showing 2 changed files with 6 additions and 2 deletions.
6 changes: 5 additions & 1 deletion src/common/schedulers/sge_commands.py
@@ -55,8 +55,12 @@
 # S(ubordinate), d(isabled), D(isabled), E(rror), c(configuration ambiguous), o(rphaned), P(reempted),
 # or some combination thereof.
 # Refer to qstat man page for additional details.
+
+# u(nknown) is not considered as busy since the node will eventually be replaced by nodewatcher.
+# Otherwise there might be overscaling issue when sge process is temporarily unresponsive
+# when hitting bottleneck on network, etc in a large scale setting.
 # o(rphaned) is not considered as busy since we assume a node in orphaned state is not present in ASG anymore
-SGE_BUSY_STATES = ["u", "C", "s", "D", "E", "P"]
+SGE_BUSY_STATES = ["C", "s", "D", "E", "P"]
 
 # This state is set by nodewatcher when the node is locked and is being terminated.
 SGE_DISABLED_STATE = "d"
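
To illustrate the effect of the change, here is a minimal sketch of how a busy-node count could be derived from the state list. The two lists are copied from the diff. The helpers `is_busy` and `count_busy_nodes`, the hostnames, and the per-character matching of state flags are assumptions made for illustration, not the actual jobwatcher code.

```python
# State lists copied from the diff: before and after this commit.
OLD_SGE_BUSY_STATES = ["u", "C", "s", "D", "E", "P"]
SGE_BUSY_STATES = ["C", "s", "D", "E", "P"]


def is_busy(node_state, busy_states=SGE_BUSY_STATES):
    """Hypothetical helper: True if any flag in the qstat state string is a busy flag.

    qstat can report a combination of flags (for example "au"), so each
    character is checked on its own. This matching rule is an assumption.
    """
    return any(flag in busy_states for flag in node_state)


def count_busy_nodes(node_states, busy_states=SGE_BUSY_STATES):
    """Hypothetical helper: count hosts whose state contains a busy flag."""
    return sum(1 for state in node_states.values() if is_busy(state, busy_states))


# Hypothetical hostnames and states, for illustration only.
nodes = {"ip-10-0-0-1": "u", "ip-10-0-0-2": "", "ip-10-0-0-3": "E"}

busy_before = count_busy_nodes(nodes, OLD_SGE_BUSY_STATES)  # "u" and "E" both count
busy_after = count_busy_nodes(nodes)  # only "E" counts
print(busy_before, busy_after)  # prints: 2 1
```

Under these assumptions, a node that reports only `u` no longer adds to the busy count, which is consistent with the expected value in the unit test dropping from 1 to 0.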
2 changes: 1 addition & 1 deletion tests/jobwatcher/plugins/test_sge.py
@@ -42,7 +42,7 @@
 jobs=[],
 )
 },
-1,
+0,
 ),
 (
 {