computational nodes went down when job was submitted #36
Open
Description
The compute nodes lose their connection after one specific job (requesting 20 nodes) is assigned to them. The job is rejected and re-queued many times, and finally ends up in BatchHold.
zcao@mu01:server_logs$ tracejob 72967 -n 4
Job: 72967.mu01
02/22/2020 05:53:50 S enqueuing into extended, state 1 hop 1
02/22/2020 05:53:50 A queue=extended
02/23/2020 00:01:32 S unable to run job, MOM rejected/timeout
02/23/2020 00:01:32 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:01:33 S Job Run at request of root@mu01
02/23/2020 00:06:33 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:11:34 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:16:35 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:21:36 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:26:37 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:31:38 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:36:39 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:41:40 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:46:41 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:51:42 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:56:43 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:01:44 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:06:45 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:11:46 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:16:47 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:21:48 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:26:49 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:31:50 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:36:51 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:41:52 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:46:53 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 01:51:54 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 01:56:55 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 02:01:56 S unable to run job, send to MOM '11.11.11.2' failed
02/25/2020 20:45:19 S Job deleted at request of zcao@mu01
02/25/2020 20:45:19 A requestor=zcao@mu01
02/25/2020 21:00:27 S on_job_exit valid pjob: 72967.mu01 (substate=59)
02/25/2020 21:15:29 S dequeuing from extended, state COMPLETE
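In case it helps with triage, here is a minimal sketch of how the failed sends in a trace like the one above could be tallied per MOM address. It assumes the log lines keep the exact `send to MOM '<address>' failed` wording shown in this output. The names `SAMPLE`, `FAILED_SEND`, and `count_failed_sends` are made up for illustration and are not part of Torque.

```python
import re
from collections import Counter

# A few lines copied from the tracejob output above (not the full log).
SAMPLE = """\
02/23/2020 00:01:32 S unable to run job, MOM rejected/timeout
02/23/2020 00:01:32 S unable to run job, send to MOM '11.11.11.2' failed
02/23/2020 00:06:33 S unable to run job, send to MOM '11.11.11.3' failed
02/23/2020 00:11:34 S unable to run job, send to MOM '11.11.11.2' failed
"""

# Assumption: the message format matches what this trace shows.
FAILED_SEND = re.compile(r"send to MOM '([^']+)' failed")


def count_failed_sends(log_text):
    """Return a Counter mapping MOM address -> number of failed sends."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_SEND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Print each MOM address with its failure count, most frequent first.
    for mom, n in count_failed_sends(SAMPLE).most_common():
        print(mom, n)
```

Run against the full trace, this should make it easy to see that the retries alternate between 11.11.11.2 and 11.11.11.3 roughly every five minutes.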
A similar issue happened with job 72823.mu01. Is it related to #8? Thanks!