Description
Hey folks, I have spent a fair bit of time debugging this and I believe there is a problem with the code introduced in fb05da3 related to stuck launching tasks. It appears to continually request reconciliation for all tasks in the system after they have been running for a short while.
Here's the setup I validated this on:
- Local Docker stack
- One agent
- One master
- One ZK node
- Singularity 1.2.0 and current master (same behavior on both)
Steps:
- Deploy nginx with OPTIMISTIC placement strategy
- Wait a couple of minutes
- See it start to log "Requested explicit reconcile of task ..." every 5 seconds when the SchedulerPoller runs. It's calling the Mesos master for every one of these.
Logs
singularity_1 | INFO [2020-07-23 13:57:03,437] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1 | INFO [2020-07-23 13:57:03,437] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1 | INFO [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1 | INFO [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.007), 0 accepted, 0 declined/cached
singularity_1 | INFO [2020-07-23 13:57:08,447] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1 | INFO [2020-07-23 13:57:08,447] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1 | INFO [2020-07-23 13:57:08,455] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1 | INFO [2020-07-23 13:57:08,458] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.011), 0 accepted, 0 declined/cached
Possible Cause
Looking at the code, it appears to me that the cause is in this method:
Singularity/SingularityService/src/main/java/com/hubspot/singularity/data/TaskManager.java
Lines 892 to 897 in 4f7a41d
    public List<SingularityTaskId> getLaunchingTasks() {
      return getActiveTaskIds()
        .stream()
        .filter(t -> !exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)))
        .collect(Collectors.toList());
    }
In /api/state I see the following:
"activeTasks":1,
"launchingTasks":0,
"activeRequests":1
And yet it continues to run the reconciliation. I checked in ZooKeeper what is in the history for my nginx task above. It only has two entries:
[zk: localhost:2181(CONNECTED) 3] ls /singularity/tasks/history/dev_nginx/dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT/updates
[TASK_LAUNCHED, TASK_RUNNING]
So I am pretty sure the code above is the culprit. If I rebuild the current master branch without the ! in front of exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)), I no longer see the issue.
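To make the behavior easier to see, here is a minimal standalone sketch of the two filter variants run against a history like the one above. It is not Singularity code: the class LaunchingTaskFilterSketch, the State enum, and the in-memory map standing in for the ZooKeeper update nodes are all hypothetical, and it only illustrates the predicate logic, not which fix is right.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for TaskManager; the map replaces the ZooKeeper update nodes.
public class LaunchingTaskFilterSketch {
  public enum State { TASK_LAUNCHED, TASK_STARTING, TASK_RUNNING }

  // Same predicate shape as the quoted filter: a task is returned
  // whenever its history has no TASK_STARTING update.
  public static List<String> withNegation(Map<String, Set<State>> history) {
    return history.entrySet().stream()
      .filter(e -> !e.getValue().contains(State.TASK_STARTING))
      .map(Map.Entry::getKey)
      .collect(Collectors.toList());
  }

  // The variant described above: the same filter with the "!" removed.
  public static List<String> withoutNegation(Map<String, Set<State>> history) {
    return history.entrySet().stream()
      .filter(e -> e.getValue().contains(State.TASK_STARTING))
      .map(Map.Entry::getKey)
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, Set<State>> history = new LinkedHashMap<>();
    // Matches the ZooKeeper listing: only TASK_LAUNCHED and TASK_RUNNING.
    history.put(
      "dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT",
      EnumSet.of(State.TASK_LAUNCHED, State.TASK_RUNNING));

    // The running task is still returned, so it would be reconciled on every poll.
    System.out.println(withNegation(history));
    // Prints [] for this history.
    System.out.println(withoutNegation(history));
  }
}
```

With the negated predicate, any task that goes from TASK_LAUNCHED straight to TASK_RUNNING stays in the list for as long as it is active, which would match the repeated reconcile requests in the logs.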