pending allocations stuck in pending state after adoption by a new deployment #23305
Description
Nomad version
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c
Operating system and Environment details
All Linux
Issue
- We have a large Nomad cluster (> 2k agents).
- We run health checks on each agent and do dynamic metadata updates.
- We have a large job that gets scheduled on 99% of the healthy agents.
- We use a shell script to query Nomad for the count of nodes matching a filter on the relevant dynamic metadata, and run
nomad job scale JOB TG COUNT
when that count changes (see the sketch after this list).
- We have a way to identify whether the currently running deployment is due to a new Docker image or due to a scaling activity.
- If it is due to a scaling activity on another task group, we cancel that deployment and issue a new
nomad job scale
command immediately afterwards.
- Because the dynamic metadata changes so frequently, we are constantly doing deployments.
- We kicked off 250 deployments in the last hour.
- Existing allocations usually all get adopted by the new deployment.
- The issues:
- Most of the time (but not always), the version on existing allocations that are still "pending" at the time of deployment cancellation gets updated; however, we end up with some allocations whose version never gets updated.
- Sometimes the allocation exists but isn't assigned to a client in the web UI. If you inspect the alloc with the Nomad CLI, it tells you which client it has been assigned to.
- Allocations get stuck in the pending state. We've found that sending a SIGQUIT to the Nomad agent fixes this, but that doesn't scale to > 2k nodes.
- Allocations will get rescheduled before they ever leave the pending state. YEAH.
- Possibly related, but might not be: if a node misses a heartbeat and re-registers, we can end up with stuck pending allocations on that node as well. This is also fixed with a SIGQUIT.
- The stuck-allocation aspect is brutal, since as far as I can tell it never starts the health/deployment timers.
- The fact that restarting the Nomad agent fixes it is a little unnerving. I vaguely recall seeing something in the logs about "allocations out of date", or something like that, when this issue has manifested in the past.
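For concreteness, the scaling loop looks roughly like the sketch below. This is a simplified stand-in, not our actual tooling: the job/group names, the "healthy" metadata key, the is_scale_deployment stub, and the use of nomad deployment fail to cancel the in-flight deployment are all placeholders/assumptions.

```bash
#!/usr/bin/env bash
# Simplified sketch only: job/group names, the "healthy" metadata key, the
# is_scale_deployment stub, and using `nomad deployment fail` to cancel the
# in-flight deployment are placeholders, not our exact tooling.
set -euo pipefail

JOB="example-job"       # placeholder job name
TG="example-group"      # placeholder task group name

# Site-specific check for "this deployment was caused by a scale action,
# not a new image" (we tag these; details omitted here).
is_scale_deployment() { false; }

# Desired count: nodes whose dynamic metadata matches the filter.
# Assumes the metadata key is visible to the server-side list filter.
want=$(nomad node status -filter 'Meta.healthy == "true"' -json | jq 'length')

# Current count for the task group, read from the registered job.
have=$(nomad job inspect "$JOB" \
  | jq --arg tg "$TG" '(.Job // .) | .TaskGroups[] | select(.Name == $tg) | .Count')

# If the active deployment came from a scale action on another task group,
# cancel it before issuing the new scale request.
deploy=$(nomad job deployments -json "$JOB" \
  | jq -r '[.[] | select(.Status == "running")][0].ID // empty')
if [ -n "$deploy" ] && is_scale_deployment "$deploy"; then
  nomad deployment fail "$deploy"
fi

# Scale the task group to match the current healthy-node count.
if [ "$want" != "$have" ]; then
  nomad job scale "$JOB" "$TG" "$want"
fi
```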
Reproduction steps
- Run a large cluster of nodes with intermittent connectivity issues to the servers.
- Run a big, long-running job on said cluster.
- Constantly deploy and scale the jobspec (see the loop sketched below).
- Wait.
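To be clear about the "constantly deploy and scale" step, a driver as simple as the sketch below (placeholder names, counts, and intervals, not our real tooling) is enough to keep a fresh deployment in flight essentially all the time:

```bash
# Placeholder repro driver: alternate scale changes with job re-submissions
# so there is almost always a new deployment in flight. Names, counts, and
# intervals are arbitrary.
JOB="example-job"
TG="example-group"

while true; do
  # Jitter the count so every iteration produces a scaling deployment.
  nomad job scale "$JOB" "$TG" $(( 1900 + RANDOM % 200 ))
  sleep 15

  # Re-submit the jobspec (in practice with a new Docker image tag, so it
  # registers as a new job version/deployment).
  nomad job run ./example.nomad.hcl
  sleep 15
done
```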
Expected Result
- nomad schedules and runs jobs
Actual Result
- nomad doesn't
Screenshots of Absurdity
In this screenshot, all the allocs are part of the same job but different task groups; the colours correspond to the task groups. Note the varying versions, and that some have a client assigned while others do not.
Additionally, you can see the Modified time is the same for the allocations that stay up to date but isn't changing on the others, and the Created times are all over the place.
In this screenshot, you can see a pending allocation that has been rescheduled, and the rescheduled allocation is marked pending as well. Neither allocation has been assigned to a client as far as the Nomad web UI is concerned.
Reporter's speculation
Maybe it has something to do with how the allocations are being adopted by the rapid succession of deployments? This definitely reeks of a race condition.
Nomad logs
Useless, unfortunately, due to this bug: #22431