pending allocations stuck in pending state after adoption by a new deployment #23305

@lattwood

Description

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

All Linux

Issue

  • We have a large Nomad cluster: >2k agents.
  • We run health checks on each agent and do dynamic metadata updates.
  • We have a large job that gets scheduled on 99% of the healthy agents.
  • We use a shell script to query Nomad for the number of nodes matching a filter on the relevant dynamic metadata, and run nomad job scale JOB TG COUNT whenever that count changes (see the sketch after this list).
  • We have a way to identify whether the currently running deployment is due to a new Docker image or due to a scaling activity.
  • If it's due to a scaling activity on another task group, we cancel that deployment, and issue a new nomad job scale command immediately afterwards.
    • Due to how frequently the dynamic metadata is changing, we are constantly doing deployments.
    • We kicked off 250 deployments in the last hour.
  • Existing allocations usually all get adopted by the new deployment.
  • The issues:
    • Most of the time, but not always, the version on the existing allocs that are still "pending" at deployment cancellation gets updated, but we end up with some allocations that never get the version updated.
    • Sometimes the allocation exists but isn't assigned to a client in the UI; if you inspect the alloc with the Nomad CLI, it tells you which client it has been assigned to.
    • Allocations get stuck in the pending state. We've found sending a SIGQUIT to the Nomad agent fixes this (see the command after this list), but that doesn't scale to >2k nodes.
    • Allocations will get rescheduled before they ever leave the pending state. YEAH.
    • Possibly related, but might not be: if a node heartbeat is missed and the node reregisters, we can end up with stuck pending jobs on that node as well. This is also fixed with a SIGQUIT.
      • The stuck jobs aspect is brutal, since that doesn't start the health/deployment timers as far as I can tell.
      • The fact that restarting the Nomad agent fixes it is a little unnerving. I vaguely recall seeing something in the logs about "allocations out of date" or similar when this issue has manifested in the past.
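
For context, the scale/cancel loop looks roughly like the sketch below. This is simplified and not the exact production script: the job/group names, the node filter, and the deployment_is_scale_only helper are placeholders, and nomad deployment fail stands in for "cancelling" the deployment since that's the closest CLI operation.

```sh
#!/usr/bin/env bash
# Simplified sketch of the scale/cancel loop described above.
set -euo pipefail

JOB="big-job"      # placeholder job name
GROUP="worker"     # placeholder task group
STATE="/var/tmp/${JOB}.${GROUP}.count"

# Count ready nodes matching the relevant dynamic metadata. The filter
# expression here is illustrative; adjust it to however the metadata is exposed.
count="$(nomad node status -json -filter 'Status == "ready"' | jq 'length')"

last="$(cat "$STATE" 2>/dev/null || echo -1)"
if [ "$count" != "$last" ]; then
  # Find the currently running deployment for the job, if any.
  dep="$(nomad deployment list -json |
    jq -r --arg j "$JOB" '[.[] | select(.JobID == $j and .Status == "running")][0].ID // empty')"

  # deployment_is_scale_only is a hypothetical, site-specific helper that checks
  # whether the running deployment came from a scaling event rather than a new image.
  if [ -n "$dep" ] && deployment_is_scale_only "$dep"; then
    nomad deployment fail "$dep"   # closest CLI operation to "cancelling" it
  fi

  nomad job scale "$JOB" "$GROUP" "$count"
  echo "$count" > "$STATE"
fi
```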

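The SIGQUIT workaround mentioned above is just the following (assuming the agent runs under systemd as nomad.service and the unit is set to restart on exit):

```sh
# Force the Go runtime to dump goroutines and exit; systemd then restarts the agent.
sudo systemctl kill --signal=SIGQUIT nomad
```
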
Reproduction steps

  • Run a big cluster of nodes with intermittent connectivity issues to the servers.
  • Run a big, long-running job on said cluster.
  • Constantly deploy and scale the jobspec (see the sketch below).
  • Wait.
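
Something like the following is enough to keep a scaling deployment constantly in flight (rough sketch; job/group names and counts are placeholders):

```sh
# Crude reproduction driver: alternate the group count so a deployment is always running.
while true; do
  nomad job scale big-job worker 500
  sleep 30
  nomad job scale big-job worker 490
  sleep 30
done
```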

Expected Result

  • Nomad schedules and runs jobs

Actual Result

  • Nomad doesn't

Screenshots of Absurdity

In this screenshot, all the allocs are part of the same job but in different task groups; the colours correspond to the task groups. Note the varying versions, and that some allocs have a client assigned while others do not.

Additionally, you can see the Modified time is the same for the allocs that are staying up to date but isn't changing on the others; you can also see the Created times are all over the place.

[Screenshot 2024-06-11 at 3:35:54 PM]

In this screenshot, you can see a pending allocation that has been rescheduled, and the rescheduled allocation is marked pending as well. Neither allocation has been assigned to a client as far as the Nomad web UI is concerned.

[Screenshot 2024-06-11 at 4:52:00 PM]

Reporter's speculation

Maybe it has something to do with how the allocations are being adopted by the rapid deployments? This definitely reeks of a race condition.

Nomad logs

Useless, unfortunately, due to this bug: #22431
