pending allocations stuck in pending state after adoption by a new deployment #23305
Description
Nomad version
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c
Operating system and Environment details
All Linux
Issue
- We have a large Nomad cluster (> 2k agents).
- We run health checks on each agent and do dynamic metadata updates.
- We have a large job that gets scheduled on 99% of the healthy agents.
- We use a shell script to query Nomad for the count of nodes matching a filter on the relevant dynamic metadata, and run
nomad job scale JOB TG COUNT
when that count changes (see the sketch after this list).
- We have a way to identify whether the currently running deployment is due to a new Docker image or due to a scaling activity.
- If it is due to a scaling activity on another task group, we cancel that deployment and issue a new
nomad job scale
command immediately afterwards.
- Because the dynamic metadata changes so frequently, we are constantly doing deployments.
- We kicked off 250 deployments in the last hour.
- Existing allocations usually all get adopted by the new deployment.
- The issues:
- Most of the time (but not always), the version on existing allocations that are still "pending" at the time of deployment cancellation gets updated; however, we end up with some allocations whose version never gets updated.
- Sometimes the allocation exists but isn't assigned to a client in the web UI. If you inspect the alloc with the Nomad CLI, it tells you which client it has been assigned to.
- Allocations get stuck in the pending state. We've found that sending a SIGQUIT to the Nomad agent fixes this, but that doesn't scale to > 2k nodes.
- Allocations will get rescheduled before they ever leave the pending state. YEAH.
- Possibly related, but might not be: if a node misses a heartbeat and re-registers, we can end up with stuck pending allocations on that node as well. This is also fixed with a SIGQUIT.
- The stuck-allocation aspect is brutal, since as far as I can tell it never starts the health/deployment timers.
- The fact that restarting the Nomad agent fixes it is a little unnerving. I vaguely recall seeing something in the logs about "allocations out of date", or something like that, when this issue has manifested in the past.
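For concreteness, the scaling loop looks roughly like the sketch below. This is a simplified stand-in, not our actual tooling: the job/group names, the "healthy" metadata key, the is_scale_deployment stub, and the use of nomad deployment fail to cancel the in-flight deployment are all placeholders/assumptions.

```bash
#!/usr/bin/env bash
# Simplified sketch only: job/group names, the "healthy" metadata key, the
# is_scale_deployment stub, and using `nomad deployment fail` to cancel the
# in-flight deployment are placeholders, not our exact tooling.
set -euo pipefail

JOB="example-job"       # placeholder job name
TG="example-group"      # placeholder task group name

# Site-specific check for "this deployment was caused by a scale action,
# not a new image" (we tag these; details omitted here).
is_scale_deployment() { false; }

# Desired count: nodes whose dynamic metadata matches the filter.
# Assumes the metadata key is visible to the server-side list filter.
want=$(nomad node status -filter 'Meta.healthy == "true"' -json | jq 'length')

# Current count for the task group, read from the registered job.
have=$(nomad job inspect "$JOB" \
  | jq --arg tg "$TG" '(.Job // .) | .TaskGroups[] | select(.Name == $tg) | .Count')

# If the active deployment came from a scale action on another task group,
# cancel it before issuing the new scale request.
deploy=$(nomad job deployments -json "$JOB" \
  | jq -r '[.[] | select(.Status == "running")][0].ID // empty')
if [ -n "$deploy" ] && is_scale_deployment "$deploy"; then
  nomad deployment fail "$deploy"
fi

# Scale the task group to match the current healthy-node count.
if [ "$want" != "$have" ]; then
  nomad job scale "$JOB" "$TG" "$want"
fi
```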
Reproduction steps
- Run a large cluster of nodes with intermittent connectivity issues to the servers.
- Run a big, long-running job on said cluster.
- Constantly deploy and scale the jobspec (see the loop sketched below).
- Wait.
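To be clear about the "constantly deploy and scale" step, a driver as simple as the sketch below (placeholder names, counts, and intervals, not our real tooling) is enough to keep a fresh deployment in flight essentially all the time:

```bash
# Placeholder repro driver: alternate scale changes with job re-submissions
# so there is almost always a new deployment in flight. Names, counts, and
# intervals are arbitrary.
JOB="example-job"
TG="example-group"

while true; do
  # Jitter the count so every iteration produces a scaling deployment.
  nomad job scale "$JOB" "$TG" $(( 1900 + RANDOM % 200 ))
  sleep 15

  # Re-submit the jobspec (in practice with a new Docker image tag, so it
  # registers as a new job version/deployment).
  nomad job run ./example.nomad.hcl
  sleep 15
done
```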
Expected Result
- nomad schedules and runs jobs
Actual Result
- nomad doesn't
Screenshots of Absurdity
In this screenshot, all the allocs are part of the same job but different task groups; the colours correspond to the task groups. Note the varying versions, and that some have a client assigned while others do not.
Additionally, you can see the Modified time is the same for the allocations that stay up to date but isn't changing on the others, and the Created times are all over the place.
In this screenshot, you can see a pending allocation that has been rescheduled, and the rescheduled allocation is marked pending as well. Neither allocation has been assigned to a client as far as the Nomad web UI is concerned.
Reporter's speculation
Maybe it has something to do with how the allocations are being adopted by the rapid succession of deployments? This definitely reeks of a race condition.
Nomad logs
Useless, unfortunately, due to this bug: #22431