Description
Nomad version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc
Operating system and Environment details
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Issue
We've seen this issue a couple of times. It manifests as placement errors during deployments of services that must run on specific hosts. The deploy can't proceed because it must place the allocation on one particular host, and that host still thinks an old allocation exists.
If an operator tries to kill the alloc in the UI (or in the CLI), we see it briefly transition from RUNNING to COMPLETE and then back to RUNNING again.
Restarting the client doesn't help either. The only fix we've found is deleting the nomad-client state:
sudo systemctl stop nomad-client.service
sudo mkdir /data/service/nomad-client/data/client/bkp
sudo mv /data/service/nomad-client/data/client/state.db.backup /data/service/nomad-client/data/client/bkp/
sudo mv /data/service/nomad-client/data/client/state.db /data/service/nomad-client/data/client/bkp/
sudo systemctl start nomad-client
On the last two occasions I noticed a pattern: a nomad-client restart log entry at roughly the same time as the kill alloc log entry. So it seems the client fails to persist the fact that the alloc is dead and GC'd, and after the restart the server (or the client, I don't know how it works) still thinks the allocation exists.
I attached the client logs; the allocation ID is 7e7d1649-3ecf-a1b4-b258-157944a59831
Reproduction steps
It's hard to reproduce and I haven't managed to, but my best guess would be to:
- send a signal to restart/stop alloc
- immediately try to restart nomad client
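The two steps above could be scripted roughly as follows. This is only a sketch of my guess, not a confirmed reproduction: the `run` and `repro` helpers and the `DRY_RUN`, `UNIT` and `ALLOC_ID` variables are hypothetical names made up for illustration, and the unit name is assumed to be the one used in the workaround. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the suspected race: stop an alloc, then restart the client at once.
# Assumptions: the unit is nomad-client.service and ALLOC_ID is set by the caller.

UNIT="${UNIT:-nomad-client.service}"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    alloc_id="$1"
    # -detach returns immediately instead of monitoring the evaluation,
    # so the client restart follows the stop request as closely as possible.
    run nomad alloc stop -detach "$alloc_id"
    run sudo systemctl restart "$UNIT"
    # Check whether the alloc is reported as complete or still running.
    run nomad alloc status "$alloc_id"
}

# Only do something when an allocation ID was provided.
if [ -n "${ALLOC_ID:-}" ]; then
    repro "$ALLOC_ID"
fi
```

Running it with `ALLOC_ID=<id> DRY_RUN=0` executes the commands for real. It may take several attempts, since the restart has to land while the client is still writing the alloc's final state.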
Expected Result
Alloc state is COMPLETE
Actual Result
Alloc still exists and can't be killed
Nomad Server logs (if appropriate)
Let me know if it's relevant
Nomad Client logs (if appropriate)