Zombie alloc after client restart #23684

Open
@bernardoVale

Description

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Issue

We've seen this issue a couple of times. It manifests as placement errors during deployments of services that must run on specific hosts. The deployment can't proceed because it must place the allocation on one particular host, and that host still thinks an old allocation exists.

If an operator tries to kill the alloc in the UI (or with the CLI), we see it briefly transition from RUNNING to COMPLETE and then back to RUNNING.
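To capture that flip-flop outside the UI, a small polling loop can help. This is only a sketch: the function name `watch_alloc_status` and the `POLL_INTERVAL` variable are made up for illustration, and it assumes the `nomad` CLI on the host can reach the cluster. It prints the allocation's client status via `nomad alloc status -t` once per interval.

```shell
# Hypothetical helper: sample an allocation's client status repeatedly so a
# transition such as running -> complete -> running shows up with timestamps.
watch_alloc_status() {
  local alloc_id="$1"
  local samples="${2:-30}"
  local i=0
  while [ "$i" -lt "$samples" ]; do
    # -t renders the allocation through a Go template; ClientStatus is the
    # status reported by the client (e.g. "running", "complete").
    printf '%s %s\n' "$(date +%T)" \
      "$(nomad alloc status -t '{{.ClientStatus}}' "$alloc_id")"
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-1}"
  done
}

# Example (run in a second terminal while killing the alloc):
#   watch_alloc_status 7e7d1649-3ecf-a1b4-b258-157944a59831 60
```

Keeping the output of such a loop next to the client logs should make it easier to line up the status change with the client restart.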

Restarting the client doesn't help either. The only fix we've found is to delete the nomad-client state:

sudo systemctl stop nomad-client.service
sudo mkdir /data/service/nomad-client/data/client/bkp
sudo mv /data/service/nomad-client/data/client/state.db.backup /data/service/nomad-client/data/client/bkp/
sudo mv /data/service/nomad-client/data/client/state.db /data/service/nomad-client/data/client/bkp/
sudo systemctl start nomad-client

On the last two occasions, I noticed a pattern: a nomad-client restart log entry at roughly the same time as the kill alloc log entry. So it seems that the client fails to persist the state showing that the alloc is dead and GC'd, and after the restart the server (or the client, I don't know how it works) thinks the allocation still exists.

I attached the logs from the client; the allocation ID is 7e7d1649-3ecf-a1b4-b258-157944a59831

Reproduction steps

It's hard to reproduce and I couldn't, but my best guess would be to:

  1. Send a signal to restart/stop the alloc
  2. Immediately restart the nomad client
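The two steps above could be scripted roughly as follows. This is an untested guess at a reproducer, not a confirmed one: the function name `race_stop_and_restart` is made up, and it assumes the client runs as the systemd unit `nomad-client.service`, as in the workaround commands earlier in this report.

```shell
# Hypothetical reproduction helper: stop an allocation without waiting for it,
# then restart the client right away so the restart races with the client
# persisting the allocation's terminal state.
race_stop_and_restart() {
  local alloc_id="$1"
  local unit="${2:-nomad-client.service}"
  # -detach returns immediately instead of monitoring the stop.
  nomad alloc stop -detach "$alloc_id" || return 1
  sudo systemctl restart "$unit"
}

# Example:
#   race_stop_and_restart 7e7d1649-3ecf-a1b4-b258-157944a59831
```

Since the suspected window is small, it may take many attempts (or a client under load) before the race is actually hit.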

Expected Result

Alloc state is COMPLETE

Actual Result

Alloc still exists and can't be killed

Nomad Server logs (if appropriate)

Let me know if it's relevant

Nomad Client logs (if appropriate)

logs
