
lost allocation drops reschedule tracker #24918

Open
@tgross

Description

In #12319 we fixed a very old bug where, when an allocation failed and the scheduler could not find a placement, the reschedule tracker was dropped. While working with @pkazmierczak on #24869 we discovered that this bug was not 100% fixed: in the case where the node is down and the allocation is marked lost, we're somehow not propagating the reschedule tracker.
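
For context, the reschedule tracker travels with the allocation and records each replacement. Here's a rough Go sketch of the shape of that data, with field names matching the nomad operator api output shown later in this issue (a simplification, not the exact structs from Nomad's source):

package sketch

import "time"

// Simplified sketch of the reschedule-tracking data carried on an
// allocation; not Nomad's real structs.
type RescheduleEvent struct {
	RescheduleTime int64         // unix nanoseconds at which the reschedule fired
	PrevAllocID    string        // the allocation that was replaced
	PrevNodeID     string        // the node the previous allocation ran on
	Delay          time.Duration // backoff applied before placing the replacement
}

type RescheduleTracker struct {
	Events []*RescheduleEvent // one entry per reschedule so far
}

type Allocation struct {
	ID                string
	RescheduleTracker *RescheduleTracker
}

When the scheduler places a replacement, it's supposed to copy the previous allocation's tracker and append one new event; on the lost-node path that copy is evidently being skipped.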

Reproduction

To demonstrate both the working and the non-working behavior, I'm deploying to a 1 server + 1 client cluster (current tip of main, aka 1.9.6-dev) with the following jobspec. It disables restarts, so task failures go straight to the rescheduler, and includes a constraint block that lets us control whether or not placement succeeds.

jobspec
job "example" {

  group "group" {

    # cap reschedule attempts so the tracker history is finite and visible
    reschedule {
      attempts  = 30
      interval  = "24h"
      unlimited = false
    }

    # fail immediately rather than restarting in place, forcing a reschedule
    restart {
      attempts = 0
      mode     = "fail"
    }

    # placement succeeds only while node meta "example" equals "1"
    constraint {
      attribute = "${meta.example}"
      operator  = "="
      value     = "1"
    }

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

Apply the metadata that satisfies the constraint to the node:

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     ready

$ nomad node meta apply --node-id e6e43a5a example=1

Run the job.

Normal Rescheduling

Kill the task (via docker kill) to force a reschedule.

$ nomad alloc status 4d64f58c
...
Recent Events:
Time                       Type            Description
2025-01-22T15:13:20-05:00  Not Restarting  Policy allows no restarts

Wait for the allocation to be rescheduled and see that the replacement has a reschedule tracker.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1914d5a9  e6e43a5a  group       0        run      running  3s ago     2s ago
4d64f58c  e6e43a5a  group       0        stop     failed   1m14s ago  3s ago

$ nomad operator api "/v1/allocation/1914d5a9-3610-75a9-025d-729a9dbed06c" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    }
  ],
  "LastReschedule": "ok"
}
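
For readability: Delay and RescheduleTime are both in nanoseconds. A quick sketch decoding the values above (plain Go, nothing Nomad-specific):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values copied from the API output above.
	delay := time.Duration(30000000000)     // -> 30s
	ts := time.Unix(0, 1737576830218453000) // RescheduleTime is unix nanoseconds
	fmt.Println(delay, ts.UTC().Format(time.RFC3339))
	// Output: 30s 2025-01-22T20:13:50Z
	// i.e. 30s after the 15:13:20 -05:00 failure event shown above
}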

Failed Rescheduling with Correct Behavior

Now we'll change the node metadata so that the node no longer satisfies the job's constraint:

$ nomad node meta apply --node-id e6e43a5a example=2

Kill the task again to force a reschedule, and wait for the blocked eval:

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
5db8c171  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

Update the node metadata to unblock the eval:

$ nomad node meta apply --node-id e6e43a5a example=1

And wait for the node-update eval:

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
6eac73f2  50        node-update         example  default    e6e43a5a  complete  false
5db8c171  50        queued-allocs       example  default    <none>    complete  false
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

The replacement allocation has a reschedule tracker with both prior events, as we expect; this is the behavior fixed in #12319.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1a99a69c  e6e43a5a  group       0        run      running  23s ago    13s ago
1914d5a9  e6e43a5a  group       0        stop     failed   3m54s ago  23s ago
4d64f58c  e6e43a5a  group       0        stop     failed   5m5s ago   3m54s ago

$ nomad operator api "/v1/allocation/1a99a69c-55bf-ddee-0c6d-6e54222b90bf" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    },
    {
      "Delay": 60000000000,
      "PrevAllocID": "1914d5a9-3610-75a9-025d-729a9dbed06c",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737577040806473200
    }
  ],
  "LastReschedule": "ok"
}
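
Note the delays: 30s on the first event and 60s on the second. The jobspec doesn't set a delay, so (assuming the service-job defaults of a 30s base with an exponential delay function) each reschedule doubles the previous delay, which only works while the tracker's event history is carried forward:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Expected backoff assuming the default 30s exponential delay function;
	// the doubling depends on the tracker's event count being preserved.
	delay := 30 * time.Second
	for i := 1; i <= 3; i++ {
		fmt.Printf("reschedule %d: delay %v\n", i, delay)
		delay *= 2
	}
}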

Reschedule on Downed Node

Now halt the client node (here by stopping the agent with sudo systemctl stop nomad), and wait for it to be marked down.

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     down

Wait for the blocked evaluation:

$ nomad job status example
...
Placement Failure
Task Group "group":
  * No nodes were eligible for evaluation

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
1a99a69c  e6e43a5a  group       0        stop     lost    2m43s ago  23s ago
1914d5a9  e6e43a5a  group       0        stop     failed  6m14s ago  2m43s ago
4d64f58c  e6e43a5a  group       0        stop     failed  7m25s ago  6m14s ago

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
17784deb  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
f34b6262  50        node-update         example  default    e6e43a5a  complete  true
...

Then restart the node and wait for the blocked evaluation to complete:

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
40652e21  50        node-update         example  default    e6e43a5a  complete  false
4e69a3fe  50        queued-allocs       example  default    <none>    complete  false
9b5ed7fd  50        node-update         example  default    e6e43a5a  complete  true
...

The allocation has been replaced, but the replacement allocation doesn't have a reschedule tracker at all! That means none of the prior failures count against the group's reschedule attempts limit anymore.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3896afa8  e6e43a5a  group       0        run      running   19s ago    9s ago
1a99a69c  e6e43a5a  group       0        stop     complete  4m17s ago  14s ago
1914d5a9  e6e43a5a  group       0        stop     failed    7m48s ago  4m17s ago
4d64f58c  e6e43a5a  group       0        stop     failed    8m59s ago  7m48s ago

$ nomad operator api "/v1/allocation/3896afa8-c58b-f436-b4e9-3c5bb733f0b0" | jq .RescheduleTracker
null
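
Here we'd expect three events (the two shown earlier plus one recording the lost alloc 1a99a69c), but instead the tracker is null. A regression test for the fix could assert something like the following, using the hypothetical sketch types from the top of this issue (checkTrackerCarriedForward is illustrative, not Nomad code):

import "fmt"

// Hypothetical check using the sketch types above, not Nomad's real structs.
func checkTrackerCarriedForward(lost, replacement *Allocation) error {
	// The failure shown above: the replacement's tracker is nil.
	if replacement.RescheduleTracker == nil {
		return fmt.Errorf("replacement %s has no reschedule tracker", replacement.ID)
	}
	// Expected: everything the lost alloc had, plus one event for this reschedule.
	want := len(lost.RescheduleTracker.Events) + 1
	if got := len(replacement.RescheduleTracker.Events); got != want {
		return fmt.Errorf("want %d reschedule events, got %d", want, got)
	}
	return nil
}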
