Satellite node offline, how to get it back #935

@grumblebrian

I have a test cluster running Talos with three nodes: one control plane and two workers. It worked well for almost a year, but stopped scheduling a few days ago because the satellite running on the control plane (gillespie) went dark. It shows as offline in linstor node list:

+------------------------------------------------------------+
| Node      | NodeType  | Addresses                | State   |
|============================================================|
| bell      | SATELLITE | 10.244.1.11:3366 (PLAIN) | Online  |
| coryell   | SATELLITE | 10.244.0.20:3366 (PLAIN) | Online  |
| gillespie | SATELLITE | 10.244.2.56:3366 (PLAIN) | OFFLINE |
+------------------------------------------------------------+
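To check this state from a script, the table can be parsed as text. This is a minimal sketch, assuming the output of linstor node list has been captured as a string in the same layout as above; the function name offline_nodes and the SAMPLE_OUTPUT variable are hypothetical and not part of LINSTOR.

```python
def offline_nodes(table_text):
    """Return the names of nodes whose State column is not 'Online'."""
    offline = []
    for line in table_text.splitlines():
        line = line.strip()
        # Data rows start with '|'; skip '+---+' borders and '|===|' separators.
        if not line.startswith("|") or line.startswith("|="):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Expect exactly four columns; skip the header row.
        if len(cells) != 4 or cells[0] == "Node":
            continue
        name, _node_type, _address, state = cells
        if state.lower() != "online":
            offline.append(name)
    return offline


# Sample copied from the table above (assumed to be captured as plain text).
SAMPLE_OUTPUT = """
+------------------------------------------------------------+
| Node      | NodeType  | Addresses                | State   |
|============================================================|
| bell      | SATELLITE | 10.244.1.11:3366 (PLAIN) | Online  |
| coryell   | SATELLITE | 10.244.0.20:3366 (PLAIN) | Online  |
| gillespie | SATELLITE | 10.244.2.56:3366 (PLAIN) | OFFLINE |
+------------------------------------------------------------+
"""

print(offline_nodes(SAMPLE_OUTPUT))  # prints ['gillespie']
```

The comparison is case-insensitive because the table shows "Online" and "OFFLINE" in different cases.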

The linstor-satellite, ha-controller and related pods are running only on the other two nodes, so I'm trying to figure out why the pods aren't starting on gillespie. No linstor-related pods are even stuck in a Pending state on gillespie; it's as if they aren't being created at all.

The pattern of errors in the linstor error log seems to just reflect the missing satellite:

┊ 695E880A-DA3D0-000006 ┊ 2026-01-07 17:04:40 ┊ S|coryell ┊ ResourceException: Failed to adjust DRBD resource pvc-7b575203-37a8-4508-b7d... ┊
┊ 695E880A-DA3D0-000007 ┊ 2026-01-07 17:04:40 ┊ S|coryell ┊ ResourceException: Failed to adjust DRBD resource pvc-e2bf2169-f142-493a-951... ┊
┊ 695E868F-7908B-000016 ┊ 2026-01-07 17:04:55 ┊ S|bell    ┊ ResourceException: Failed to adjust DRBD resource pvc-7b575203-37a8-4508-b7d... ┊
┊ 695E868F-7908B-000017 ┊ 2026-01-07 17:04:55 ┊ S|bell    ┊ ResourceException: Failed to adjust DRBD resource pvc-e2bf2169-f142-493a-951... ┊

I went through my notes from when I originally configured the operator on this cluster, and I didn't need to do anything special to create the linstor nodes for each server -- the nodes appear to have spun up automatically after I installed the operator.

When I inspect the controller logs, I see that the TaskScheduleService repeatedly logs that it is establishing a connection to gillespie, alongside REST API calls from the operator like the following:

linstor-controller 2026-01-07 17:16:05.673 [TaskScheduleService] INFO  LINSTOR/Controller/02cb29 SYSTEM - Establishing connection to node 'gillespie' via /10.244.2.56:3366 ...
linstor-controller 2026-01-07 17:16:13.587 [grizzly-http-server-2] INFO  LINSTOR/Controller/b942f9 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstStorPool
linstor-controller 2026-01-07 17:16:13.597 [grizzly-http-server-0] INFO  LINSTOR/Controller/d61874 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstVlm
linstor-controller 2026-01-07 17:16:13.607 [grizzly-http-server-1] INFO  LINSTOR/Controller/589db6 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstSnapshotDfn

Nothing looks like an error.
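To see how often the controller retries a given satellite, the connection lines can be counted. This is a minimal sketch, assuming the controller log has been saved as text and that the message format matches the line shown above; connection_attempts and SAMPLE_LOG are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Matches the "Establishing connection to node ..." message shown in the log above.
CONNECT_RE = re.compile(
    r"Establishing connection to node '([^']+)' via /([\d.]+):(\d+)"
)


def connection_attempts(log_text):
    """Count connection attempts per (node name, address) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONNECT_RE.search(line)
        if match:
            node, host, port = match.groups()
            counts[(node, f"{host}:{port}")] += 1
    return counts


# Two lines copied from the log excerpt above (assumed saved as plain text).
SAMPLE_LOG = (
    "linstor-controller 2026-01-07 17:16:05.673 [TaskScheduleService] INFO  "
    "LINSTOR/Controller/02cb29 SYSTEM - Establishing connection to node "
    "'gillespie' via /10.244.2.56:3366 ...\n"
    "linstor-controller 2026-01-07 17:16:13.587 [grizzly-http-server-2] INFO  "
    "LINSTOR/Controller/b942f9 SYSTEM - REST/API RestClient(10.244.1.15; "
    "'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstStorPool\n"
)

for (node, address), count in connection_attempts(SAMPLE_LOG).items():
    print(node, address, count)
```

A count that keeps growing for one node while the others stay at zero would be consistent with the controller retrying an address where no satellite is listening.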

This is a test environment that I could just rebuild, but I want to figure out how to recover from this type of situation.
