I have a test cluster running Talos with three nodes: one control plane and two workers. It worked well for almost a year, but stopped scheduling a few days ago because the satellite running on the control plane node (gillespie) went dark. It now shows as OFFLINE in linstor node list:
+------------------------------------------------------------+
| Node      | NodeType  | Addresses                | State   |
|============================================================|
| bell      | SATELLITE | 10.244.1.11:3366 (PLAIN) | Online  |
| coryell   | SATELLITE | 10.244.0.20:3366 (PLAIN) | Online  |
| gillespie | SATELLITE | 10.244.2.56:3366 (PLAIN) | OFFLINE |
+------------------------------------------------------------+
The linstor-satellite, ha-controller and related pods are running only on the other two nodes, so I'm trying to figure out why the pods aren't starting on gillespie. No linstor-related pods are even stuck in a Pending state on gillespie -- it's as if they are never being created in the first place.
The pattern of errors in the linstor error log seems to just reflect the missing satellite:
┊ 695E880A-DA3D0-000006 ┊ 2026-01-07 17:04:40 ┊ S|coryell ┊ ResourceException: Failed to adjust DRBD resource pvc-7b575203-37a8-4508-b7d... ┊
┊ 695E880A-DA3D0-000007 ┊ 2026-01-07 17:04:40 ┊ S|coryell ┊ ResourceException: Failed to adjust DRBD resource pvc-e2bf2169-f142-493a-951... ┊
┊ 695E868F-7908B-000016 ┊ 2026-01-07 17:04:55 ┊ S|bell ┊ ResourceException: Failed to adjust DRBD resource pvc-7b575203-37a8-4508-b7d... ┊
┊ 695E868F-7908B-000017 ┊ 2026-01-07 17:04:55 ┊ S|bell ┊ ResourceException: Failed to adjust DRBD resource pvc-e2bf2169-f142-493a-951... ┊
I went through my notes from when I originally configured the operator on this cluster, and I didn't need to do anything special to create the linstor nodes for each server -- the nodes appear to have spun up automatically after I installed the operator.
When I inspect the controller logs, I see the TaskScheduleService repeatedly trying to establish a connection to gillespie, interleaved with routine REST API calls from the operator, like the following:
linstor-controller 2026-01-07 17:16:05.673 [TaskScheduleService] INFO LINSTOR/Controller/02cb29 SYSTEM - Establishing connection to node 'gillespie' via /10.244.2.56:3366 ...
linstor-controller 2026-01-07 17:16:13.587 [grizzly-http-server-2] INFO LINSTOR/Controller/b942f9 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstStorPool
linstor-controller 2026-01-07 17:16:13.597 [grizzly-http-server-0] INFO LINSTOR/Controller/d61874 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstVlm
linstor-controller 2026-01-07 17:16:13.607 [grizzly-http-server-1] INFO LINSTOR/Controller/589db6 SYSTEM - REST/API RestClient(10.244.1.15; 'piraeus-operator/v2.10.3-4e0a21b886ff440c5cfea6760a0f83fd4daa0d47')/LstSnapshotDfn
Nothing looks like an error.
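Since the controller keeps retrying 10.244.2.56:3366, one thing that could be checked is whether anything still answers at that address. Below is a minimal sketch of a plain TCP probe, assuming it is run from somewhere that can reach the pod network (for example a debug pod). `can_connect` is a hypothetical helper name for illustration, not part of LINSTOR or the operator.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only shows that something accepts connections on the port;
    it does not verify that the listener is a healthy LINSTOR satellite.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("10.244.2.56", 3366)` returning False while the same call for 10.244.1.11 and 10.244.0.20 returns True would be consistent with the controller retrying a pod address that no longer has a satellite behind it. That is only one possible reading, not a diagnosis.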
This is a test environment that I could just rebuild, but I want to figure out how to recover from this type of situation.