One of our nodes (hpc-01) went down because LXD could not start on it. Meanwhile, we noticed the following while trying to create a network forward from a healthy node (hpc-05) for a running instance:
ubuntu@hpc-05:~$ lxc network forward create default target_address=10.22.80.11 --allocate=ipv4
Error: Failed creating forward: Peer cluster member hpc-01 at REDACTED.24.228:8443 is down
Some context, in case it helps:
$ lxc cluster list -f compact
NAME    URL                           ROLES             ARCHITECTURE  FAILURE DOMAIN  DESCRIPTION  STATE      MESSAGE
hpc-01  https://REDACTED.24.228:8443                    x86_64        default                      EVACUATED  Unavailable due to maintenance
hpc-02  https://REDACTED.24.229:8443  database-leader   x86_64        default                      ONLINE     Fully operational
                                      database
hpc-03  https://REDACTED.24.230:8443  database          x86_64        default                      ONLINE     Fully operational
hpc-04  https://REDACTED.24.231:8443  database          x86_64        default                      ONLINE     Fully operational
hpc-05  https://REDACTED.24.232:8443  database-standby  x86_64        default                      ONLINE     Fully operational
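For anyone trying to reproduce this, a quick way to spot which members would block cluster-wide operations is to filter the member list by state. The following is only a sketch: it assumes the output of `lxc cluster list --format json` is a list of objects with `server_name`, `status` and `message` fields, and the sample data below is hand-written to mirror the table above, not captured from our cluster.

```python
import json

# Hypothetical sample mirroring the table above. The field names
# ("server_name", "status", "message") are an assumption about the JSON
# output of `lxc cluster list --format json`, not copied from our cluster.
SAMPLE_JSON = """
[
  {"server_name": "hpc-01", "status": "Evacuated", "message": "Unavailable due to maintenance"},
  {"server_name": "hpc-02", "status": "Online", "message": "Fully operational"},
  {"server_name": "hpc-03", "status": "Online", "message": "Fully operational"},
  {"server_name": "hpc-04", "status": "Online", "message": "Fully operational"},
  {"server_name": "hpc-05", "status": "Online", "message": "Fully operational"}
]
"""


def unavailable_members(raw_json):
    """Return the names of cluster members whose status is not 'online'."""
    members = json.loads(raw_json)
    # Compare case-insensitively, since the table view prints states in upper case.
    return [m["server_name"] for m in members if m.get("status", "").lower() != "online"]


if __name__ == "__main__":
    print(unavailable_members(SAMPLE_JSON))
```

On a live cluster the same function could be fed the real command output instead of `SAMPLE_JSON`, provided the field names match.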
# from node 5 to 1 (down)
$ lxc monitor --pretty
DEBUG [2025-03-05T12:47:04+01:00] Handling API request ip=@ method=GET protocol=unix url=/1.0 username=ubuntu
DEBUG [2025-03-05T12:47:04+01:00] Handling API request ip=@ method=POST protocol=unix url=/1.0/networks/default/forwards username=ubuntu
DEBUG [2025-03-05T12:47:12+01:00] Replace current raft nodes raftMembers="[{{2 REDACTED.24.230:8443 voter} hpc-03} {{3 REDACTED.24.229:8443 voter} hpc-02} {{4 REDACTED.24.231:8443 voter} hpc-04} {{5 REDACTED.24.232:8443 stand-by} hpc-05} {{1 REDACTED.24.228:8443 spare} hpc-01}]"
DEBUG [2025-03-05T12:47:12+01:00] Matched trusted cert fingerprint=bece91657c3b53df861f792d6ecc36154e2002f3156e409740a611822ffe78d8 subject="CN=root@hpc-02,O=LXD"
WARNING[2025-03-05T12:47:13+01:00] Failed heartbeat err="Failed to send heartbeat request: Put \"https://REDACTED.24.228:8443/internal/database\": dial tcp REDACTED.24.228:8443: connect: connection refused" remote="REDACTED.24.228:8443"
WARNING[2025-03-05T12:47:18+01:00] Failed heartbeat err="Failed to send heartbeat request: Put \"https://REDACTED.24.228:8443/internal/database\": dial tcp REDACTED.24.228:8443: connect: connection refused" remote="REDACTED.24.228:8443"
DEBUG [2025-03-05T12:47:20+01:00] Replace current raft nodes raftMembers="[{{2 REDACTED.24.230:8443 voter} hpc-03} {{3 REDACTED.24.229:8443 voter} hpc-02} {{4 REDACTED.24.231:8443 voter} hpc-04} {{5 REDACTED.24.232:8443 stand-by} hpc-05} {{1 REDACTED.24.228:8443 spare} hpc-01}]"
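The `raftMembers` value in the log above can be pulled apart to check each member's raft role. This is a rough sketch that assumes the value always has the shape `{{<id> <address> <role>} <name>}` seen in these two log lines; the regular expression and function name are our own, not part of LXD.

```python
import re

# Matches one entry of the form "{{<id> <address> <role>} <name>}".
# The pattern is inferred from the log lines above and is an assumption.
RAFT_MEMBER = re.compile(r"\{\{(\d+) (\S+) ([\w-]+)\} ([\w-]+)\}")


def parse_raft_members(value):
    """Turn a raftMembers log value into a list of dicts."""
    return [
        {"id": int(member_id), "address": address, "role": role, "name": name}
        for member_id, address, role, name in RAFT_MEMBER.findall(value)
    ]


# Value copied from the log output above.
LOG_VALUE = (
    "[{{2 REDACTED.24.230:8443 voter} hpc-03} "
    "{{3 REDACTED.24.229:8443 voter} hpc-02} "
    "{{4 REDACTED.24.231:8443 voter} hpc-04} "
    "{{5 REDACTED.24.232:8443 stand-by} hpc-05} "
    "{{1 REDACTED.24.228:8443 spare} hpc-01}]"
)

if __name__ == "__main__":
    for member in parse_raft_members(LOG_VALUE):
        print(member["name"], member["role"])
```

Read this way, the log lists hpc-01 as a spare rather than a voter, while the three voters (hpc-02, hpc-03, hpc-04) are all online.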
On hpc-01, all MicroCloud services are stopped, and we also tried shutting down the node.
ubuntu@hpc-01:~$ snap services
Service Startup Current Notes
lxd.activate enabled inactive -
lxd.daemon enabled inactive socket-activated
lxd.user-daemon enabled inactive socket-activated
microceph.daemon enabled inactive -
microceph.mds enabled inactive -
microceph.mgr enabled inactive -
microceph.mon enabled inactive -
microceph.osd enabled inactive -
microceph.rbd-mirror disabled inactive -
microceph.rgw disabled inactive -
microcloud.daemon enabled inactive -
microovn.chassis enabled inactive -
microovn.daemon enabled inactive -
microovn.ovn-northd enabled inactive -
microovn.ovn-ovsdb-server-nb enabled inactive -
microovn.ovn-ovsdb-server-sb enabled inactive -
microovn.refresh-expiring-certs enabled inactive timer-activated
microovn.switch enabled inactive -
Edit:
If we try to create a new OVN network, it fails with the same error. Deleting one fails as well, so we guess any OVN operation will meet the same fate.
We also get this error message in the LXD UI:
Could not load network state: Failed to run: ovn-sbctl --timeout=10 --db ssl:150.161.24.228:6642,ssl:150.161.24.230:6642,ssl:150.161.24.229:6642 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 get Chassis hostname: exit status 1 (ovn-sbctl: no row "" in table Chassis)
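To make the `--db` argument in that error easier to read, here is a small sketch that splits it into endpoints and picks out the ones on a given host. It assumes the address ending in `.24.228` is hpc-01, as the cluster list suggests, and it only handles `scheme:IPv4:port` entries like the ones shown. The function names are ours.

```python
def parse_db_endpoints(db_arg):
    """Split an ovn-sbctl --db value into (scheme, host, port) tuples.

    Only handles 'scheme:host:port' entries with IPv4 hosts, as in the
    error message above; bracketed IPv6 addresses are not supported.
    """
    endpoints = []
    for item in db_arg.split(","):
        scheme, host, port = item.strip().split(":")
        endpoints.append((scheme, host, int(port)))
    return endpoints


def endpoints_on_host(db_arg, host_suffix):
    """Return the endpoints whose host ends with the given suffix."""
    return [e for e in parse_db_endpoints(db_arg) if e[1].endswith(host_suffix)]


# Value copied from the error message above.
DB_ARG = "ssl:150.161.24.228:6642,ssl:150.161.24.230:6642,ssl:150.161.24.229:6642"

if __name__ == "__main__":
    # ".24.228" is assumed to identify hpc-01 (the node that is down).
    print(endpoints_on_host(DB_ARG, ".24.228"))
```

If that assumption holds, the southbound database connection string still lists the node that is down as its first endpoint.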