
lxc network forward create (or any network operation) will fail if one node is down #15115

Open
@wideawakening


One of our nodes (hpc-01) went down because LXD could not start on it. In the meantime, we noticed the following when trying to create a forward from a healthy node (hpc-05) for a running instance:

ubuntu@hpc-05:~$ lxc network forward create default target_address=10.22.80.11 --allocate=ipv4
Error: Failed creating forward: Peer cluster member hpc-01 at REDACTED.24.228:8443 is down

Some context, in case it helps:


$ lxc cluster list -f compact
   NAME               URL                   ROLES        ARCHITECTURE  FAILURE DOMAIN  DESCRIPTION    STATE               MESSAGE              
  hpc-01  https://REDACTED.24.228:8443                    x86_64        default                      EVACUATED  Unavailable due to maintenance  
  hpc-02  https://REDACTED.24.229:8443  database-leader   x86_64        default                      ONLINE     Fully operational               
                                       database                                                                                                
  hpc-03  https://REDACTED.24.230:8443  database          x86_64        default                      ONLINE     Fully operational               
  hpc-04  https://REDACTED.24.231:8443  database          x86_64        default                      ONLINE     Fully operational               
  hpc-05  https://REDACTED.24.232:8443  database-standby  x86_64        default                      ONLINE     Fully operational    
  
  
# from node 5 to 1 (down)
$ lxc monitor --pretty
DEBUG  [2025-03-05T12:47:04+01:00] Handling API request                          ip=@ method=GET protocol=unix url=/1.0 username=ubuntu
DEBUG  [2025-03-05T12:47:04+01:00] Handling API request                          ip=@ method=POST protocol=unix url=/1.0/networks/default/forwards username=ubuntu
DEBUG  [2025-03-05T12:47:12+01:00] Replace current raft nodes                    raftMembers="[{{2 REDACTED.24.230:8443 voter} hpc-03} {{3 REDACTED.24.229:8443 voter} hpc-02} {{4 REDACTED.24.231:8443 voter} hpc-04} {{5 REDACTED.24.232:8443 stand-by} hpc-05} {{1 REDACTED.24.228:8443 spare} hpc-01}]"
DEBUG  [2025-03-05T12:47:12+01:00] Matched trusted cert                          fingerprint=bece91657c3b53df861f792d6ecc36154e2002f3156e409740a611822ffe78d8 subject="CN=root@hpc-02,O=LXD"
WARNING[2025-03-05T12:47:13+01:00] Failed heartbeat                              err="Failed to send heartbeat request: Put \"https://REDACTED.24.228:8443/internal/database\": dial tcp REDACTED.24.228:8443: connect: connection refused" remote="REDACTED.24.228:8443"
WARNING[2025-03-05T12:47:18+01:00] Failed heartbeat                              err="Failed to send heartbeat request: Put \"https://REDACTED.24.228:8443/internal/database\": dial tcp REDACTED.24.228:8443: connect: connection refused" remote="REDACTED.24.228:8443"
DEBUG  [2025-03-05T12:47:20+01:00] Replace current raft nodes                    raftMembers="[{{2 REDACTED.24.230:8443 voter} hpc-03} {{3 REDACTED.24.229:8443 voter} hpc-02} {{4 REDACTED.24.231:8443 voter} hpc-04} {{5 REDACTED.24.232:8443 stand-by} hpc-05} {{1 REDACTED.24.228:8443 spare} hpc-01}]"
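For anyone trying to reproduce this, here is a rough sketch of how to list the members that are not online before attempting a network operation. It assumes `lxc cluster list --format json` returns a list of objects with `server_name` and `status` fields. The helper names `fetch_cluster_json` and `members_not_online` are made up for illustration and are not part of LXD.

```python
import json
import subprocess


def fetch_cluster_json() -> str:
    """Run `lxc cluster list --format json` and return its raw output."""
    result = subprocess.run(
        ["lxc", "cluster", "list", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def members_not_online(cluster_json: str) -> list:
    """Return (name, status) pairs for members whose status is not Online.

    Assumption: each member object carries `server_name` and `status` keys.
    The comparison is case-insensitive because the table output shows the
    state in upper case (ONLINE, EVACUATED).
    """
    members = json.loads(cluster_json)
    return [
        (m.get("server_name", "?"), m.get("status", "?"))
        for m in members
        if str(m.get("status", "")).lower() != "online"
    ]
```

Calling `members_not_online(fetch_cluster_json())` on a cluster in the state shown above should report hpc-01 as the only member that is not online, assuming the JSON shape matches.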

On hpc-01, all MicroCloud services are stopped, and we also tried shutting down the node.

ubuntu@hpc-01:~$ snap services
Service                          Startup   Current   Notes
lxd.activate                     enabled   inactive  -
lxd.daemon                       enabled   inactive  socket-activated
lxd.user-daemon                  enabled   inactive  socket-activated
microceph.daemon                 enabled   inactive  -
microceph.mds                    enabled   inactive  -
microceph.mgr                    enabled   inactive  -
microceph.mon                    enabled   inactive  -
microceph.osd                    enabled   inactive  -
microceph.rbd-mirror             disabled  inactive  -
microceph.rgw                    disabled  inactive  -
microcloud.daemon                enabled   inactive  -
microovn.chassis                 enabled   inactive  -
microovn.daemon                  enabled   inactive  -
microovn.ovn-northd              enabled   inactive  -
microovn.ovn-ovsdb-server-nb     enabled   inactive  -
microovn.ovn-ovsdb-server-sb     enabled   inactive  -
microovn.refresh-expiring-certs  enabled   inactive  timer-activated
microovn.switch                  enabled   inactive  -

Edit:

If we try to create a new OVN network, it fails with the same error. Deleting one fails as well, so presumably any OVN operation will fail the same way.

We also get this error message in the LXD UI:

Could not load network state: Failed to run: ovn-sbctl --timeout=10 --db ssl:150.161.24.228:6642,ssl:150.161.24.230:6642,ssl:150.161.24.229:6642 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 get Chassis hostname: exit status 1 (ovn-sbctl: no row "" in table Chassis)
