lxd-agent: Fixes intermittent exec EOF closure when vsock listener is restarted just after boot

This PR switches the lxd-agent vsock listener to use the VMADDR_CID_ANY (4294967295) CID,
rather than trying to ascertain the VM's local CID and listening only on that.

The reason is two-fold:

 1. We were seeing that the vsock.ContextID() call sometimes returned 4294967295 just after
    the vsock module was loaded, and only shortly afterward started returning the correct
    CID assigned in QEMU. This would trigger the vsock CID change detector up to 30s later
    and cause the vsock listener to be restarted. Any ongoing exec operations that had
    started before that would be prematurely terminated. The vsock CID change detector was
    originally added to detect when a VM was statefully restored/migrated in such a way that
    its QEMU-assigned CID was changed whilst the VM was running. Such a change prevented LXD
    from using the lxd-agent until the lxd-agent noticed its local CID had changed and
    restarted its listener on the new CID.

 2. However, while investigating this issue we observed that if we bound the lxd-agent
    listener to the VMADDR_CID_ANY (4294967295) CID, it continued to work even if the VM
    was statefully restored using a different CID. This is because VMADDR_CID_ANY seems
    to be used as a kind of wildcard CID. The vsock manpage says:

     Consider using VMADDR_CID_ANY when binding instead of getting the local CID with
     IOCTL_VM_SOCKETS_GET_LOCAL_CID.

     There are several special addresses: VMADDR_CID_ANY (-1U) means any address for binding;

Binding to the VMADDR_CID_ANY address also allows us to simplify the vsock listener logic
and remove the vsock CID change detector entirely, neatly sidestepping the original problem.
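
As a small illustration of the values quoted above, the following sketch checks that the manpage's -1U and the decimal 4294967295 used in this commit are the same 32-bit value. The helper isWildcardCID is hypothetical and is not part of lxd-agent; it exists only to make the comparison explicit.

```go
package main

import (
	"fmt"
	"math"
)

// CIDAny mirrors the value this commit uses for VMADDR_CID_ANY.
const CIDAny uint32 = 4294967295

// isWildcardCID is a hypothetical helper (not part of lxd-agent) that
// reports whether a CID is the wildcard value rather than a real CID.
func isWildcardCID(cid uint32) bool {
	return cid == CIDAny
}

func main() {
	// C's -1U is the all-ones unsigned 32-bit value, written ^uint32(0) in Go.
	minusOneU := ^uint32(0)

	fmt.Println(minusOneU == CIDAny)      // true
	fmt.Println(CIDAny == math.MaxUint32) // true
	fmt.Println(isWildcardCID(3))         // false: 3 is an ordinary CID
}
```

This is also why an early vsock.ContextID() result of 4294967295 should probably be read as "no CID assigned yet" rather than as a real CID.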

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
tomponline committed Oct 18, 2023
1 parent b6ec0f2 commit 4666e37
Showing 3 changed files with 7 additions and 34 deletions.
9 changes: 7 additions & 2 deletions lxd-agent/api_1.0.go
@@ -192,8 +192,13 @@ func getClient(CID uint32, port int, serverCertificate string) (*http.Client, er
 }

 func startHTTPServer(d *Daemon, debug bool) error {
-	// Setup the listener on VM's context ID for inbound connections from LXD.
-	l, err := vsock.Listen(shared.HTTPSDefaultPort, nil)
+	const CIDAny uint32 = 4294967295 // Equivalent to VMADDR_CID_ANY.
+
+	// Setup the listener on wildcard CID for inbound connections from LXD.
+	// We use the VMADDR_CID_ANY CID so that if the VM's CID changes in the future the listener still works.
+	// A CID change can occur when restoring a stateful VM that was previously using one CID but is
+	// subsequently restored using a different one.
+	l, err := vsock.ListenContextID(CIDAny, shared.HTTPSDefaultPort, nil)
 	if err != nil {
 		return fmt.Errorf("Failed to listen on vsock: %w", err)
 	}
6 changes: 0 additions & 6 deletions lxd-agent/daemon.go
@@ -4,7 +4,6 @@ import (
 	"sync"

 	"github.com/canonical/lxd/lxd/events"
-	"github.com/canonical/lxd/lxd/vsock"
 )

 // A Daemon can respond to requests from a shared client.
@@ -17,8 +16,6 @@ type Daemon struct {
 	serverPort uint32
 	serverCertificate string

-	localCID uint32
-
 	// The channel which is used to indicate that the lxd-agent was able to connect to LXD.
 	chConnected chan struct{}

@@ -31,11 +28,8 @@ type Daemon struct {
 func newDaemon(debug, verbose bool) *Daemon {
 	lxdEvents := events.NewServer(debug, verbose, nil)

-	cid, _ := vsock.ContextID()
-
 	return &Daemon{
 		events: lxdEvents,
 		chConnected: make(chan struct{}),
-		localCID: cid,
 	}
 }
26 changes: 0 additions & 26 deletions lxd-agent/main_agent.go
@@ -18,7 +18,6 @@ import (
 	"github.com/canonical/lxd/lxd/instance/instancetype"
 	"github.com/canonical/lxd/lxd/storage/filesystem"
 	"github.com/canonical/lxd/lxd/util"
-	"github.com/canonical/lxd/lxd/vsock"
 	"github.com/canonical/lxd/shared"
 	"github.com/canonical/lxd/shared/logger"
 )
@@ -136,31 +135,6 @@ func (c *cmdAgent) Run(cmd *cobra.Command, args []string) error {
 		return fmt.Errorf("Failed to start HTTP server: %w", err)
 	}

-	// Check context ID periodically, and restart the HTTP server if needed.
-	go func() {
-		for range time.Tick(30 * time.Second) {
-			cid, err := vsock.ContextID()
-			if err != nil {
-				continue
-			}
-
-			if d.localCID == cid {
-				continue
-			}
-
-			// Restart server
-			servers["http"].Close()
-
-			err = startHTTPServer(d, c.global.flagLogDebug)
-			if err != nil {
-				errChan <- err
-			}
-
-			// Update context ID.
-			d.localCID = cid
-		}
-	}()
-
 	// Check whether we should start the devlxd server in the early setup. This way, /dev/lxd/sock
 	// will be available for any systemd services starting after the lxd-agent.
 	if shared.PathExists("agent.conf") {
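
To make the failure mode described in the commit message concrete, here is a minimal simulation of the comparison the removed detector performed. It is only a sketch: countRestarts and the sample readings are hypothetical, and the real loop polled vsock.ContextID() every 30 seconds and restarted the HTTP server rather than counting.

```go
package main

import "fmt"

// cidAny stands in for VMADDR_CID_ANY (4294967295).
const cidAny uint32 = 4294967295

// countRestarts is a hypothetical replay of the removed detector's logic:
// initial is the CID cached at daemon start, readings are later polls.
// Each reading that differs from the cached CID counts as one listener
// restart, after which the cached CID is updated.
func countRestarts(initial uint32, readings []uint32) int {
	local := initial
	restarts := 0

	for _, cid := range readings {
		if cid == local {
			continue
		}

		restarts++  // The real code closed and restarted the HTTP server here.
		local = cid // Update the cached context ID.
	}

	return restarts
}

func main() {
	// Startup read raced with module load and returned the wildcard value;
	// later polls see the CID QEMU actually assigned (3 in this example).
	fmt.Println(countRestarts(cidAny, []uint32{3, 3, 3})) // 1

	// Startup read already returned the real CID: no restart.
	fmt.Println(countRestarts(3, []uint32{3, 3, 3})) // 0
}
```

In the first case the single restart is spurious: the VM was never restored or migrated, yet any exec session running over the old listener would be cut off. Listening on the wildcard CID removes the need for the comparison altogether.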