lxd-agent: Fixes intermittent exec EOF closure when vsock listener is restarted just after boot #12405
This PR switches the lxd-agent vsock listener to use the VMADDR_CID_ANY (4294967295) CID, rather than trying to ascertain the VM's local CID and listening only on that.
The reason is two-fold:
We were seeing that the vsock.ContextID() call sometimes returned 4294967295 just after the vsock module was loaded, and only shortly afterwards started returning the correct CID assigned in QEMU. This would trigger the vsock CID change detector up to 30s later and cause the vsock listener to be restarted, prematurely terminating any exec operations that had started before the restart. The vsock CID change detector was originally added to detect when a VM was statefully restored/migrated in such a way that its QEMU assigned CID was changed whilst the VM was running. Such a change prevented LXD from using the lxd-agent until the lxd-agent noticed its local CID had changed and restarted its listener on the new CID.
However, while investigating this issue we observed that if we bound the lxd-agent listener to the VMADDR_CID_ANY (4294967295) CID, it continued to work even if the VM was statefully restored using a different CID. This is because VMADDR_CID_ANY seems to be used as a kind of wildcard CID. The vsock manpage says:
Consider using VMADDR_CID_ANY when binding instead of getting the local CID with
IOCTL_VM_SOCKETS_GET_LOCAL_CID.
There are several special addresses: VMADDR_CID_ANY (-1U) means any address for binding;
Binding to the VMADDR_CID_ANY address also allows us to simplify the vsock listener logic and remove the vsock CID change detector entirely, neatly sidestepping the original problem.
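
To make the failure mode concrete, here is a minimal sketch of why the early 4294967295 reading caused a restart. It is a simplified model and not the lxd-agent code: `vmaddrCIDAny` and `restartsOldListener` are hypothetical names, and the periodic CID readings are passed in as a plain slice rather than taken from vsock.ContextID().

```go
package main

import "math"

// vmaddrCIDAny mirrors VMADDR_CID_ANY (-1U), i.e. the largest uint32 value.
// Hypothetical local name used only for this sketch.
const vmaddrCIDAny uint32 = math.MaxUint32 // 4294967295

// restartsOldListener is a hypothetical model of the removed CID change
// detector. The listener binds to the first CID it observes, and every later
// reading that differs from the bound CID forces a rebind (a restart).
// It returns how many restarts the given sequence of readings would cause.
func restartsOldListener(readings []uint32) int {
	if len(readings) == 0 {
		return 0
	}

	bound := readings[0]
	restarts := 0

	for _, cid := range readings[1:] {
		if cid != bound {
			// The detector sees a "changed" CID and restarts the listener,
			// which would cut off any exec sessions already in flight.
			bound = cid
			restarts++
		}
	}

	return restarts
}
```

In this model, readings of `{vmaddrCIDAny, 3, 3}` (an early bogus value followed by a real CID of 3) produce one restart, while `{3, 3, 3}` produces none. A listener bound to the wildcard CID has no bound value to compare against, so there is nothing to detect. In real Go code, the `golang.org/x/sys/unix` package exposes the same constant as `unix.VMADDR_CID_ANY` for use in a `unix.SockaddrVM`.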