-
Notifications
You must be signed in to change notification settings - Fork 378
linux: add support for net devices #1750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideThis PR introduces netlink-based support for moving, renaming, and configuring network interfaces inside container network namespaces, integrates a startup hook to apply user-specified net devices, advertises the netDevices capability in the OCI features JSON, and updates the build to include the new net_device code. File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a proper review but just come across this by chance
src/libcrun/net_device.c
Outdated
|
||
req.nlh.nlmsg_len = NLMSG_LENGTH (sizeof (struct rtgenmsg)); | ||
req.nlh.nlmsg_type = RTM_GETLINK; | ||
req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually you don't have to dump all links to get a single one, you can request by name and that is what we are doing in netavark.
https://github.com/containers/netavark/blob/c9649e875a9d5932fad8293095bac2938288a93f/src/network/netlink.rs#L132
This makes things much simpler as you don't have to loop over all links and compare each name yourself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, that was one of the things I wanted to address before moving it out of Draft, plus some other cleanups
3ce1a2d
to
3e7e93f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @giuseppe - I've reviewed your changes - here's some feedback:
- Using vfork to call setup_network_device_in_ns is unsafe because the child invokes non-async-signal-safe functions; switch to a regular fork or exec-based approach to avoid undefined behavior.
- net_device.c has a lot of repeated netlink setup/teardown code—consider refactoring common socket, seq, and message handling into helper functions to reduce duplication and improve readability.
- Relying on a fixed 8 KB buffer for all netlink messages may lead to truncation or overflow in complex scenarios; either validate message sizes more rigorously or switch to dynamically sizing the buffer.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
src/libcrun/net_device.c
Outdated
ret = addattr (nlh, buffer_size, IFLA_IFNAME, newifname, strlen (newifname) + 1, err); | ||
if (UNLIKELY (ret < 0)) | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): No validation of newifname length before passing to addattr.
Validate newifname length against IFNAMSIZ before calling addattr to avoid potential overflow.
ret = addattr (nlh, buffer_size, IFLA_IFNAME, newifname, strlen (newifname) + 1, err); | |
if (UNLIKELY (ret < 0)) | |
return ret; | |
/* validate interface name length (including NUL) */ | |
if (UNLIKELY (strlen (newifname) + 1 > IFNAMSIZ)) | |
return crun_make_error (err, EINVAL, "interface name too long: %s", newifname); | |
ret = addattr (nlh, buffer_size, IFLA_IFNAME, newifname, strlen (newifname) + 1, err); | |
if (UNLIKELY (ret < 0)) | |
return ret; |
333376a
to
c730410
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a plan to expose this feature somehow in podman? Or do you know if docker will expose this? Of course spec wise it seems straight forward but I mostly wonder about the podman/docker UI. I guess I should create an issue on podman and we can talk about it there.
src/libcrun/net_device.c
Outdated
if (ifa->ifa_index != ifindex) | ||
{ | ||
nlh = NLMSG_NEXT (nlh, len); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strace ip addr show dev lo
That we can also short circuit this by requesting address on the link only from the kernel instead of filtering them here userspace.
I assume something like req->ifa.ifa_index = ifindex
on line 229?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that doesn't seem to work, I always get the whole list, let me try again and if perhaps there is a different command to achieve it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm.. looks like I need to use NETLINK_GET_STRICT_CHK
src/libcrun/net_device.c
Outdated
ret = append_rtattr (&(req->nlh), buffer_size, IFA_LOCAL, &ip->data, ip->len, err); | ||
if (UNLIKELY (ret < 0)) | ||
return ret; | ||
ret = append_rtattr (&(req->nlh), buffer_size, IFA_ADDRESS, &ip->data, ip->len, err); | ||
if (UNLIKELY (ret < 0)) | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you only collect IFA_ADDRESS on the dump but then assume you can set both IFA_ADDRESS and IFA_LOCAL here
From include/uapi/linux/if_addr.h:
- Important comment:
- IFA_ADDRESS is prefix address, rather than local interface address.
- It makes no difference for normally configured broadcast interfaces,
- but for point-to-point IFA_ADDRESS is DESTINATION address,
- local address is supplied in IFA_LOCAL attribute.
Also comparing against netavark we only set IFA_LOCAL but we do also set IFA_BROADCAST for ipv4 only.
Skipping the IFA_BROADCAST broadcast in netavark seems to create an interface where the broadcast address is not shown in ip addr. Not sure if this means one cannot use broadcast then but I guess such behavior would not be expected from this move interface feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea any suggestions here? Shoul we restore IFA_BROADCAST?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the logic in the golang library https://github.com/vishvananda/netlink/blob/17daef607c6442d47b0565343cf8a69f985a4cb7/addr_linux.go#L128-L145
It infers the broadcast from the IP address, that comes in CIDR format IP/MASK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be documented in the specs?
Added the same logic now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree relying on the behavior of one golang netlink lib seems odd as part of the spec?
The spec currently says
The runtime MUST preserve existing network interface attributes, including all
permanent IP addresses (IFA_F_PERMANENT flag) of any family with global scope
(RT_SCOPE_UNIVERSE value) as defined in [RFC 3549 Section 2.3.3.2
][rfc3549].
This ensures that only addresses intended for persistent, external communication
are transferred.
To me this means all netlink attributes should be preserved for all addresses that have IFA_F_PERMANENT and RT_SCOPE_UNIVERSE set not just just a few IFA_LOCAL/IFA_ADDRESS/IFA_BROADCAST.
And always setting IFA_BROADCAST like that would mean if the host interface did not have IFA_BROADCAST broadcast set then the container interface all of the sudden has it which certainly feels wrong, I would expect the runtime to not add new attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Luap99 I agree, let's drop this mechanism for now, it is surprising that the broadcast is configured when it is not present.
The patch to add it back is easy in case we need it:
diff --git a/src/libcrun/net_device.c b/src/libcrun/net_device.c
index 9231bc08..b25d8ab9 100644
--- a/src/libcrun/net_device.c
+++ b/src/libcrun/net_device.c
@@ -328,6 +328,8 @@ configure_ip_addresses (int sock, int ifindex, char *buffer, size_t buffer_size,
for (ip = ips; ip->rta_len >= 0; ip++)
{
+ struct in_addr ifa_addr = { 0 };
+ bool has_broadcast = false;
/* RTA_NEXT modifies the argument, so use a copy. */
int rta_len = ip->rta_len;
struct rtattr *rta;
@@ -341,11 +343,41 @@ configure_ip_addresses (int sock, int ifindex, char *buffer, size_t buffer_size,
for (rta = (struct rtattr *) ip->rta; RTA_OK (rta, rta_len); rta = RTA_NEXT (rta, rta_len))
{
+ switch (rta->rta_type)
+ {
+ case IFA_ADDRESS:
+ if (ip->ifa.ifa_family == AF_INET)
+ {
+ if (RTA_PAYLOAD (rta) != sizeof (ifa_addr.s_addr))
+ return crun_make_error (err, 0, "internal error: wrong size type for AF_INET address");
+ memcpy (&ifa_addr.s_addr, RTA_DATA (rta), sizeof (ifa_addr.s_addr));
+ }
+ break;
+ case IFA_BROADCAST:
+ has_broadcast = true;
+ break;
+
+ default:
+ break;
+ }
+
ret = append_rtattr (&(req->nlh), buffer_size, rta->rta_type, RTA_DATA (rta), RTA_PAYLOAD (rta), err);
if (UNLIKELY (ret < 0))
return ret;
}
+ if (ip->ifa.ifa_family == AF_INET && ! has_broadcast && ip->ifa.ifa_prefixlen > 0
+ && ip->ifa.ifa_prefixlen < 31 && ifa_addr.s_addr != 0)
+ {
+ /* Guess the broadcast address when it is not specified. */
+ uint32_t mask = htonl (~((1ULL << (32 - ip->ifa.ifa_prefixlen)) - 1));
+ ifa_addr.s_addr |= ~mask;
+
+ ret = append_rtattr (&(req->nlh), buffer_size, IFA_BROADCAST, &ifa_addr, sizeof (ifa_addr), err);
+ if (UNLIKELY (ret < 0))
+ return ret;
+ }
+
ret = send_request (sock, req, err);
if (UNLIKELY (ret < 0))
return ret;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry let me clarify, because I leaked some details of the implementation on the comment and make it more confusing, in the golang library the library translates the address characteristics to golang types and it changes the IFA_BROADCAST to a net.IPNet with ip/mask and when it recovers it recalculates the IFA_BROADCAST field, so effectively it passes IFA_BROADCAST
so inet inet 192.168.8.5/24
is IFA_LOCAL=192.168.8.5 and IFA_BROADCAST=192.168.8.255
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=88, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1747830917, nlmsg_pid=45678}, {ifa_family=AF_INET, ifa_prefixlen=24, ifa_flags=IFA_F_PERMANENT, ifa_scope=RT_SCOPE_UNIVERSE, ifa_index=if_nametoindex("eth0")}, [[{nla_len=8, nla_type=IFA_ADDRESS}, inet_addr("192.168.8.5")], [{nla_len=8, nla_type=IFA_LOCAL}, inet_addr("192.168.8.5")], [{nla_len=8, nla_type=IFA_BROADCAST}, inet_addr("192.168.8.255")], [{nla_len=9, nla_type=IFA_LABEL}, "eth0"], [{nla_len=8, nla_type=IFA_FLAGS}, IFA_F_PERMANENT], [{nla_len=20, nla_type=IFA_CACHEINFO}, {ifa_prefered=4294967295, ifa_valid=4294967295, cstamp=147250349, tstamp=147250349}]]], [{nlmsg_len=72, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1747830917, nlmsg_pid=45678}, {ifa_family=AF_INET6, ifa_prefixlen=64, ifa_flags=IFA_F_NODAD|IFA_F_PERMANENT, ifa_scope=RT_SCOPE_UNIVERSE, ifa_index=if_nametoindex("eth0")}, [[{nla_len=20, nla_type=IFA_ADDRESS}, inet_pton(AF_INET6, "fc00:f853:ccd:e793::5")], [{nla_len=20, nla_type=IFA_CACHEINFO}, {ifa_prefered=4294967295, ifa_valid=4294967295, cstamp=147250352, tstamp=147250352}], [{nla_len=8, nla_type=IFA_FLAGS}, IFA_F_NODAD|IFA_F_PERMANENT]]], [{nlmsg_len=80, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1747830917, nlmsg_pid=45678}, {ifa_family=AF_INET6, ifa_prefixlen=64, ifa_flags=IFA_F_PERMANENT, ifa_scope=RT_SCOPE_LINK, ifa_index=if_nametoindex("eth0")}, [[{nla_len=20, nla_type=IFA_ADDRESS}, inet_pton(AF_INET6, "fe80::42:c0ff:fea8:805")], [{nla_len=20, nla_type=IFA_CACHEINFO}, {ifa_prefered=4294967295, ifa_valid=4294967295, cstamp=147250352, tstamp=147250352}], [{nla_len=8, nla_type=IFA_FLAGS}, IFA_F_PERMANENT], [{nla_len=5, nla_type=IFA_PROTO}, "\x03"]]]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 240
so I agree with @Luap99 interpretations, because if user has 192.168.8.5/24 configured and we do not move the broadcast address, it will have 192.168.8.5/32 on the container that will break applications that depend on that interface to cover the entire subnet range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea thanks, so are you ok with the current implementation where the BROADCAST is propagated only when it is set (and not always forced to be present)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is the expectation, scratch my first comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonderful, thanks for confirming it
I am adding it only to be compatible with the spec. We already have a tool to configure the network, would it be helpful for us to ask the OCI runtime to do the final move? Would it improve anything? |
nlh = NLMSG_NEXT (nlh, len); | ||
continue; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and I just fully read the spec
The runtime MUST preserve existing network interface attributes, including all
permanent IP addresses (IFA_F_PERMANENT flag) of any family with global scope
(RT_SCOPE_UNIVERSE value) as defined in [RFC 3549 Section 2.3.3.2
][rfc3549].
This ensures that only addresses intended for persistent, external communication
are transferred.
You already apply IFA_F_PERMANENT to the address when you set them but this here doesn't seem to skip addresses without IFA_F_PERMANENT set? I guess we would not want to copy an ipv6 link local address.
Not if we fully manage the netns with netavark already, but I know we had a few folks asking to just move an existing interface (containers/podman#25511) and for them this would be the perfect fit I think. Obviously we could add such feature in netavark but if the runtime supports it we might as well use and expose that. |
a6d077b
to
08285fa
Compare
Ephemeral COPR build failed. @containers/packit-build please check. |
c2e340a
to
cc1f541
Compare
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
cc1f541
to
10434ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @giuseppe - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
src/libcrun/net_device.c
Outdated
ret = setsockopt (sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &optval, sizeof (optval)); | ||
if (ret < 0) | ||
return crun_make_error (err, errno, "setsockopt (NETLINK_GET_STRICT_CHK)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: NETLINK_GET_STRICT_CHK may not be available on all kernels.
Wrap the setsockopt call in a feature check or catch ENOPROTOOPT, falling back gracefully when NETLINK_GET_STRICT_CHK isn’t supported.
ret = setsockopt (sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &optval, sizeof (optval)); | |
if (ret < 0) | |
return crun_make_error (err, errno, "setsockopt (NETLINK_GET_STRICT_CHK)"); | |
#ifdef NETLINK_GET_STRICT_CHK | |
ret = setsockopt (sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &optval, sizeof (optval)); | |
if (ret < 0) | |
{ | |
if (errno != ENOPROTOOPT) | |
return crun_make_error (err, errno, "setsockopt (NETLINK_GET_STRICT_CHK)"); | |
/* NETLINK_GET_STRICT_CHK not supported by this kernel, continue without strict checking */ | |
} | |
#endif |
src/libcrun/net_device.c
Outdated
if (ip->ifa.ifa_family == AF_INET && ! has_broadcast && ip->ifa.ifa_prefixlen < 31 && ifa_addr.s_addr != 0) | ||
{ | ||
/* Guess the broadcast address when it is not specified. */ | ||
uint32_t mask = htonl (~((1ULL << (32 - ip->ifa.ifa_prefixlen)) - 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Bitwise operations for mask calculation may have undefined behavior for prefixlen=0.
Guard against ip->ifa.ifa_prefixlen == 0 before computing the mask to avoid the undefined left shift by 32 bits.
e88f3b8
to
76f1006
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel qualified to review c code especially with a bunch of buffer copies from/to netlink types but netlink logic wise this looks good to me.
Not sure how you handle testing here normally, I would assume it should be quite simple to
ip link add test1 type dummy
ip addr add 10.0.0.0/24 dev test1
And then move that into the namespace with crun and run ip addr
in the container to check if the address was copied properly? Not sure if this is easy to do here.
that would be easy to add, but so far I've tried to avoid adding tests that modify the host. I'll check if I can do that from a new network namespace used only for the test |
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
opencontainers/runtime-spec#1271 added support for moving existing network devices to the container network namespace. Closes: containers#1712 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
76f1006
to
07374bb
Compare
tests added |
@flouthoc PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
👏 |
Closes: #1712
Summary by Sourcery
Add support for moving network devices into the container network namespace based on the OCI spec
New Features:
linux.net_devices
configuration to specify host interfaces to move and rename into the containermove_network_device
andlibcrun_move_network_devices
to use netlink for transferring IP settings and activating interfaces in the target namespaceEnhancements:
netDevices
in the OCI features outputBuild:
net_device.c
andnet_device.h
in the build configuration (Makefile.am)