[vpp] Fix LAG subinterface hwif name resolution#1883
Conversation
* Resolve PortChannel parent to BondEthernetX directly instead of calling tap_to_hwif_name() which only maps physical Ethernet interfaces * Apply same fix to the remove path for symmetry Signed-off-by: Bojun-Feng <bojundf@gmail.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
* Allow LAG branch to fall through to vlan_id append logic instead of returning early without the subinterface suffix Signed-off-by: Bojun-Feng <bojundf@gmail.com>
81f62f4 to
8408572
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
Retrying failed(or canceled) stages in build 1106406: ✅Stage Build:
|
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
Retrying failed(or canceled) stages in build 1106406: ✅Stage Build:
|
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
Retrying failed(or canceled) stages in build 1106406: ✅Stage Test:
✅Stage BuildTrixie:
|
|
Note to self: Luckily we have the option to only rerun failed tests, else it would have taken a large number of runs to get both tests passing on all builds simutaneously. Need to open an issue (or find existing issue) to track this. |
|
Hi @cyw233, CI is all green. Would appreciate your review when you have time, thanks! Additionally, I have not yet verified the fix by enabling the sonic-mgmt I can try to set up the environment on my end and run the test but that would take a few days. Would it be possible for you to help verify it? |
@Bojun-Feng , great to see the PR checkers are passing! We are currently having some internal limitations/restrictions on setting up the environment and still working a solution... Maybe let's both set up our environment and run the test in parallel. Will keep you updated here |
|
Hi @Bojun-Feng It seems that your patch can fix the issue! Today I applied your commits and used the new image for a test. It seems that the LAG sub-interface can be displayed via vppctl. Following is the test steps. (BondEthernet1.10 will not be there when I test the image without your patch) |
|
Thanks for the update @lunyue-ms! Looking forward to the CI testbed results next week. Let me know if anything comes up during the t1-lag-vpp topology run, happy to help debug if needed. |
|
Hi @Bojun-Feng I setup my local test environment and did some tests. The VLAN related sub interface tests were failed. I'm trying to figure out what's the possible reason now and will update the ticket after. |
Hi @lunyue-ms, thanks for sharing! I wanted to provide some updates — I did some research on the log to look into why the Seems like the test expects ICMP replies from the DUT's sub-interfaces (e.g., Looking at the code, the root cause is in sonic-sairedis/vslib/vpp/SwitchVppRoute.cpp Lines 148 to 151 in 88bc51a This was commented out in the upstream I believe the the original There are several directions out of this:
A targeted hot-fix would be option (1) scoped to only I think this is out of scope for this PR (which addresses LAG hwif name resolution) and would also live in a different repository. Happy to look into a follow-up hot-fix PR for this if it would be helpful. |
|
Hi @Bojun-Feng , Btw, I ran three VPP L3 sub-interface tests on the DUT (vlab-vpp-01) with tap1.20 (172.16.20.1/30) and tap2.30(172.16.30.1/30), and set up ns1/ns2 as traffic endpoints (172.16.20.2/30, 172.16.30.2/30).
Additional observation: the same test logic was validated on a separate non-SONiC Ubuntu machine, and it passed there. This indicates the test method is sound, and the failure appears specific to the SONiC-VPP DUT environment (gateway/local interface reply path). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
30d9e25 to
b0d2aad
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @lunyue-ms, I implemented the hot-fix for the sub-interface IP programming issue we discussed at #1895. Since it touches a different file (SwitchVppRoute.cpp) and addresses a separate bug, I opened it as a new PR. Also happy to consolidate into this PR if you'd prefer them together. I also added a small refactor commit on the current PR, adding a helper to deduplicate the LAG→BondEthernet name resolution that was copy-pasted in 4 places. Seems like the current CI testing server is out of memory, which resulted in tests passing but log collection failing. See the Azure Pipeline page (Snapshot)for more details. Would appreciate if you can help verify on the testbed when you get a chance. Thanks! |
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
Retrying failed(or canceled) stages in build 1117403: ✅Stage TestAsan:
✅Stage Test:
|
|
Hi @Bojun-Feng , Thanks for your update. I was debugging tests related to sub_port_interfaces with your previous changes in this PR. Currently, some tests got passed but somes are still failed. Passed tests: Failed tests: I'm still debugging on this and didn't submit PRs so far. I think that you can go ahead with the changes in this PR. Tomorrow I will update use the latest changeset in this PR. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
03fdae1 to
4821073
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
Retrying failed(or canceled) stages in build 1120321: ✅Stage BuildSwss:
|
|
Seems like an unrelated linker error in SWSS @lunyue-ms I have pushed some additional commits that should help with the For the Please let me know if the new changes helped, thanks! Note: The previous subport removal implementation called The motivation was to avoid If this approach still doesn't pass the test, the commit should be reverted. If the test is passing, then the proper long-term fix would be |
|
Hi @Bojun-Feng , quick status: I opened two PRs for this work, both currently open:
Full
On your SVI bits ( |
fb95b84 to
582a8af
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @lunyue-ms, thanks for the detailed update and for opening #1907! Great to see the test results improving. I've reset this PR to the scoped LAG hwif fix only (commits 1-2). The later commits (sub-port IP programming, admin-down preservation, etc.) are superseded by your approach, so I've dropped them to avoid conflicts. Since this PR is now a clean, self-contained fix for the LAG hwif name resolution bug (which #1907 also depends on), would it make sense to merge this one first? That way #1907 can rebase cleanly on top and we avoid duplicate commits. Happy to coordinate on timing. For the SVI work, I'm happy to pick that up as a follow-up once #1907 lands. Thanks for the offer to collaborate on it! @cyw233 The PR is now in its final form, just the two LAG hwif fixes you helped triage in sonic-net/sonic-platform-vpp#227. Would appreciate your review when you have a moment. |
|
Hi @Bojun-Feng, thanks for the update. I'm happy to wait for thie PR be merged and I can rebase #1907 then. |
Signed-off-by: Bojun-Feng <bojundf@gmail.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description of PR
LAG subinterfaces (PortChannelX.vlan) are non-functional on SONiC-VPP due to two related bugs:
The RIF SUB_PORT create/delete paths call
create_sub_interface(tap_to_hwif_name(dev), vlan_id, vlan_id)wheredevis "PortChannelX". Sincesonic_vpp_ifmap.inionly maps physical Ethernet interfaces,tap_to_hwif_name("PortChannelX")returns "Unknown" and the VPP subinterface is never created on the correct BondEthernet parent.vpp_get_hwif_name()returns early for LAG objects without appending the.vlan_idsuffix, causingvpp_set_interface_mtuandvpp_set_interface_stateto target the parent BondEthernet instead of the subinterface.Changes made:
ot == SAI_OBJECT_TYPE_LAGand construct the VPP hwif name asBondEthernetXfrombond_info.idinstead of callingtap_to_hwif_name()vpp_get_hwif_name()to let the LAG branch fall through to the vlan_id append logic so MTU/state operations target the correct subinterfaceSummary:
Fixes sonic-net/sonic-platform-vpp#227
Type of change
Approach
What is the motivation for this PR?
LAG subinterfaces are broken on VPP platform. The
test_packet_routed_with_valid_vlan[port_in_lag]test fails because the VPP subinterface is never created on the correct BondEthernet parent. Additionally, post-creation operations (MTU, admin state) would target the wrong interface due tovpp_get_hwif_name()returning the parent bond name without the.vlan_idsuffix.Work item tracking
How did you do it?
Applied the same LAG-aware hwif resolution pattern already used in the VRF assignment section and in SwitchVppFdb.cpp:
In the SUB_PORT create/delete paths, check
ot == SAI_OBJECT_TYPE_LAGand construct"BondEthernetX"directly frombond_info.id(already populated viaget_lag_bond_info()) instead of callingtap_to_hwif_name().In
vpp_get_hwif_name(), restructured the LAG branch to sethwifnameand fall through to the sharedvlan_idappend logic (theif (vlan_id) { snprintf(..., "%s.%u", ...) }block) rather than returning early with just the parent name.How did you verify/test it?
Todo: Run
test_packet_routed_with_valid_vlan[port_in_lag]in the t1-lag-vpp topology (Platform: kvm, ASIC type: vpp).Any platform specific information?
VPP platform only. Physical port subinterfaces and non-subinterface LAG operations are unaffected (the
elsebranches preserve original behavior).Documentation
N/A