FelixConfiguration and Felix changes for live migration#12038
nelljerram wants to merge 17 commits into projectcalico:master from
Conversation
- Add FelixConfiguration fields for route priority values.
Fix four bugs in LiveMigrationCalculator:

1. refDirectName used the old name instead of the new one when a LiveMigration update changed its source/target name, so the new name was never tracked and the old name was re-referenced.
2. unrefSelector unreferenced the new selector instead of the old one when a LiveMigration's selector changed, leaving the old selector leaked and removing the new one prematurely.
3. OnComputedSelectorMatch/Stopped did not verify that the callback's selector was one the LMC had registered, so unrelated ARC selector callbacks could incorrectly set targetSelectorMatching on WEPs.
4. AddSet was called with a nil set (from a missing map entry) when a WEP arrived with no matching LiveMigration for one of source/target, causing a nil pointer dereference.

Add 19 unit tests covering: direct name role assignment (both orderings), LM deletion, target-wins-over-source priority, LM updates changing names/selectors, selector-based destination via real ARC integration, selector ref-counting, WEP lifecycle, ignored key types, nil fields, transient overlaps, and selector/direct-name interaction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… bug

Add three calc graph FV test states covering LiveMigration resources:
- Local WEP as migration source (by direct name)
- Local WEP as migration target (by direct name)
- Local WEP as migration target (by selector)

These test that the live_migration_role field is correctly set on proto.WorkloadEndpoint messages emitted by the calc graph.

Fix a dispatcher ordering bug in LiveMigrationCalculator: the ARC's handler runs before the LMC's on localEndpointDispatcher, so when a WEP update triggers computed selector matching, the OnComputedSelectorMatch callback fires before the LMC has tracked the WEP. Add pendingSelectorMatches to buffer these early matches and apply them when the WEP is subsequently processed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FSM fixes (live_migration.go):
- Rename Done state to TimeWait to match design
- Fix transitions: Target+NoRole and Live+NoRole now go to TimeWait (not Live); TimerPop now only in TimeWait state
- Add missing transitions for Source and Deleted inputs in Target, Live, and TimeWait states
- Fix FSM not being stored in monitor map after creation
- Clean up FSM from map when it returns to Base state
- Remove unused variable capture in WorkloadEndpointRemove

Communication path (new):
- Add liveMigrationStateUpdate pseudo-proto message type
- FSM emits state changes via monitor's pendingUpdates buffer
- InternalDataplane creates/registers the monitor and drains pending updates after each calc graph message fanout
- EndpointManager tracks per-endpoint live migration FSM state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Target state: suppress routes (workload hasn't migrated yet)
- Live/TimeWait state: use elevatedRoutePriority
- No live migration state: use normalRoutePriority (unchanged)
- Re-add endpoint to pendingWlEpUpdates on FSM state change so CompleteDeferredWork re-evaluates routes
- Clean up pendingLiveMigrationStates on workload removal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover the full lifecycle: normal priority -> Target (routes suppressed) -> Live (elevated priority) -> TimeWait (elevated priority) -> Base (normal priority reverted). Also tests WEP removal cleanup and state arriving before endpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exhaustive FSM transition table (24 cases covering all state/input combinations), monitor OnUpdate routing tests, FSM lifecycle management tests, and multi-step scenario tests covering the full live migration lifecycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e config Add a configurable timer that fires after live migration completes (TimeWait state), reverting routing priority back to normal after routes have had time to converge across the cluster. Default convergence time is 30 seconds. The timer callback fires on a separate goroutine and delivers the workload ID to a channel, which the main dataplane select loop receives and processes synchronously, following the same pattern as ifaceUpdates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect gratuitous ARP (sender IP == target IP) and RARP (EtherType 0x8035) packets on workload interfaces using AF_PACKET via pcapgo. When a GARP/RARP is detected, the per-workload FSM transitions from Target to Live, signalling that the VM is now active on this host after live migration.

- Add garpHandle interface and pcapgo-based factory for testability
- Track interface names from WorkloadEndpoint updates
- Launch one-shot detectGARP goroutine per workload entering Target
- Close pcap handle on FSM exit from Target (goroutine exits cleanly)
- Wire garpC channel into InternalDataplane select loop
- Extract dispatchLiveMigrationUpdates helper to deduplicate 3 call sites
- Add packet matching tests, channel delivery tests, and lifecycle tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a separate nftables table in the arp family that drops ARP reply packets from the host containing a workload's own IP address. This prevents proxy ARP from responding to a workload's ARP requests for its own IP, which confuses the guest and adds latency during live migration when the guest sends gratuitous ARP. The ARP table is always created regardless of dataplane mode (iptables/nftables/eBPF) since it uses its own nftables table independent of the main dataplane. Per-workload chains contain drop rules for each IPv4 address, dispatched via a verdict map. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR adds live migration support for KubeVirt VMs to Felix, enabling seamless VM migration between hosts with minimal network disruption. It introduces new calc-graph infrastructure (LiveMigrationCalculator) and a per-workload FSM (liveMigrationMonitor) in the dataplane that tracks migration source/target roles, adjusts route priorities during migration, and suppresses proxy ARP for workload IPs.
Changes:
- New `LiveMigrationCalculator` in the calc graph processes `LiveMigration` API resources and annotates `WorkloadEndpoint` proto messages with `live_migration_role` (source/target) using both direct name references and label-selector-based matching.
- New `liveMigrationMonitor` in the Linux dataplane implements a per-workload FSM (Base→Target→Live→TimeWait→Base) that adjusts route priorities, detects migration completion via GARP/RARP packet capture, and suppresses proxy ARP via a new nftables ARP family table.
- New Felix configuration parameters (`IPv4/IPv6NormalRoutePriority`, `IPv4/IPv6ElevatedRoutePriority`, `LiveMigrationRouteConvergenceTime`) exposed through `FelixConfigurationSpec`, config params, generated CRDs, and all manifests.
Reviewed changes
Copilot reviewed 46 out of 46 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `felix/calc/live_migration_calculator.go` | New calculator tracking WEP↔LiveMigration correlations, emitting role updates |
| `felix/calc/live_migration_calculator_test.go` | Comprehensive unit tests for the calculator |
| `felix/calc/calc_graph.go` | Hooks up LiveMigrationCalculator to the calculation graph |
| `felix/calc/states_for_test.go` / `state_test.go` / `calc_graph_fv_test.go` | Calc graph FV test states and verification for live migration roles |
| `felix/dataplane/linux/live_migration.go` | Per-workload FSM, GARP detection, timer management |
| `felix/dataplane/linux/live_migration_test.go` | FSM transition table and scenario tests |
| `felix/dataplane/linux/endpoint_mgr.go` | Route priority adjustment, ARP chain programming, refactored config struct |
| `felix/dataplane/linux/endpoint_mgr_test.go` / `endpoint_mgr_arp_test.go` | Tests for route priority and ARP chains |
| `felix/dataplane/linux/int_dataplane.go` | Wires up liveMigrationMonitor, ARP table, live migration channels |
| `felix/nftables/table.go` | New NewARPTable() for ARP family nftables table |
| `felix/nftables/match_builder.go` / `generictables/match_builder.go` / `iptables/match_builder.go` | ARPOperation/ARPSrcIP match methods |
| `felix/rules/rule_defs.go` | New ARP chain name constants |
| `felix/proto/felixbackend.proto` / `.pb.go` | New LiveMigrationRole enum and WorkloadEndpoint field |
| `felix/config/config_params.go` / `felix/dataplane/driver.go` | New config params wiring |
| `api/pkg/apis/projectcalico/v3/felixconfig.go` | New FelixConfigurationSpec fields |
| `api/config/crd/...` / `manifests/...` / `libcalico-go/config/crd/...` | Generated CRD and manifest updates |
| `libcalico-go/lib/backend/model/workloadendpoint.go` | New GetNameAndNamespace() utility method |
| `libcalico-go/lib/validator/v3/validator_test.go` | Validation tests for new route priority fields |
| `felix/dataplane/mock/mock_dataplane.go` | Mock tracking of live migration roles |
| `felix/dataplane/linux/qos_controls.go` | References updated to use m.cfg.bpfEnabled |
| `felix/collector/types/endpoint/endpoint.go` | Uses new GetNameAndNamespace() method |
| `felix/docs/config-params.md` / `.json` | Documentation for new config params |
```go
func detectGARP(logCtx *logrus.Entry, id types.WorkloadEndpointID,
	handle garpHandle, garpC chan<- types.WorkloadEndpointID) {
	defer handle.Close()
```
In detectGARP, the goroutine calls defer handle.Close() at the top of the function. However, stopGARPDetection() also calls f.pcapHandle.Close() when the FSM transitions out of GARP detection (e.g., Target→Live, Target→Base). This means Close() gets called twice on the handle — once from stopGARPDetection() and once from the deferred handle.Close() in the goroutine. The fakeGARPHandle.Close() is idempotent and protects against this, but a real *pcapgo.EthernetHandle may not be. Verify that pcapgo.EthernetHandle.Close() is safe to call multiple times, or remove the defer handle.Close() from detectGARP since stopGARPDetection already handles closure.
You're right, it's not safe; I thought I had looked at this before and concluded it was OK, but it's clearer anyway only to close the handle in one place in stopGARPDetection(). Will fix this.
```go
func (f *liveMigrationFSM) stopElevatedRoutingTimer() {
	if f.timer != nil {
		f.logCtx.Debug("Stopping elevated routing timer")
		f.timer.Stop()
		f.timer = nil
	}
}
```
The stopElevatedRoutingTimer() only calls f.timer.Stop() but does not drain any pending delivery from the timer channel. If the timer fires (sends to timerC) just before Stop() is called, the timer ID may still be delivered to timerC and subsequently processed by onLiveMigrationTimerPop. This means the FSM could receive a spurious TimerPop input after the timer has nominally been stopped (e.g., in the TimeWait+Source→Base transition). Consider draining the potential stale value from timerC after stopping, or checking whether the FSM is still in TimeWait state when processing the TimerPop event to make the handler idempotent. Looking at the FSM table, TimeWait+Source→Base stops the timer, but if TimerPop arrives afterward for a now-Base FSM, it will be a no-op (Base+TimerPop→Base). However if the FSM has already been cleaned up from the fsms map, executeFSM will create a new FSM at Base and drive it with TimerPop, which is harmless but noisy.
Agreed. Will fix in OnTimerPop by taking care not to recreate the FSM.
```go
// WorkloadID is expected to be in the format "name" or "namespace/name".
func (key WorkloadEndpointKey) GetNameAndNamespace() (string, string) {
	parts := strings.SplitN(key.WorkloadID, "/", 2)
	if len(parts) == 2 {
		return parts[1], parts[0]
	}
	return parts[0], ""
}
```
The GetNamespace() function's documentation comment was updated, but the comment for GetNameAndNamespace() is very terse. The previous GetNamespace had a detailed comment explaining the WorkloadID format. The new GetNameAndNamespace function only has a brief one-liner that doesn't explain the return value ordering. Specifically, the comment should clarify that the function returns (name, namespace) in that order, similar to how the old GetNamespace comment explained the "namespace/name" format.
```go
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		fsm, m := testFSM(tt.from)
		fsm.handleInput(tt.input)

		g.Expect(fsm.currentState).To(Equal(tt.wantState), "FSM state")
		if tt.wantState != tt.from {
			g.Expect(m.pendingUpdates).To(HaveLen(1), "should emit one update")
			g.Expect(m.pendingUpdates[0].State).To(Equal(tt.wantState), "emitted state")
		} else {
			g.Expect(m.pendingUpdates).To(BeEmpty(), "should not emit any updates")
		}
	})
}
```
In TestFSMTransitionTable, the outer g := NewWithT(t) is shared across all subtests. The g.Expect calls inside each t.Run use the parent test's g, which means a failure in one subtest will report against the outer t rather than the specific subtest's t. This can make failures harder to diagnose. Each subtest should create its own g := NewWithT(t) inside the t.Run closure.
```go
const (
	SOURCE sourceOrTarget = iota
	TARGET
	SOURCE_OR_TARGET
)
```
The SOURCE, TARGET, and SOURCE_OR_TARGET constants are exported (capitalized). These are internal implementation details of the LiveMigrationCalculator in the calc package, and exporting them leaks internal concepts to consumers of the package. Consider making them unexported (source, target, sourceOrTarget enum values).
```go
func (m *liveMigrationMonitor) OnUpdate(protoBufMsg any) {
	switch msg := protoBufMsg.(type) {
	case *proto.WorkloadEndpointUpdate:
		id := types.ProtoToWorkloadEndpointID(msg.GetId())
		m.ifaceNames[id] = msg.Endpoint.Name
		oldRole := m.roles[id]
		newRole := msg.Endpoint.LiveMigrationRole
		m.roles[id] = newRole
		if oldRole != newRole {
			switch newRole {
			case proto.LiveMigrationRole_NO_ROLE:
				m.executeFSM(id, liveMigrationInputNoRole)
			case proto.LiveMigrationRole_SOURCE:
				m.executeFSM(id, liveMigrationInputSource)
			case proto.LiveMigrationRole_TARGET:
				m.executeFSM(id, liveMigrationInputTarget)
			}
		}
	case *proto.WorkloadEndpointRemove:
		id := types.ProtoToWorkloadEndpointID(msg.GetId())
		delete(m.roles, id)
		delete(m.ifaceNames, id)
		m.executeFSM(id, liveMigrationInputDeleted)
	}
}
```
The liveMigrationMonitor is registered as a Manager via dp.RegisterManager(dp.liveMigrationMonitor), meaning it receives all dataplane messages via OnUpdate. However, OnUpdate only handles *proto.WorkloadEndpointUpdate and *proto.WorkloadEndpointRemove — other message types are silently ignored (no default case, no logging). This is expected behavior, but means the liveMigrationMonitor is unnecessarily receiving and discarding a large volume of unrelated messages. Since the monitor only needs WEP updates, consider using a more targeted registration approach (similar to how other calculators register with specific dispatchers in the calc graph), rather than the manager pattern that routes all messages to it.
I think Copilot here is confusing the typical pattern in the dataplane (which is for all managers to see everything) with the typical pattern in the calc graph (which is for calculators to register for the types they want to process).
```go
type LiveMigrationRole struct {
	role proto.LiveMigrationRole
}

func (lmr *LiveMigrationRole) ApplyTo(wep *proto.WorkloadEndpoint) {
	wep.LiveMigrationRole = lmr.role
}
```
The LiveMigrationRole struct in felix/calc/live_migration_calculator.go has an unexported role field. The struct itself is exported, making it usable by external packages that receive it via the EndpointComputedDataUpdater callback. However, since the role field is unexported, external code cannot read the value of the role from the struct without using type assertion to access ApplyTo. This is an inconsistency: the type is exported but its data is not accessible. Either export the field or keep the whole struct unexported (since it's only used as an EndpointComputedData interface value).
```go
	options TableOptions,
	required bool,
) *NftablesTable {
	ourChainsRegexp := regexp.MustCompile("^(cali|filter)-.*")
```
In NewARPTable, the ourChainsRegexp is defined as "^(cali|filter)-.*". This regex is intended to match chains owned by Calico in the ARP table. The base chain is named "filter-OUTPUT" and workload chains use the WorkloadARPPfx prefix which is "cali-arp-". Both are correctly matched by this regex. However, the dispatch chain is named ChainARPDispatch = "cali-arp-dispatch" (also correctly matched). This looks correct, but the "filter" prefix is included specifically for the "filter-OUTPUT" base chain. This is a subtle coupling — if the base chain naming convention changes, this regex must be updated too. A comment explaining why "filter" is included in the regex would help maintainability.
I think this comes from porting over from iptables; the unprefixed chains are the top-level hook chains.
Will add comment. Also spotted a nearby typo.
```go
for _, mgr := range d.allManagers {
	mgr.OnUpdate(msg)
}
// The live migration monitor may have accumulated FSM state changes from the
// message we just fanned out. Drain them and fan out to all managers as
// pseudo-proto messages.
d.dispatchLiveMigrationUpdates()
```
The liveMigrationMonitor implements the Manager interface (via OnUpdate and CompleteDeferredWork), but the live migration state updates it generates are dispatched differently from normal manager flow. After processMsgFromCalcGraph fans out a message to all managers including the liveMigrationMonitor, it immediately drains liveMigrationMonitor.PendingUpdates() and fans those updates out to ALL managers again (including the liveMigrationMonitor itself). This means the liveMigrationMonitor.OnUpdate will receive its own emitted *liveMigrationStateUpdate messages — the switch in OnUpdate doesn't handle this type, so they'll be silently ignored. This is safe but indicates the liveMigrationMonitor shouldn't be in allManagers if it generates its own pseudo-messages that are separately fanned out. This is a design inconsistency that could cause confusion.
Hm, on balance, Claude and I think it's OK for LiveMigrationMonitor to remain as a manager. Claude's reasoning, which I agree with:
I agree it's fine. The alternative — making it not a manager and special-casing its WEP message delivery — would actually be more confusing and break the pattern that all WEP consumers are managers. Silently ignoring unhandled message types is exactly what every other manager does. The fact that its outputs are fanned out separately is a pragmatic choice: the monitor needs to process WEP updates first, then its derived state updates need to reach the endpoint manager in the same apply cycle. That ordering constraint is the real reason for the two-phase dispatch, not a design inconsistency.
```go
WorkloadARPPfx         = ChainNamePrefix + "arp-"
NftablesARPDispatchMap = ChainNamePrefix + "arp-dispatch"
ChainARPDispatch       = ChainNamePrefix + "arp-dispatch"
```
NftablesARPDispatchMap and ChainARPDispatch are defined as separate constants but have identical values ("cali-arp-dispatch"). These two constants are redundant — one of them should be removed to avoid confusion about which one to use.
Arguable, but we have exactly the same in existing code here:

```go
ChainToWorkloadDispatch       = ChainNamePrefix + "to-wl-dispatch"
NftablesToWorkloadDispatchMap = ChainNamePrefix + "to-wl-dispatch"
```

I'd rather follow that existing convention.
The iptables cleanup test asserted that no nftables table name contained "cali" after switching from nftables to iptables mode. This fails now that the calico-arp table (arp family) is always created for ARP suppression. Narrow the assertion to check specifically for the stale "table ip calico" being cleaned up. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- Adds live migration support for KubeVirt VMs to Felix, enabling VM migration between hosts with minimal network disruption
- New calc graph component (`LiveMigrationCalculator`) computes source/target roles for workload endpoints
- New dataplane component (`liveMigrationMonitor`) adjusts route priorities during migration and detects migration completion via GARP/RARP
- ARP suppression prevents the host from answering a VM's ARP probes for its own address
Details

Calc graph: LiveMigrationCalculator

- `LiveMigrationCalculator` processes LiveMigration resources and annotates `WorkloadEndpoint` proto messages with `live_migration_role` (source/target)

Dataplane: live migration FSM and route priority

- `liveMigrationMonitor` manages per-workload FSMs that track migration state
- Elevated route priority is held for a configurable convergence time (`LiveMigrationRouteConvergenceTime`, default 30s) before reverting to normal

Dataplane: ARP suppression

- New nftables table in the `arp` family (`calico-arp`) with per-workload chains

Configuration

- `*RoutePriority` fields (int) — priority metric values for normal and elevated priority routing
- `LiveMigrationRouteConvergenceTime` (duration) — time to hold elevated route priority after migration completes
Test plan
LiveMigrationCalculator(bug fixes, selector matching, lifecycle)liveMigrationMonitorFSM (24-case transition table, lifecycle, multi-stepscenarios)
ARPOperation,ARPSrcIP)NewARPTable(base chains, family, rule insertion)🤖 Generated with Claude Code
Release note: