NM-300: Clean up stale metrics on node deletion and disconnect #3956

abhishek9686 wants to merge 61 commits into develop from …
Conversation
…to fixes/v1.5.1
…D or schema.AllNetworks (all_networks)
- Extract cleanupNodeReferences() from DeleteNode to handle best-effort cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
- DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID directly, bypassing redundant host-association updates that could fail and leave nodes orphaned
- Track failed node deletions instead of unconditionally clearing host.Nodes; return error if any nodes couldn't be deleted to prevent host removal
- Fix DeleteNode returning error even when node was successfully deleted in the host-not-found path
…path (when purge=true or alreadyDeleted=true), so it won't execute twice — once on soft-delete and again when the zombie manager hard-deletes.
… for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()
* fix(go): set persistent keep alive when registering host using sso
* fix(go): run posture check violations on delete
* fix(go): upsert node on approving pending host
* fix(go): resolve concurrency issues during group delete cleanup
* fix(go): update doc links
* fix(go): add created and updated fields to host
* fix(go): skip delete and update superadmin on sync users
* fix(go): use conn directly for now
* fix(go): remove acl for idp groups
* fix(go): quote fields
* fix(go): use filters with count
* feat(go): add a search query
* fix(go): cleanup acls
* fix(go): review fixes
* fix(go): remove additional loop
* fix(go): fix
* v1.5.1: separate out idp sync and reset signals for HA
* v1.5.1: add grps with name for logging
* v1.5.1: clear posture check violations when all checks are deleted
* v1.5.1: set static when default host
* v1.5.1: fix db status check
* rm set max conns
* v1.5.1: reset auto assigned gw when disconnected
* fix(go): skip global network admin and user groups when splitting
* v1.5.1: fix update node call from client
* fix(go): separate out migration from normal usage
* fix(go): skip default groups
* fix(go): create policies for existing groups on network create
* fix(go): skip fatal log on clickhouse conn
* fix(go): add posture check cleanup
* NM-288: populate relevant name for acl types for UI
* NM-288: populate grp names for posture check apis
* NM-228: add network grps api
* NM-288: add network users api
* now check each group's NetworkRoles for either the specific network ID or schema.AllNetworks (all_networks)
* NM-288: check and unassign auto gw when node is disconnected from cli
* NM-288: optimise network users api call
* NM-288: block auto assign when set to use inet gw
* NM-288: pass all network scoped get apis for platform users
* NM-288: fix orphan nodes on bulk host deletion and refactor node cleanup
  - Extract cleanupNodeReferences() from DeleteNode to handle best-effort cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
  - DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID directly, bypassing redundant host-association updates that could fail and leave nodes orphaned
  - Track failed node deletions instead of unconditionally clearing host.Nodes; return error if any nodes couldn't be deleted to prevent host removal
  - Fix DeleteNode returning error even when node was successfully deleted in the host-not-found path
* NM-288: cleanupNodeReferences(node) now only runs in the hard-delete path (when purge=true or alreadyDeleted=true), so it won't execute twice: once on soft-delete and again when the zombie manager hard-deletes.
* NM-288: add orphan deletion log
* DisassociateAllNodesFromHost now always returns nil, logging warnings for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()
* NM-288: move the platform user check in the middleware

---------

Co-authored-by: VishalDalwadi <dalwadivishal26@gmail.com>
Co-authored-by: Vishal Dalwadi <51291657+VishalDalwadi@users.noreply.github.com>
v1.5.1: cleanup grps from posture checks on grp updates
When a node is deleted or disconnected, peer nodes retained stale metrics (showing the node as connected with outdated latency values). This adds cleanup logic to:
- Remove the deleted node's entry from all peers' connectivity maps
- Mark the node as disconnected (Connected=false, Latency=999) in peers' metrics when a node goes offline
- Run periodic peer metrics cleanup in MetricsMonitor as a safety net for nodes that drop off without a graceful disconnect
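The intended behavior of the two new helpers can be sketched with simplified, hypothetical types (the real Netmaker metrics model and storage layer differ; this is an in-memory illustration only):

```go
package main

import "fmt"

// PeerConnectivity and Metrics are hypothetical simplifications of the
// per-peer metrics entries described in the PR.
type PeerConnectivity struct {
	Connected bool
	Latency   int64
}

type Metrics struct {
	Connectivity map[string]PeerConnectivity
}

// setPeerMetricsDisconnected marks nodeID as offline (Connected=false,
// Latency=999) in every peer's connectivity map, mirroring what
// SetPeerMetricsDisconnected does on disconnect.
func setPeerMetricsDisconnected(peers map[string]*Metrics, nodeID string) {
	for _, m := range peers {
		if conn, ok := m.Connectivity[nodeID]; ok {
			conn.Connected = false
			conn.Latency = 999 // sentinel used across the affected paths
			m.Connectivity[nodeID] = conn
		}
	}
}

// deleteNodeMetricsFromPeers removes the deleted node's entry entirely,
// mirroring DeleteNodeMetricsFromPeers on node deletion.
func deleteNodeMetricsFromPeers(peers map[string]*Metrics, nodeID string) {
	for _, m := range peers {
		delete(m.Connectivity, nodeID)
	}
}

func main() {
	peers := map[string]*Metrics{
		"peer1": {Connectivity: map[string]PeerConnectivity{
			"nodeA": {Connected: true, Latency: 12},
		}},
	}
	setPeerMetricsDisconnected(peers, "nodeA")
	fmt.Println(peers["peer1"].Connectivity["nodeA"].Latency) // 999
	deleteNodeMetricsFromPeers(peers, "nodeA")
	fmt.Println(len(peers["peer1"].Connectivity)) // 0
}
```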
Tenki Code Review - Complete

Files Reviewed: 7

The PR correctly propagates disconnected/deleted node metrics to peers and tightens the PlatformUser access-control path.
Overview
This PR introduces two new peer-metrics helpers (SetPeerMetricsDisconnected, DeleteNodeMetricsFromPeers) wired through the existing function-variable dispatch pattern, updates node disconnect flows in controllers/node.go and logic/metrics.go, and adjusts the PlatformUser access-control path in pro/logic/security.go.
Finding 1 — Synchronous call blocks MetricsMonitor ticker (logic/metrics.go)
At logic/metrics.go:58, SetPeerMetricsDisconnected is called without the go keyword inside the metrics monitor's ticker goroutine. The function is O(N²): it fetches all nodes, then for each peer fetches and rewrites its metrics row. Calling it synchronously stalls the entire tick loop for every disconnected node found, potentially causing the monitor to fall far behind its configured interval. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly dispatch it with go.
Finding 2 — Duplicate posture-check cleanup on group update (pro/controllers/users.go)
The new goroutine at pro/controllers/users.go:650–658 iterates removed networks and calls proLogic.RemoveUserGroupFromPostureChecks for each. This is identical logic to what EnsureDefaultUserGroupNetworkPolicies already does at pro/logic/user_mgmt.go:1061–1065 — the same function that is already dispatched on line 647. The result is two concurrent calls to RemoveUserGroupFromPostureChecks per removed network on every group update.
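The overlapping diff both call-sites perform can be sketched as a standalone helper; the map shape here is a hypothetical simplification of what GetGroupNetworksMap returns:

```go
package main

import (
	"fmt"
	"sort"
)

// removedNetworks returns the network IDs present in the old group's
// network map but absent from the updated one: the same diff that both
// the controller goroutine and EnsureDefaultUserGroupNetworkPolicies
// compute before calling RemoveUserGroupFromPostureChecks.
func removedNetworks(old, updated map[string]bool) []string {
	var out []string
	for netID := range old {
		if !updated[netID] {
			out = append(out, netID)
		}
	}
	sort.Strings(out) // deterministic order for logging and tests
	return out
}

func main() {
	old := map[string]bool{"net1": true, "net2": true, "net3": true}
	updated := map[string]bool{"net2": true}
	fmt.Println(removedNetworks(old, updated)) // [net1 net3]
}
```

Because both paths compute this same diff and act on it, every removed network gets two concurrent cleanup calls, which is the duplication the finding recommends collapsing into one place.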
Other Changes (no issues)
- The `connectivity.Latency = 999` sentinel on disconnect is consistent across all affected paths.
- `pro/logic/security.go`: splitting the `PlatformUser` GET-allow out of the combined `Auditor || PlatformUser` block and placing it after the `netID` check is logically correct; `PlatformUser` GET requests now correctly require a non-empty `netID`, matching the intent of the surrounding checks.
- `pro/initialize.go`: the new function-variable assignments are correct and follow the existing pattern.
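The effect of reordering the checks in security.go can be illustrated with a deliberately simplified, hypothetical version of the allow logic (the real middleware has many more roles and conditions):

```go
package main

import "fmt"

// allowRead sketches the check order after the change: the Auditor
// GET-allow stays ahead of the netID check, while the PlatformUser
// GET-allow sits after it, so platform users can only read
// network-scoped resources when a netID is present.
func allowRead(role, netID string) bool {
	if role == "auditor" {
		return true // auditors may read everything
	}
	if netID == "" {
		return false // every other role needs a network scope
	}
	if role == "platform-user" {
		return true
	}
	return false
}

func main() {
	fmt.Println(allowRead("platform-user", ""))     // false: no netID
	fmt.Println(allowRead("platform-user", "net1")) // true
	fmt.Println(allowRead("auditor", ""))           // true
}
```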
```go
		_ = UpdateMetrics(node.ID.String(), nodeMetrics)
	}
	SetPeerMetricsDisconnected(node.ID.String())
```
🟠 SetPeerMetricsDisconnected called synchronously in MetricsMonitor loop (bug)
In logic/metrics.go line 58, SetPeerMetricsDisconnected(node.ID.String()) is called without the go keyword inside the metrics monitor's tick goroutine. The function iterates all nodes, fetches each peer's metrics row, and writes them back — an O(N²) database operation that will block the ticker from firing again until it finishes for every disconnected node in the current sweep. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly use go logic.SetPeerMetricsDisconnected(...). The omission here will cause the monitor to fall behind its configured interval in proportion to the number of disconnected nodes and total peer count.
💡 Suggestion: Add the go keyword to match all other call-sites and prevent blocking the ticker loop.
```diff
-SetPeerMetricsDisconnected(node.ID.String())
+go SetPeerMetricsDisconnected(node.ID.String())
```
📋 Prompt for AI Agents
In logic/metrics.go at line 58, change SetPeerMetricsDisconnected(node.ID.String()) to go SetPeerMetricsDisconnected(node.ID.String()). The function is an O(N²) DB operation (reads and writes every peer's metrics row); calling it synchronously blocks the MetricsMonitor ticker goroutine and causes it to fall behind its configured interval. All other call-sites (controllers/node.go lines 706 and 894) already use go for this call.
```go
go func() {
	removedNetworks, _ := proLogic.GetGroupNetworksMap(&currUserG)
	keptNetworks, _ := proLogic.GetGroupNetworksMap(&userGroup)
	for netID := range removedNetworks {
		if _, ok := keptNetworks[netID]; !ok {
			proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID)
		}
	}
}()
```
🟠 RemoveUserGroupFromPostureChecks called twice per removed network on group update (bug)
The new goroutine added at pro/controllers/users.go:650–658 computes a removedNetworks vs keptNetworks diff and calls proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID) for each network that was removed. However, EnsureDefaultUserGroupNetworkPolicies — already dispatched on line 647 — performs the identical diff at pro/logic/user_mgmt.go:1061–1065 and calls RemoveUserGroupFromPostureChecks for those same networks. Every group update that removes a network now triggers two concurrent calls to RemoveUserGroupFromPostureChecks for each removed network. While the operation is currently idempotent (it only deletes), this is duplicated logic that will silently double-execute if the function ever gains side-effects.
💡 Suggestion: Remove the new goroutine entirely. EnsureDefaultUserGroupNetworkPolicies already handles posture-check cleanup for removed networks. If the intent was a standalone cleanup path, remove the equivalent code from EnsureDefaultUserGroupNetworkPolicies instead to avoid the duplication.
📋 Prompt for AI Agents
In pro/controllers/users.go, remove the goroutine at lines 650-658 (the go func() { ... }() block that calls proLogic.RemoveUserGroupFromPostureChecks). This logic is already executed inside proLogic.EnsureDefaultUserGroupNetworkPolicies (called on line 647): see pro/logic/user_mgmt.go lines 1061-1065, which perform the same removed-network diff and call RemoveUserGroupFromPostureChecks for each. Keeping both causes two concurrent removals per affected network on every group update.
Describe your changes
Provide Issue ticket number if applicable/not in title
Provide testing steps
Checklist before requesting a review