NM-300: Clean up stale metrics on node deletion and disconnect #3956

Open

abhishek9686 wants to merge 61 commits into develop from NM-300

Conversation

@abhishek9686
Member

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

  • My changes affect only 10 files or less.
  • I have performed a self-review of my code and tested it.
  • If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
  • If it is a bugfix, my code is <= 200 lines.
  • My functions are <= 80 lines.
  • I have had my code reviewed by a peer.
  • My unit tests pass locally.
  • Netmaker is awesome.

VishalDalwadi and others added 30 commits March 19, 2026 14:02
VishalDalwadi and others added 28 commits March 27, 2026 12:52
- Extract cleanupNodeReferences() from DeleteNode to handle best-effort
  cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
- DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID
  directly, bypassing redundant host-association updates that could fail and
  leave nodes orphaned
- Track failed node deletions instead of unconditionally clearing host.Nodes;
  return error if any nodes couldn't be deleted to prevent host removal
- Fix DeleteNode returning error even when node was successfully deleted in
  the host-not-found path
…path (when purge=true or alreadyDeleted=true), so it won't execute twice — once on soft-delete and again when the zombie manager hard-deletes.
… for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()
* fix(go): set persistent keep alive when registering host using sso;

* fix(go): run posture check violations on delete;

* fix(go): upsert node on approving pending host;

* fix(go): resolve concurrency issues during group delete cleanup;

* fix(go): update doc links;

* fix(go): add created and updated fields to host;

* fix(go): skip delete and update superadmin on sync users;

* fix(go): use conn directly for now;

* fix(go): remove acl for idp groups;

* fix(go): quote fields;

* fix(go): use filters with count;

* feat(go): add a search query;

* fix(go): cleanup acls;

* fix(go): review fixes;

* fix(go): remove additional loop;

* fix(go): fix

* v1.5.1: separate out idp sync and reset signals for HA

* v1.5.1: add grps with name for logging

* v1.5.1: clear posture check violations when all checks are deleted

* v1.5.1: set static when default host

* v1.5.1: fix db status check

* rm set max conns

* v1.5.1: reset auto assigned gw when disconnected

* fix(go): skip global network admin and user groups when splitting;

* v1.5.1: fix update node call from client

* fix(go): separate out migration from normal usage;

* fix(go): skip default groups;

* fix(go): create policies for existing groups on network create;

* fix(go): skip fatal log on clickhouse conn;

* fix(go): add posture check cleanup;

* NM-288: populate relevant name for acl types for UI

* NM-288: populate grp names for posture check apis

* NM-228: add network grps api

* NM-288: add network users api

* now check each group's NetworkRoles for either the specific network ID or schema.AllNetworks (all_networks)

* NM-288: check and unassign auto gw when node is disconnected from cli

* NM-288: optimise network users api call

* NM-288: block auto assign when set to use inet gw

* NM-288: pass all network scoped get apis for platform users

* NM-288: fix orphan nodes on bulk host deletion and refactor node cleanup

- Extract cleanupNodeReferences() from DeleteNode to handle best-effort
  cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
- DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID
  directly, bypassing redundant host-association updates that could fail and
  leave nodes orphaned
- Track failed node deletions instead of unconditionally clearing host.Nodes;
  return error if any nodes couldn't be deleted to prevent host removal
- Fix DeleteNode returning error even when node was successfully deleted in
  the host-not-found path

* NM-288: cleanupNodeReferences(node) now only runs in the hard-delete path (when purge=true or alreadyDeleted=true), so it won't execute twice — once on soft-delete and again when the zombie manager hard-deletes.

* NM-288: add orphan deletion log

* DisassociateAllNodesFromHost now always returns nil, logging warnings for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()

* NM-288: move the platform user check in the middleware

---------

Co-authored-by: VishalDalwadi <dalwadivishal26@gmail.com>
Co-authored-by: Vishal Dalwadi <51291657+VishalDalwadi@users.noreply.github.com>
v1.5.1: cleanup grps from posture checks on grp updates
When a node is deleted or disconnected, its peers retain stale metrics
(the node still shows as connected, with outdated latency values).
This adds cleanup logic to:
- Remove the deleted node's entry from all peers' connectivity maps
- Mark the node as disconnected (Connected=false, Latency=999) in
  peers' metrics when a node goes offline
- Run periodic peer metrics cleanup in MetricsMonitor as a safety net
  for nodes that drop off without a graceful disconnect
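
Below is a minimal, self-contained sketch of the cleanup described above. It only illustrates the idea behind this PR's SetPeerMetricsDisconnected and DeleteNodeMetricsFromPeers helpers; the types, the in-memory store, and the function bodies are simplified stand-ins for illustration, not the actual Netmaker implementation.

package main

import "fmt"

// Metric is a stand-in for a single peer-connectivity entry.
type Metric struct {
    Connected bool
    Latency   int64
}

// Metrics is a stand-in for one node's metrics row, keyed by peer node ID.
type Metrics struct {
    Connectivity map[string]Metric
}

// metricsStore simulates per-node metrics storage (the real code reads and
// writes metrics rows through the logic package, not a map like this).
var metricsStore = map[string]*Metrics{}

// setPeerMetricsDisconnected marks nodeID as disconnected (Connected=false,
// Latency=999) in every peer's connectivity map, mirroring the disconnect path.
func setPeerMetricsDisconnected(nodeID string) {
    for peerID, m := range metricsStore {
        if peerID == nodeID {
            continue
        }
        if entry, ok := m.Connectivity[nodeID]; ok {
            entry.Connected = false
            entry.Latency = 999
            m.Connectivity[nodeID] = entry
        }
    }
}

// deleteNodeMetricsFromPeers removes nodeID's entry from every peer's
// connectivity map, mirroring the delete path.
func deleteNodeMetricsFromPeers(nodeID string) {
    for peerID, m := range metricsStore {
        if peerID == nodeID {
            continue
        }
        delete(m.Connectivity, nodeID)
    }
}

func main() {
    metricsStore["peer-a"] = &Metrics{Connectivity: map[string]Metric{
        "node-x": {Connected: true, Latency: 12},
    }}
    setPeerMetricsDisconnected("node-x")
    fmt.Printf("%+v\n", metricsStore["peer-a"].Connectivity["node-x"]) // {Connected:false Latency:999}
    deleteNodeMetricsFromPeers("node-x")
    fmt.Println(len(metricsStore["peer-a"].Connectivity)) // 0
}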
@tenki-reviewer
Contributor

tenki-reviewer Bot commented Apr 2, 2026

Tenki Code Review - Complete

Files Reviewed: 7
Findings: 2

By Severity:

  • 🟠 Medium: 2

The PR correctly propagates disconnected/deleted node metrics to peers and tightens PlatformUser network-permission checks, but contains a missing go keyword that causes the metrics monitor to block its own tick loop, and a newly-added goroutine that duplicates posture-check cleanup already performed by an existing helper.

Files Reviewed (7 files)
controllers/node.go
logic/metrics.go
logic/nodes.go
pro/controllers/users.go
pro/initialize.go
pro/logic/metrics.go
pro/logic/security.go


@tenki-reviewer tenki-reviewer Bot left a comment


Overview

This PR introduces two new peer-metrics helpers (SetPeerMetricsDisconnected, DeleteNodeMetricsFromPeers) wired through the existing function-variable dispatch pattern, updates node disconnect flows in controllers/node.go and logic/metrics.go, and adjusts the PlatformUser access-control path in pro/logic/security.go.

Finding 1 — Synchronous call blocks MetricsMonitor ticker (logic/metrics.go)

At logic/metrics.go:58, SetPeerMetricsDisconnected is called without the go keyword inside the metrics monitor's ticker goroutine. The function is O(N²): it fetches all nodes, then for each peer fetches and rewrites its metrics row. Calling it synchronously stalls the entire tick loop for every disconnected node found, potentially causing the monitor to fall far behind its configured interval. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly dispatch it with go.
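
For illustration, here is a tiny, self-contained sketch of the pattern at issue (the names, sleep duration, and interval are invented for this example, not the actual MetricsMonitor code): dispatching the expensive sweep with go keeps the ticker loop on schedule, while a synchronous call would stall it.

package main

import (
    "fmt"
    "sync"
    "time"
)

// expensiveSweep stands in for SetPeerMetricsDisconnected: an O(N^2) walk
// over every peer's metrics row, simulated here with a sleep.
func expensiveSweep(wg *sync.WaitGroup, nodeID string) {
    defer wg.Done()
    time.Sleep(2 * time.Second) // simulate slow DB work
    fmt.Println("cleaned up metrics for", nodeID)
}

func main() {
    var wg sync.WaitGroup
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for i := 0; i < 3; i++ {
        <-ticker.C
        // Calling expensiveSweep synchronously here would stall the loop far
        // past its 500ms interval; dispatching with `go` keeps it on schedule.
        wg.Add(1)
        go expensiveSweep(&wg, "node-x")
    }
    wg.Wait()
}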

Finding 2 — Duplicate posture-check cleanup on group update (pro/controllers/users.go)

The new goroutine at pro/controllers/users.go:650–658 iterates removed networks and calls proLogic.RemoveUserGroupFromPostureChecks for each. This is identical logic to what EnsureDefaultUserGroupNetworkPolicies already does at pro/logic/user_mgmt.go:1061–1065 — the same function that is already dispatched on line 647. The result is two concurrent calls to RemoveUserGroupFromPostureChecks per removed network on every group update.

Other Changes (no issues)

  • connectivity.Latency = 999 sentinel on disconnect is consistent across all affected paths.
  • pro/logic/security.go: splitting the PlatformUser GET-allow out of the combined Auditor || PlatformUser block and placing it after the netID check is logically correct — PlatformUser GET requests now correctly require a non-empty netID, matching the intent of the surrounding checks.
  • pro/initialize.go: new function-variable assignments are correct and follow the existing pattern.
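
As a side note on the function-variable dispatch pattern referenced in the last point, the following is an illustrative, self-contained sketch of how such hooks are typically wired (the variable and function names are invented for this example, not the actual declarations in pro/initialize.go):

package main

import "fmt"

// Community-edition side: a package-level function variable with a harmless
// no-op default, so CE builds can call it unconditionally.
var setPeerMetricsDisconnectedHook = func(nodeID string) {}

// Pro side: initialization overrides the hook with the real implementation.
func initPro() {
    setPeerMetricsDisconnectedHook = func(nodeID string) {
        fmt.Println("pro: marking", nodeID, "disconnected in peer metrics")
    }
}

func main() {
    setPeerMetricsDisconnectedHook("node-x") // no-op before pro wiring
    initPro()
    setPeerMetricsDisconnectedHook("node-x") // pro implementation after wiring
}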

Comment thread logic/metrics.go

_ = UpdateMetrics(node.ID.String(), nodeMetrics)
}
SetPeerMetricsDisconnected(node.ID.String())

🟠 SetPeerMetricsDisconnected called synchronously in MetricsMonitor loop (bug)

In logic/metrics.go line 58, SetPeerMetricsDisconnected(node.ID.String()) is called without the go keyword inside the metrics monitor's tick goroutine. The function iterates all nodes, fetches each peer's metrics row, and writes them back — an O(N²) database operation that will block the ticker from firing again until it finishes for every disconnected node in the current sweep. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly use go logic.SetPeerMetricsDisconnected(...). The omission here will cause the monitor to fall behind its configured interval in proportion to the number of disconnected nodes and total peer count.

💡 Suggestion: Add the go keyword to match all other call-sites and prevent blocking the ticker loop.

Suggested change
SetPeerMetricsDisconnected(node.ID.String())
go SetPeerMetricsDisconnected(node.ID.String())
📋 Prompt for AI Agents

In logic/metrics.go at line 58, change SetPeerMetricsDisconnected(node.ID.String()) to go SetPeerMetricsDisconnected(node.ID.String()). The function is an O(N²) DB operation (reads and writes every peer's metrics row); calling it synchronously blocks the MetricsMonitor ticker goroutine and causes it to fall behind its configured interval. All other call-sites (controllers/node.go lines 706 and 894) already use go for this call.

Comment thread pro/controllers/users.go
Comment on lines +650 to +658
go func() {
    removedNetworks, _ := proLogic.GetGroupNetworksMap(&currUserG)
    keptNetworks, _ := proLogic.GetGroupNetworksMap(&userGroup)
    for netID := range removedNetworks {
        if _, ok := keptNetworks[netID]; !ok {
            proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID)
        }
    }
}()

🟠 RemoveUserGroupFromPostureChecks called twice per removed network on group update (bug)

The new goroutine added at pro/controllers/users.go:650–658 computes a removedNetworks vs keptNetworks diff and calls proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID) for each network that was removed. However, EnsureDefaultUserGroupNetworkPolicies — already dispatched on line 647 — performs the identical diff at pro/logic/user_mgmt.go:1061–1065 and calls RemoveUserGroupFromPostureChecks for those same networks. Every group update that removes a network now triggers two concurrent calls to RemoveUserGroupFromPostureChecks for each removed network. While the operation is currently idempotent (it only deletes), this is duplicated logic that will silently double-execute if the function ever gains side-effects.

💡 Suggestion: Remove the new goroutine entirely. EnsureDefaultUserGroupNetworkPolicies already handles posture-check cleanup for removed networks. If the intent was a standalone cleanup path, remove the equivalent code from EnsureDefaultUserGroupNetworkPolicies instead to avoid the duplication.

📋 Prompt for AI Agents

In pro/controllers/users.go, remove the goroutine at lines 650-658 (the go func() { ... }() block that calls proLogic.RemoveUserGroupFromPostureChecks). This logic is already executed inside proLogic.EnsureDefaultUserGroupNetworkPolicies (called on line 647): see pro/logic/user_mgmt.go lines 1061-1065, which perform the same removed-network diff and call RemoveUserGroupFromPostureChecks for each. Keeping both causes two concurrent removals per affected network on every group update.
