NM-300: Clean up stale metrics on node deletion and disconnect #3956

abhishek9686 wants to merge 61 commits into develop from …
Conversation
…to fixes/v1.5.1
…D or schema.AllNetworks (all_networks)
- Extract cleanupNodeReferences() from DeleteNode to handle best-effort cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
- DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID directly, bypassing redundant host-association updates that could fail and leave nodes orphaned
- Track failed node deletions instead of unconditionally clearing host.Nodes; return error if any nodes couldn't be deleted to prevent host removal
- Fix DeleteNode returning error even when node was successfully deleted in the host-not-found path
…path (when purge=true or alreadyDeleted=true), so it won't execute twice — once on soft-delete and again when the zombie manager hard-deletes.
… for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()
* fix(go): set persistent keep alive when registering host using sso
* fix(go): run posture check violations on delete
* fix(go): upsert node on approving pending host
* fix(go): resolve concurrency issues during group delete cleanup
* fix(go): update doc links
* fix(go): add created and updated fields to host
* fix(go): skip delete and update superadmin on sync users
* fix(go): use conn directly for now
* fix(go): remove acl for idp groups
* fix(go): quote fields
* fix(go): use filters with count
* feat(go): add a search query
* fix(go): cleanup acls
* fix(go): review fixes
* fix(go): remove additional loop
* fix(go): fix
* v1.5.1: separate out idp sync and reset signals for HA
* v1.5.1: add grps with name for logging
* v1.5.1: clear posture check violations when all checks are deleted
* v1.5.1: set static when default host
* v1.5.1: fix db status check
* rm set max conns
* v1.5.1: reset auto assigned gw when disconnected
* fix(go): skip global network admin and user groups when splitting
* v1.5.1: fix update node call from client
* fix(go): separate out migration from normal usage
* fix(go): skip default groups
* fix(go): create policies for existing groups on network create
* fix(go): skip fatal log on clickhouse conn
* fix(go): add posture check cleanup
* NM-288: populate relevant name for acl types for UI
* NM-288: populate grp names for posture check apis
* NM-228: add network grps api
* NM-288: add network users api
* now check each group's NetworkRoles for either the specific network ID or schema.AllNetworks (all_networks)
* NM-288: check and unassign auto gw when node is disconnected from cli
* NM-288: optimise network users api call
* NM-288: block auto assign when set to use inet gw
* NM-288: pass all network scoped get apis for platform users
* NM-288: fix orphan nodes on bulk host deletion and refactor node cleanup
  - Extract cleanupNodeReferences() from DeleteNode to handle best-effort cleanup of relay, internet gw, failover, nameserver, ACL, and egress refs
  - DisassociateAllNodesFromHost now calls cleanupNodeReferences + DeleteNodeByID directly, bypassing redundant host-association updates that could fail and leave nodes orphaned
  - Track failed node deletions instead of unconditionally clearing host.Nodes; return error if any nodes couldn't be deleted to prevent host removal
  - Fix DeleteNode returning error even when node was successfully deleted in the host-not-found path
* NM-288: cleanupNodeReferences(node) now only runs in the hard-delete path (when purge=true or alreadyDeleted=true), so it won't execute twice: once on soft-delete and again when the zombie manager hard-deletes.
* NM-288: add orphan deletion log
* DisassociateAllNodesFromHost now always returns nil, logging warnings for any partial failures (upsert errors or undeletable nodes). This allows RemoveHost to always proceed to h.Delete()
* NM-288: move the platform user check in the middleware

---------

Co-authored-by: VishalDalwadi <dalwadivishal26@gmail.com>
Co-authored-by: Vishal Dalwadi <51291657+VishalDalwadi@users.noreply.github.com>
v1.5.1: cleanup grps from posture checks on grp updates
When a node is deleted or disconnected, peer nodes retained stale metrics (showing the node as connected with outdated latency values). This adds cleanup logic to:
- Remove the deleted node's entry from all peers' connectivity maps
- Mark the node as disconnected (Connected=false, Latency=999) in peers' metrics when a node goes offline
- Run periodic peer metrics cleanup in MetricsMonitor as a safety net for nodes that drop off without a graceful disconnect
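The intended behavior of the two new helpers can be sketched with simplified, hypothetical types (the real Netmaker metrics model and storage layer differ; this is an in-memory illustration only):

```go
package main

import "fmt"

// PeerConnectivity and Metrics are hypothetical simplifications of the
// per-peer metrics entries described in the PR.
type PeerConnectivity struct {
	Connected bool
	Latency   int64
}

type Metrics struct {
	Connectivity map[string]PeerConnectivity
}

// setPeerMetricsDisconnected marks nodeID as offline (Connected=false,
// Latency=999) in every peer's connectivity map, mirroring what
// SetPeerMetricsDisconnected does on disconnect.
func setPeerMetricsDisconnected(peers map[string]*Metrics, nodeID string) {
	for _, m := range peers {
		if conn, ok := m.Connectivity[nodeID]; ok {
			conn.Connected = false
			conn.Latency = 999 // sentinel used across the affected paths
			m.Connectivity[nodeID] = conn
		}
	}
}

// deleteNodeMetricsFromPeers removes the deleted node's entry entirely,
// mirroring DeleteNodeMetricsFromPeers on node deletion.
func deleteNodeMetricsFromPeers(peers map[string]*Metrics, nodeID string) {
	for _, m := range peers {
		delete(m.Connectivity, nodeID)
	}
}

func main() {
	peers := map[string]*Metrics{
		"peer1": {Connectivity: map[string]PeerConnectivity{
			"nodeA": {Connected: true, Latency: 12},
		}},
	}
	setPeerMetricsDisconnected(peers, "nodeA")
	fmt.Println(peers["peer1"].Connectivity["nodeA"].Latency) // 999
	deleteNodeMetricsFromPeers(peers, "nodeA")
	fmt.Println(len(peers["peer1"].Connectivity)) // 0
}
```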
Tenki Code Review - Complete

Files Reviewed: 7

The PR correctly propagates disconnected/deleted node metrics to peers and tightens the PlatformUser access-control path.
Overview
This PR introduces two new peer-metrics helpers (SetPeerMetricsDisconnected, DeleteNodeMetricsFromPeers) wired through the existing function-variable dispatch pattern, updates node disconnect flows in controllers/node.go and logic/metrics.go, and adjusts the PlatformUser access-control path in pro/logic/security.go.
Finding 1 — Synchronous call blocks MetricsMonitor ticker (logic/metrics.go)
At logic/metrics.go:58, SetPeerMetricsDisconnected is called without the go keyword inside the metrics monitor's ticker goroutine. The function is O(N²): it fetches all nodes, then for each peer fetches and rewrites its metrics row. Calling it synchronously stalls the entire tick loop for every disconnected node found, potentially causing the monitor to fall far behind its configured interval. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly dispatch it with go.
Finding 2 — Duplicate posture-check cleanup on group update (pro/controllers/users.go)
The new goroutine at pro/controllers/users.go:650–658 iterates removed networks and calls proLogic.RemoveUserGroupFromPostureChecks for each. This is identical logic to what EnsureDefaultUserGroupNetworkPolicies already does at pro/logic/user_mgmt.go:1061–1065 — the same function that is already dispatched on line 647. The result is two concurrent calls to RemoveUserGroupFromPostureChecks per removed network on every group update.
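The overlapping diff both call-sites perform can be sketched as a standalone helper; the map shape here is a hypothetical simplification of what GetGroupNetworksMap returns:

```go
package main

import (
	"fmt"
	"sort"
)

// removedNetworks returns the network IDs present in the old group's
// network map but absent from the updated one: the same diff that both
// the controller goroutine and EnsureDefaultUserGroupNetworkPolicies
// compute before calling RemoveUserGroupFromPostureChecks.
func removedNetworks(old, updated map[string]bool) []string {
	var out []string
	for netID := range old {
		if !updated[netID] {
			out = append(out, netID)
		}
	}
	sort.Strings(out) // deterministic order for logging and tests
	return out
}

func main() {
	old := map[string]bool{"net1": true, "net2": true, "net3": true}
	updated := map[string]bool{"net2": true}
	fmt.Println(removedNetworks(old, updated)) // [net1 net3]
}
```

Because both paths compute this same diff and act on it, every removed network gets two concurrent cleanup calls, which is the duplication the finding recommends collapsing into one place.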
Other Changes (no issues)
- The `connectivity.Latency = 999` sentinel on disconnect is consistent across all affected paths.
- `pro/logic/security.go`: splitting the `PlatformUser` GET-allow out of the combined `Auditor || PlatformUser` block and placing it after the `netID` check is logically correct; `PlatformUser` GET requests now correctly require a non-empty `netID`, matching the intent of the surrounding checks.
- `pro/initialize.go`: the new function-variable assignments are correct and follow the existing pattern.
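The effect of reordering the checks in security.go can be illustrated with a deliberately simplified, hypothetical version of the allow logic (the real middleware has many more roles and conditions):

```go
package main

import "fmt"

// allowRead sketches the check order after the change: the Auditor
// GET-allow stays ahead of the netID check, while the PlatformUser
// GET-allow sits after it, so platform users can only read
// network-scoped resources when a netID is present.
func allowRead(role, netID string) bool {
	if role == "auditor" {
		return true // auditors may read everything
	}
	if netID == "" {
		return false // every other role needs a network scope
	}
	if role == "platform-user" {
		return true
	}
	return false
}

func main() {
	fmt.Println(allowRead("platform-user", ""))     // false: no netID
	fmt.Println(allowRead("platform-user", "net1")) // true
	fmt.Println(allowRead("auditor", ""))           // true
}
```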
```go
		_ = UpdateMetrics(node.ID.String(), nodeMetrics)
	}
	SetPeerMetricsDisconnected(node.ID.String())
```
🟠 SetPeerMetricsDisconnected called synchronously in MetricsMonitor loop (bug)
In logic/metrics.go line 58, SetPeerMetricsDisconnected(node.ID.String()) is called without the go keyword inside the metrics monitor's tick goroutine. The function iterates all nodes, fetches each peer's metrics row, and writes them back — an O(N²) database operation that will block the ticker from firing again until it finishes for every disconnected node in the current sweep. Both other call-sites (controllers/node.go:706 and controllers/node.go:894) correctly use go logic.SetPeerMetricsDisconnected(...). The omission here will cause the monitor to fall behind its configured interval in proportion to the number of disconnected nodes and total peer count.
💡 Suggestion: Add the go keyword to match all other call-sites and prevent blocking the ticker loop.
```diff
-SetPeerMetricsDisconnected(node.ID.String())
+go SetPeerMetricsDisconnected(node.ID.String())
```
📋 Prompt for AI Agents
In logic/metrics.go at line 58, change SetPeerMetricsDisconnected(node.ID.String()) to go SetPeerMetricsDisconnected(node.ID.String()). The function is an O(N²) DB operation (reads and writes every peer's metrics row); calling it synchronously blocks the MetricsMonitor ticker goroutine and causes it to fall behind its configured interval. All other call-sites (controllers/node.go lines 706 and 894) already use go for this call.
```go
go func() {
	removedNetworks, _ := proLogic.GetGroupNetworksMap(&currUserG)
	keptNetworks, _ := proLogic.GetGroupNetworksMap(&userGroup)
	for netID := range removedNetworks {
		if _, ok := keptNetworks[netID]; !ok {
			proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID)
		}
	}
}()
```
🟠 RemoveUserGroupFromPostureChecks called twice per removed network on group update (bug)
The new goroutine added at pro/controllers/users.go:650–658 computes a removedNetworks vs keptNetworks diff and calls proLogic.RemoveUserGroupFromPostureChecks(userGroup.ID, netID) for each network that was removed. However, EnsureDefaultUserGroupNetworkPolicies — already dispatched on line 647 — performs the identical diff at pro/logic/user_mgmt.go:1061–1065 and calls RemoveUserGroupFromPostureChecks for those same networks. Every group update that removes a network now triggers two concurrent calls to RemoveUserGroupFromPostureChecks for each removed network. While the operation is currently idempotent (it only deletes), this is duplicated logic that will silently double-execute if the function ever gains side-effects.
💡 Suggestion: Remove the new goroutine entirely. EnsureDefaultUserGroupNetworkPolicies already handles posture-check cleanup for removed networks. If the intent was a standalone cleanup path, remove the equivalent code from EnsureDefaultUserGroupNetworkPolicies instead to avoid the duplication.
📋 Prompt for AI Agents
In pro/controllers/users.go, remove the goroutine at lines 650-658 (the go func() { ... }() block that calls proLogic.RemoveUserGroupFromPostureChecks). This logic is already executed inside proLogic.EnsureDefaultUserGroupNetworkPolicies (called on line 647): see pro/logic/user_mgmt.go lines 1061-1065, which perform the same removed-network diff and call RemoveUserGroupFromPostureChecks for each. Keeping both causes two concurrent removals per affected network on every group update.
Describe your changes
Provide Issue ticket number if applicable/not in title
Provide testing steps
Checklist before requesting a review