Commit 657975d

MrFreezeex authored and julianwiedmann committed
clustermesh: remove self reporting of cluster/node name
Those metrics should be directly inferred by users' Prometheus config. When Cilium installs ServiceMonitor we in fact already add nodes and we can pretty much expect users with multiple clusters to add their own label to differentiate clusters.

Note that this is not removing the source_cluster label everywhere because in KVStoreMesh it has a real meaning!

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
1 parent fc32034 commit 657975d
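
The commit message relies on the scrape side attaching cluster and node labels. As a rough illustration of that idea only (this is not Cilium or Prometheus code; the `Sample` type, the `withExternalLabels` helper, and the label names `cluster` and `node` are hypothetical), a collector could enrich samples that carry only `target_cluster` like this:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical, simplified metric sample.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// withExternalLabels returns a copy of s with the extra labels added.
// Labels already set on the sample take precedence over the extra ones.
func withExternalLabels(s Sample, extra map[string]string) Sample {
	out := Sample{
		Name:   s.Name,
		Value:  s.Value,
		Labels: make(map[string]string, len(s.Labels)+len(extra)),
	}
	for k, v := range extra {
		out.Labels[k] = v
	}
	for k, v := range s.Labels {
		out.Labels[k] = v
	}
	return out
}

// format renders the sample in a Prometheus-like text form with sorted labels.
func format(s Sample) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.Labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", s.Name, strings.Join(pairs, ","), s.Value)
}

func main() {
	// After this commit the exporter only sets target_cluster.
	s := Sample{
		Name:   "clustermesh_remote_cluster_nodes",
		Labels: map[string]string{"target_cluster": "cluster2"},
		Value:  3,
	}
	// The collection system adds the local cluster and node identity.
	extra := map[string]string{"cluster": "cluster1", "node": "node-a"}
	fmt.Println(format(withExternalLabels(s, extra)))
}
```

In a real deployment this enrichment would normally live in the metrics collection configuration rather than in application code.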

12 files changed, +85 -92 lines changed

Documentation/observability/metrics.rst

Lines changed: 39 additions & 39 deletions
@@ -514,17 +514,17 @@ Name Labels
 Clustermesh
 ~~~~~~~~~~~
 
-=============================================== ============================================================ ========== =================================================================
-Name                                            Labels                                                       Default    Description
-=============================================== ============================================================ ========== =================================================================
-``clustermesh_remote_cluster_services``         ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of services per remote cluster
-``clustermesh_remote_cluster_endpoints``        ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of endpoints per remote cluster
-``clustermesh_remote_cluster_nodes``            ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of nodes per remote cluster
-``clustermesh_remote_clusters``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of remote clusters meshed with the local cluster
-``clustermesh_remote_cluster_failures``         ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of failures related to the remote cluster
-``clustermesh_remote_cluster_last_failure_ts``  ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
-``clustermesh_remote_cluster_readiness_status`` ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The readiness status of the remote cluster
-=============================================== ============================================================ ========== =================================================================
+=============================================== ================== ========== =================================================================
+Name                                            Labels             Default    Description
+=============================================== ================== ========== =================================================================
+``clustermesh_remote_cluster_services``         ``target_cluster`` Enabled    The total number of services per remote cluster
+``clustermesh_remote_cluster_endpoints``        ``target_cluster`` Enabled    The total number of endpoints per remote cluster
+``clustermesh_remote_cluster_nodes``            ``target_cluster`` Enabled    The total number of nodes per remote cluster
+``clustermesh_remote_clusters``                                    Enabled    The total number of remote clusters meshed with the local cluster
+``clustermesh_remote_cluster_failures``         ``target_cluster`` Enabled    The total number of failures related to the remote cluster
+``clustermesh_remote_cluster_last_failure_ts``  ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
+``clustermesh_remote_cluster_readiness_status`` ``target_cluster`` Enabled    The readiness status of the remote cluster
+=============================================== ================== ========== =================================================================
 
 Datapath
 ~~~~~~~~
@@ -1019,16 +1019,16 @@ Name Labels
 Clustermesh
 ~~~~~~~~~~~
 
-=============================================== ============================================================ ========== ==================================================================
-Name                                            Labels                                                       Default    Description
-=============================================== ============================================================ ========== ==================================================================
-``clustermesh_remote_clusters``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of remote clusters meshed with the local cluster
-``clustermesh_remote_cluster_failures``         ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of failures related to the remote cluster
-``clustermesh_remote_cluster_last_failure_ts``  ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
-``clustermesh_remote_cluster_readiness_status`` ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The readiness status of the remote cluster
-``clustermesh_remote_cluster_services``         ``source_cluster``, ``target_cluster``                       Enabled    The total number of services per remote cluster
-``clustermesh_remote_cluster_service_exports``  ``source_cluster``, ``target_cluster``                       Enabled    The total number of MCS-API service exports per remote cluster
-=============================================== ============================================================ ========== ==================================================================
+=============================================== ================== ========== ==================================================================
+Name                                            Labels             Default    Description
+=============================================== ================== ========== ==================================================================
+``clustermesh_remote_clusters``                                    Enabled    The total number of remote clusters meshed with the local cluster
+``clustermesh_remote_cluster_failures``         ``target_cluster`` Enabled    The total number of failures related to the remote cluster
+``clustermesh_remote_cluster_last_failure_ts``  ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
+``clustermesh_remote_cluster_readiness_status`` ``target_cluster`` Enabled    The readiness status of the remote cluster
+``clustermesh_remote_cluster_services``         ``target_cluster`` Enabled    The total number of services per remote cluster
+``clustermesh_remote_cluster_service_exports``  ``target_cluster`` Enabled    The total number of MCS-API service exports per remote cluster
+=============================================== ================== ========== ==================================================================
 
 
 Hubble
@@ -1478,11 +1478,11 @@ Prometheus namespace.
 Bootstrap
 ~~~~~~~~~
 
-======================================== ============================================ ========================================================
-Name                                     Labels                                       Description
-======================================== ============================================ ========================================================
-``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
-======================================== ============================================ ========================================================
+======================================== ========================================================
+Name                                     Description
+======================================== ========================================================
+``bootstrap_seconds``                    Duration in seconds to complete bootstrap
+======================================== ========================================================
 
 KVstore
 ~~~~~~~
@@ -1550,11 +1550,11 @@ All metrics are exported under the ``cilium_kvstoremesh_`` Prometheus namespace.
 Bootstrap
 ~~~~~~~~~
 
-======================================== ============================================ ========================================================
-Name                                     Labels                                       Description
-======================================== ============================================ ========================================================
-``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
-======================================== ============================================ ========================================================
+======================================== ========================================================
+Name                                     Description
+======================================== ========================================================
+``bootstrap_seconds``                    Duration in seconds to complete bootstrap
+======================================== ========================================================
 
 KVStoremesh
 ~~~~~~~~~~~
@@ -1570,14 +1570,14 @@ Clustermesh
 
 Note that these metrics are not prefixed by ``clustermesh_``.
 
-=============================================== ============================================================ ==================================================================
-Name                                            Labels                                                       Description
-=============================================== ============================================================ ==================================================================
-``remote_clusters``                             ``source_cluster``, ``source_node_name``                     The total number of remote clusters meshed with the local cluster
-``remote_cluster_failures``                     ``source_cluster``, ``source_node_name``, ``target_cluster`` The total number of failures related to the remote cluster
-``remote_cluster_last_failure_ts``              ``source_cluster``, ``source_node_name``, ``target_cluster`` The timestamp of the last failure of the remote cluster
-``remote_cluster_readiness_status``             ``source_cluster``, ``source_node_name``, ``target_cluster`` The readiness status of the remote cluster
-=============================================== ============================================================ ==================================================================
+=============================================== ================== ==================================================================
+Name                                            Labels             Description
+=============================================== ================== ==================================================================
+``remote_clusters``                                                The total number of remote clusters meshed with the local cluster
+``remote_cluster_failures``                     ``target_cluster`` The total number of failures related to the remote cluster
+``remote_cluster_last_failure_ts``              ``target_cluster`` The timestamp of the last failure of the remote cluster
+``remote_cluster_readiness_status``             ``target_cluster`` The readiness status of the remote cluster
+=============================================== ================== ==================================================================
 
 KVstore
 ~~~~~~~

Documentation/operations/upgrade.rst

Lines changed: 11 additions & 0 deletions
@@ -317,6 +317,8 @@ communicating via the proxy must reconnect to re-establish connections.
   This may change in future. See :gh-issue:`35823`
   and :gh-issue:`17177` for further discussion on this topic.
 * MCS-API CRDs need to be updated, see the MCS-API :ref:`clustermesh_mcsapi_prereqs` for updated CRD links.
+* Cilium will stop reporting its local cluster name and node name in metrics. Users relying on those
+  should configure their metrics collection system to add similar labels instead.
 
 Removed Options
 ~~~~~~~~~~~~~~~
@@ -398,6 +400,15 @@ now report per cluster metric instead of a "global" count and were renamed to re
 The following metrics no longer reports a ``source_cluster`` and a ``source_node_name`` label:
 * ``node_health_connectivity_status``
 * ``node_health_connectivity_latency_seconds``
+* ``bootstrap_seconds``
+* ``*_remote_clusters``
+* ``*_remote_cluster_last_failure_ts``
+* ``*_remote_cluster_readiness_status``
+* ``*_remote_cluster_failures``
+* ``*_remote_cluster_nodes``
+* ``*_remote_cluster_services``
+* ``*_remote_cluster_endpoints``
+* ``cilium_operator_clustermesh_remote_cluster_service_exports``
 
 
 Deprecated Metrics

clustermesh-apiserver/syncstate/syncstate.go

Lines changed: 5 additions & 6 deletions
@@ -8,7 +8,6 @@ import (
 
 	"github.com/cilium/hive/cell"
 
-	"github.com/cilium/cilium/pkg/clustermesh/types"
 	"github.com/cilium/cilium/pkg/lock"
 	"github.com/cilium/cilium/pkg/metrics"
 	"github.com/cilium/cilium/pkg/metrics/metric"
@@ -23,13 +22,13 @@ var Cell = cell.Module(
 	cell.Provide(new),
 )
 
-func new(lc cell.Lifecycle, metrics Metrics, clusterInfo types.ClusterInfo) SyncState {
+func new(lc cell.Lifecycle, metrics Metrics) SyncState {
 	ss := SyncState{StoppableWaitGroup: lock.NewStoppableWaitGroup()}
 
 	go func() {
 		syncTime := spanstat.Start()
 		<-ss.WaitChannel()
-		metrics.BootstrapDuration.WithLabelValues(clusterInfo.Name).Set(syncTime.Seconds())
+		metrics.BootstrapDuration.Set(syncTime.Seconds())
 	}()
 	return ss
 }
@@ -63,15 +62,15 @@ func (ss SyncState) WaitForResource() func(context.Context) {
 // clustermesh-apiserver or kvstoremesh.
 type Metrics struct {
 	// BootstrapDuration tracks the duration in seconds until ready to serve requests.
-	BootstrapDuration metric.Vec[metric.Gauge]
+	BootstrapDuration metric.Gauge
 }
 
 func MetricsProvider() Metrics {
 	return Metrics{
-		BootstrapDuration: metric.NewGaugeVec(metric.GaugeOpts{
+		BootstrapDuration: metric.NewGauge(metric.GaugeOpts{
 			Namespace: metrics.Namespace,
 			Name:      "bootstrap_seconds",
 			Help:      "Duration in seconds to complete bootstrap",
-		}, []string{metrics.LabelSourceCluster}),
+		}),
 	}
 }

pkg/clustermesh/clustermesh.go

Lines changed: 5 additions & 12 deletions
@@ -25,7 +25,6 @@ import (
 	"github.com/cilium/cilium/pkg/kvstore/store"
 	"github.com/cilium/cilium/pkg/logging/logfields"
 	nodeStore "github.com/cilium/cilium/pkg/node/store"
-	nodeTypes "github.com/cilium/cilium/pkg/node/types"
 	"github.com/cilium/cilium/pkg/source"
 )
 
@@ -37,7 +36,7 @@ type Configuration struct {
 	common.Config
 	wait.TimeoutConfig
 
-	// ClusterInfo is the id/name of the local cluster. This is used for logging and metrics
+	// ClusterInfo is the id/name of the local cluster.
 	ClusterInfo cmtypes.ClusterInfo
 
 	// RemoteClientFactory is the factory to create new backend instances.
@@ -112,9 +111,6 @@ type ClusterMesh struct {
 	// is protected by its own mutex inside the structure.
 	globalServices *common.GlobalServiceCache
 
-	// nodeName is the name of the local node. This is used for logging and metrics
-	nodeName string
-
 	// syncTimeoutLogOnce ensures that the warning message triggered upon failure
 	// waiting for remote clusters synchronization is output only once.
 	syncTimeoutLogOnce sync.Once
@@ -130,10 +126,8 @@ func NewClusterMesh(lifecycle cell.Lifecycle, c Configuration) *ClusterMesh {
 		return nil
 	}
 
-	nodeName := nodeTypes.GetName()
 	cm := &ClusterMesh{
 		conf:           c,
-		nodeName:       nodeName,
 		globalServices: common.NewGlobalServiceCache(c.Logger),
 		FeatureMetrics: c.FeatureMetrics,
 	}
@@ -157,8 +151,7 @@ func NewClusterMesh(lifecycle cell.Lifecycle, c Configuration) *ClusterMesh {
 
 		NewRemoteCluster: cm.NewRemoteCluster,
 
-		NodeName: nodeName,
-		Metrics:  c.CommonMetrics,
+		Metrics: c.CommonMetrics,
 	})
 
 	lifecycle.Append(cm.common)
@@ -188,7 +181,7 @@ func (cm *ClusterMesh) NewRemoteCluster(name string, status common.StatusFunc) c
 		),
 		nodeStore.NewNodeObserver(cm.conf.NodeObserver, source.ClusterMesh),
 		store.RWSWithOnSyncCallback(func(ctx context.Context) { close(rc.synced.nodes) }),
-		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalNodes.WithLabelValues(cm.conf.ClusterInfo.Name, cm.nodeName, rc.name)),
+		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalNodes.WithLabelValues(rc.name)),
 	)
 
 	rc.remoteServices = cm.conf.StoreFactory.NewWatchStore(
@@ -205,14 +198,14 @@ func (cm *ClusterMesh) NewRemoteCluster(name string, status common.StatusFunc) c
 			cm.conf.ServiceMerger.MergeExternalServiceDelete,
 		),
 		store.RWSWithOnSyncCallback(func(ctx context.Context) { close(rc.synced.services) }),
-		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalServices.WithLabelValues(cm.conf.ClusterInfo.Name, cm.nodeName, rc.name)),
+		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalServices.WithLabelValues(rc.name)),
 	)
 
 	rc.ipCacheWatcher = ipcache.NewIPIdentityWatcher(
 		cm.conf.Logger,
 		name, cm.conf.IPCache, cm.conf.StoreFactory, source.ClusterMesh,
 		store.RWSWithOnSyncCallback(func(ctx context.Context) { close(rc.synced.ipcache) }),
-		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalEndpoints.WithLabelValues(cm.conf.ClusterInfo.Name, cm.nodeName, rc.name)),
+		store.RWSWithEntriesMetric(cm.conf.Metrics.TotalEndpoints.WithLabelValues(rc.name)),
 	)
 	rc.ipCacheWatcherExtraOpts = cm.conf.IPCacheWatcherExtraOpts
 

pkg/clustermesh/common/clustermesh.go

Lines changed: 6 additions & 9 deletions
@@ -30,7 +30,7 @@ type Configuration struct {
 
 	Config
 
-	// ClusterInfo is the id/name of the local cluster. This is used for logging and metrics
+	// ClusterInfo is the id/name of the local cluster.
 	ClusterInfo types.ClusterInfo
 
 	// RemoteClientFactory is the factory to create new backend instances.
@@ -39,9 +39,6 @@ type Configuration struct {
 	// NewRemoteCluster is a function returning a new implementation of the remote cluster business logic.
 	NewRemoteCluster RemoteClusterCreatorFunc
 
-	// nodeName is the name of the local node. This is used for logging and metrics
-	NodeName string
-
 	// ClusterSizeDependantInterval allows to calculate intervals based on cluster size.
 	ClusterSizeDependantInterval kvstore.ClusterSizeDependantIntervalFunc
 
@@ -153,9 +150,9 @@ func (cm *clusterMesh) newRemoteCluster(name, path string) *remoteCluster {
 		remoteClientFactory: cm.conf.RemoteClientFactory,
 		clusterLockFactory: newClusterLock,
 
-		metricLastFailureTimestamp: cm.conf.Metrics.LastFailureTimestamp.WithLabelValues(cm.conf.ClusterInfo.Name, cm.conf.NodeName, name),
-		metricReadinessStatus:      cm.conf.Metrics.ReadinessStatus.WithLabelValues(cm.conf.ClusterInfo.Name, cm.conf.NodeName, name),
-		metricTotalFailures:        cm.conf.Metrics.TotalFailures.WithLabelValues(cm.conf.ClusterInfo.Name, cm.conf.NodeName, name),
+		metricLastFailureTimestamp: cm.conf.Metrics.LastFailureTimestamp.WithLabelValues(name),
+		metricReadinessStatus:      cm.conf.Metrics.ReadinessStatus.WithLabelValues(name),
+		metricTotalFailures:        cm.conf.Metrics.TotalFailures.WithLabelValues(name),
 	}
 
 	rc.RemoteCluster = cm.conf.NewRemoteCluster(name, rc.status)
@@ -197,7 +194,7 @@ func (cm *clusterMesh) addLocked(name, path string) {
 		cm.clusters[name] = cluster
 	}
 
-	cm.conf.Metrics.TotalRemoteClusters.WithLabelValues(cm.conf.ClusterInfo.Name, cm.conf.NodeName).Set(float64(len(cm.clusters)))
+	cm.conf.Metrics.TotalRemoteClusters.Set(float64(len(cm.clusters)))
 
 	cluster.connect()
 }
@@ -220,7 +217,7 @@ func (cm *clusterMesh) remove(name string) {
 
 	cm.tombstones[name] = removed
 	delete(cm.clusters, name)
-	cm.conf.Metrics.TotalRemoteClusters.WithLabelValues(cm.conf.ClusterInfo.Name, cm.conf.NodeName).Set(float64(len(cm.clusters)))
+	cm.conf.Metrics.TotalRemoteClusters.Set(float64(len(cm.clusters)))
 
 	cm.wg.Add(1)
 	go func() {
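
The call-site changes in these files all drop label values from `WithLabelValues`. Label-vector metrics generally expect exactly one value per declared label name, which is presumably why the callers change together with the metric definitions. A minimal sketch of that constraint, using a hypothetical stdlib-only `gaugeVec` type (not Cilium's `metric.Vec` and not the Prometheus client, which signals a mismatch differently):

```go
package main

import (
	"fmt"
	"strings"
)

// gaugeVec is a hypothetical stand-in for a labeled gauge family.
type gaugeVec struct {
	labelNames []string
	values     map[string]float64
}

// newGaugeVec declares a gauge family with a fixed set of label names.
func newGaugeVec(labelNames ...string) *gaugeVec {
	return &gaugeVec{labelNames: labelNames, values: map[string]float64{}}
}

// set stores v for the given label values. It returns an error when the
// number of values does not match the number of declared label names.
func (g *gaugeVec) set(v float64, labelValues ...string) error {
	if len(labelValues) != len(g.labelNames) {
		return fmt.Errorf("expected %d label values, got %d",
			len(g.labelNames), len(labelValues))
	}
	g.values[strings.Join(labelValues, "\x00")] = v
	return nil
}

func main() {
	// After the commit only target_cluster is declared on the metric.
	nodes := newGaugeVec("target_cluster")

	// New call shape: a single value for target_cluster.
	fmt.Println(nodes.set(3, "cluster2"))

	// Old call shape (cluster, node, target) no longer matches the declaration.
	fmt.Println(nodes.set(3, "cluster1", "node-a", "cluster2"))
}
```

Metrics that lose all of their labels, such as `bootstrap_seconds` and `remote_clusters` in this commit, become plain gauges and are set directly.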
