-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
*: add v3 snapshot metrics (fsync, network) #9997
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9997 +/- ##
=========================================
+ Coverage 69.24% 69.65% +0.4%
=========================================
Files 386 386
Lines 36030 36057 +27
=========================================
+ Hits 24948 25114 +166
+ Misses 9276 9158 -118
+ Partials 1806 1785 -21
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks right. A couple minor suggestions for the names, help strings and bucket sizes of the metrics.
etcdserver/api/rafthttp/http.go
Outdated
if r.Method != "POST" { | ||
w.Header().Set("Allow", "POST") | ||
http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed) | ||
snapshotReceiveFailures.WithLabelValues("").Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""
feels a bit vague. Should we add a constant we always use here that is more descriptive?
etcdserver/api/rafthttp/metrics.go
Outdated
Namespace: "etcd", | ||
Subsystem: "network", | ||
Name: "snapshot_send", | ||
Help: "Total number of snapshot sends", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Operators might not understand what this metric is for. They're more familiar with "snapshot save" and "snapshot restore" commands. Maybe clarify it a bit more? e.g. "Total number of snapshots send from a member to a client or another member" ?
etcdserver/api/rafthttp/metrics.go
Outdated
snapshotSend = prometheus.NewCounterVec(prometheus.CounterOpts{ | ||
Namespace: "etcd", | ||
Subsystem: "network", | ||
Name: "snapshot_send", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for successful sends only, right? Maybe just clarify that in the help string so that it's clear that the total sent count would be this + snapshot_send_failures
? Alternatively we could name this "snapshot_send_success", but that seems a bit too verbose..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Will suffix success
.
etcdserver/api/rafthttp/metrics.go
Outdated
Help: "Total latency distributions of v3 snapshot sends", | ||
|
||
// lowest bucket start of upper bound 0.001 sec (1 ms) with factor 2 | ||
// highest bucket start of 0.001 sec * 2^13 == 8.192 sec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 8 sec enough for the highest bucket? Would we spare a couple more metric entries to get this up to maybe 30 sec or so? Might be helpful if/when things go wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will update the highest bucket up to 30-second.
etcd_snap_db_fsync_duration_seconds_count etcd_snap_db_save_total_duration_seconds_bucket Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Distribution would be: 0.1 second or more ... 25.6 seconds or more 51.2 seconds or more etcd_network_snapshot_send_success etcd_network_snapshot_send_failures etcd_network_snapshot_send_total_duration_seconds etcd_network_snapshot_receive_success etcd_network_snapshot_receive_failures etcd_network_snapshot_receive_total_duration_seconds Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@jpbetz All addressed. PTAL. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…econds" metric Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message send latencies. Add Prometheus histogram to track Raft messages for writes, since heartbeats are probed (see etcd-io#10022) and snapshots are already being tracked via etcd-io#9997. ``` etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1 etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9 etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8 ``` Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
…97-upstream-release-3.1 Automated cherry pick of #9997
…econds" metric Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message send latencies. Add Prometheus histogram to track Raft messages for writes, since heartbeats are probed (see etcd-io#10022) and snapshots are already being tracked via etcd-io#9997. ``` etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1 etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9 etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8 etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8 ``` Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
…97-upstream-release-3.2 Automated cherry pick of #9997
…97-upstream-release-3.3 Automated cherry pick of #9997
v3 snapshots are large and critical operations, often affecting the cluster availabilities. Currently, we do not have any metrics around v3 snapshots. Only way to debug is looking at server logs after something bad happens.
to monitor v3 snapshot save operations on local node.
to monitor v3 snapshot operations between remote peers.
Distribution would be:
This records successful snapshot sends/receives as well, because frequent snapshots affect cluster availabilities, as bad as spikes in
etcd_network_snapshot_send/receive_failures
.Will update http://etcd.readthedocs.io/en/latest and operation docs in following PRs.
ref. #9438
/cc @wenjiaswe