Description
We are using the following Grafana dashboard for DRBD-reactor monitoring: https://grafana.com/grafana/dashboards/14339-drbd/
Since we created our first Proxmox VMs on top of Linstor DRBD volumes, the drbd_peerdevice_outofsync_bytes metric has been greater than 0, which makes us concerned about possible data loss if a node crashes. The dashboard shows this metric in red, so it presumably indicates a problem. The actual values are between 3 and 60 megabytes. During volume resyncs the value rises to as much as 500 megabytes, then drops back to 3-60.
Meanwhile, the drbdadm status command shows that all resources are UpToDate on all nodes. Does that mean they are in sync? Probably yes, but...
While checking the DRBD-reactor source code, I noticed that the dashboard only ignores drbd_peerdevice_outofsync_bytes when it equals 0:
example/grafana-dashboard.json:404: "expr": "sum(drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"}) by(name, instance) > 0",
example/grafana-dashboard.json:541: "expr": "drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"} > 0",
example/grafana-dashboard.json:552: "expr": "drbd_peerdevice_outofsync_bytes > 0",
Also, src/plugin/prometheus.rs shows that the exported value equals out_of_sync * 1024. So let's collect all non-zero drbd_peerdevice_outofsync_bytes results:
for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} curl -s http://127.0.0.1:9942/metrics | grep outofsync; done | grep -v 'volume="0"} 0'
...
drbd_peerdevice_outofsync_bytes{name="pm-f25a57b2",conn_name="phx1n3.hosting.namecheap.net",peer_node_id="4",volume="0"} 3891200
...
Get the original oos value:
echo '3891200 / 1024' | bc -l
3800.00000000000000000000
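The same conversion can be applied to every non-zero sample at once instead of one value at a time. This is only a sketch: the function name oos_kib_from_metrics is made up for illustration, and it assumes the value is the last whitespace-separated field on each metric line and that label values contain no spaces.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print each non-zero drbd_peerdevice_outofsync_bytes sample converted
# back to KiB (bytes / 1024), i.e. the unit of the "oos" counter.
oos_kib_from_metrics() {
    awk 'index($0, "drbd_peerdevice_outofsync_bytes{") == 1 && $NF + 0 > 0 {
        # $1 is the metric name with its labels, $NF is the sample value
        printf "%s %d\n", $1, $NF / 1024
    }'
}

# Usage against one node (same endpoint as in the loop above):
#   curl -s http://127.0.0.1:9942/metrics | oos_kib_from_metrics
```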
And finally grep for it under /sys/kernel/debug/drbd/:
for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} grep -nri 'oos:' /sys/kernel/debug/drbd/ | grep 3800; done | grep -v 'volume="0"} 0'
...
/sys/kernel/debug/drbd/resources/pm-f25a57b2/connections/phx1n3.hosting.namecheap.net/0/proc_drbd:2: ns:2900 nr:0 dw:788936076 dr:1013249328 al:1597 bm:617725 lo:0 pe:[0;0] ua:0 ap:[0;0] ep:1 wo:2 oos:3800
...
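Grepping for one specific number only finds that single value. A filter like the following lists every non-zero oos counter instead. It is a rough sketch: the function name nonzero_oos is invented here, and it assumes the `grep -n` output format shown above, where the file path and line number form the first field and the counter appears as a separate `oos:<number>` field.

```shell
# Hypothetical helper: read "grep -n" output of proc_drbd lines on stdin
# and print "<file:line:> <oos KiB>" for every non-zero oos counter.
nonzero_oos() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^oos:[0-9]+$/) {
                split($i, kv, ":")          # kv[2] holds the counter value
                if (kv[2] + 0 > 0) print $1, kv[2]
            }
        }
    }'
}

# Usage on one node:
#   grep -nri 'oos:' /sys/kernel/debug/drbd/ | nonzero_oos
```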
Now we can see that the same value, 3800, shows up for volume pm-f25a57b2. So why is the oos value still non-zero? Is it safe for us to have it non-zero? Are the DRBD resources still in sync?