drbd_peerdevice_outofsync_bytes(oos) metric never goes down to 0.

We are using the following Grafana board for DRBD-reactor monitoring: https://grafana.com/grafana/dashboards/14339-drbd/
Since we created our first Proxmox VMs on top of Linstor DRBD volumes, the **drbd_peerdevice_outofsync_bytes** metric has been greater than 0, which makes us concerned about possible data loss if a node crashes. The dashboard shows this metric in red, so it presumably indicates a problem. The actual values are between 3 and 60 megabytes. During volume resyncs it rises to as much as 500 megabytes, then drops back to 3-60.

Meanwhile, the **drbdadm status** command shows that all resources are **UpToDate** on all nodes. Does that mean they are in sync? Probably yes, but...

While checking the DRBD-reactor source code, I noticed that the dashboard queries only show **drbd_peerdevice_outofsync_bytes** when it is greater than 0, so the value has to be exactly 0 to be ignored:
```
example/grafana-dashboard.json:404:          "expr": "sum(drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"}) by(name, instance) > 0",
example/grafana-dashboard.json:541:          "expr": "drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"} > 0",
example/grafana-dashboard.json:552:          "expr": "drbd_peerdevice_outofsync_bytes > 0",
```
Also, in **src/plugin/prometheus.rs** we can see that the exported value equals **out_of_sync * 1024**. So, let's collect all **drbd_peerdevice_outofsync_bytes** results:
```
for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} curl -s http://127.0.0.1:9942/metrics | grep outofsync; done | grep -v 'volume="0"} 0'
...
drbd_peerdevice_outofsync_bytes{name="pm-f25a57b2",conn_name="phx1n3.hosting.namecheap.net",peer_node_id="4",volume="0"} 3891200
...
```
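The zero-value filter in the command above depends on `volume="0"` being the last label on the line. A minimal sketch of a filter that looks only at the sample value instead, assuming the value is the last whitespace-separated field (no trailing timestamp); `nonzero_oos` is a hypothetical helper name, not part of drbd-reactor:

```shell
# nonzero_oos: read Prometheus text-format metrics on stdin and print only
# the drbd_peerdevice_outofsync_bytes samples whose value is not 0.
# Assumption: the sample value is the last whitespace-separated field.
nonzero_oos() {
    awk '/^drbd_peerdevice_outofsync_bytes/ && $NF + 0 != 0'
}
```

It could be used as `ssh linstor-node1 curl -s http://127.0.0.1:9942/metrics | nonzero_oos`.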
Get the original **oos** value:
```
echo '3891200 / 1024' | bc -l
3800.00000000000000000000
```
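The same conversion can be done with plain shell integer arithmetic, which avoids the trailing decimals from `bc -l`. This is only a sketch: it assumes the exporter reports `out_of_sync * 1024` as seen in **src/plugin/prometheus.rs**, and `oos_bytes_to_kib` is a hypothetical helper name:

```shell
# oos_bytes_to_kib: convert a drbd_peerdevice_outofsync_bytes sample back to
# the kernel's oos counter.
# Assumption: the exported value is out_of_sync * 1024.
oos_bytes_to_kib() {
    echo $(( $1 / 1024 ))
}

oos_bytes_to_kib 3891200   # prints 3800
```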
And finally grep for it throughout DRBD's debugfs tree:
```
for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} grep -nri 'oos:' /sys/kernel/debug/drbd/ | grep 3800; done | grep -v 'volume="0"} 0'
...
/sys/kernel/debug/drbd/resources/pm-f25a57b2/connections/phx1n3.hosting.namecheap.net/0/proc_drbd:2:    ns:2900 nr:0 dw:788936076 dr:1013249328 al:1597 bm:617725 lo:0 pe:[0;0] ua:0 ap:[0;0] ep:1 wo:2 oos:3800
...
```
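To compare the kernel-side value with the metric without reading the whole line, the `oos:` field can be extracted on its own. A sketch, assuming the field always appears as `oos:<digits>` like in the output above; `extract_oos` is a hypothetical helper name:

```shell
# extract_oos: read proc_drbd-style lines on stdin and print only the number
# that follows "oos:". Lines without an oos: field produce no output.
extract_oos() {
    sed -n 's/.*oos:\([0-9][0-9]*\).*/\1/p'
}
```

For example, `ssh linstor-node1 grep -r 'oos:' /sys/kernel/debug/drbd/resources/pm-f25a57b2/ | extract_oos`.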
Now we can see that the same **3800** value shows up for resource **pm-f25a57b2**. So why is the **oos** value still non-zero? Is it safe to leave it non-zero? Are the DRBD resources still in sync?
