drbd_peerdevice_outofsync_bytes(oos) metric never goes down to 0. #128

@megastallman

We are using the following Grafana board for DRBD-reactor monitoring: https://grafana.com/grafana/dashboards/14339-drbd/
Since we created our first Proxmox VMs on top of Linstor DRBD volumes, the drbd_peerdevice_outofsync_bytes metric has been greater than 0, which makes us concerned about possible data loss if any node crashes. The dashboard shows this metric in red, so it presumably indicates a problem. The actual values are between 3 and 60 megabytes. During volume resyncs it rises to as much as 500 megabytes, but then drops back to 3-60.

Meanwhile, the drbdadm status command shows that all resources are UpToDate on all nodes. Does that mean they are in sync? Probably yes, but...

While checking the DRBD-reactor source code, I noticed that the dashboard only hides drbd_peerdevice_outofsync_bytes when it equals 0; every expression filters on "> 0":

example/grafana-dashboard.json:404:          "expr": "sum(drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"}) by(name, instance) > 0",
example/grafana-dashboard.json:541:          "expr": "drbd_peerdevice_outofsync_bytes{instance=~\"$instance\"} > 0",
example/grafana-dashboard.json:552:          "expr": "drbd_peerdevice_outofsync_bytes > 0",

Also, in src/plugin/prometheus.rs we can see that the exported value is out_of_sync * 1024. So let's collect all non-zero drbd_peerdevice_outofsync_bytes results:

for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} curl -s http://127.0.0.1:9942/metrics | grep outofsync; done | grep -v 'volume="0"} 0'
...
drbd_peerdevice_outofsync_bytes{name="pm-f25a57b2",conn_name="phx1n3.hosting.namecheap.net",peer_node_id="4",volume="0"} 3891200
...

Get the original oos value:

echo '3891200 / 1024' | bc -l
3800.00000000000000000000
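The same conversion can be done for every non-zero sample at once. The following is only a sketch: nonzero_oos_kib is a hypothetical helper name, not part of drbd-reactor, and it assumes the metrics text looks like the sample line above (label values without spaces) and that the exported value is the kernel counter in KiB multiplied by 1024, as the code in src/plugin/prometheus.rs suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd-reactor): read Prometheus text
# exposition on stdin and print every non-zero
# drbd_peerdevice_outofsync_bytes sample with its value divided by 1024.
# Assumption: label values contain no spaces, as in the sample above.
nonzero_oos_kib() {
  awk '/^drbd_peerdevice_outofsync_bytes[{ ]/ && $NF + 0 > 0 {
         printf "%s %d KiB\n", $1, $NF / 1024
       }'
}

# Example with the sample line from this report:
sample='drbd_peerdevice_outofsync_bytes{name="pm-f25a57b2",conn_name="phx1n3.hosting.namecheap.net",peer_node_id="4",volume="0"} 3891200'
printf '%s\n' "$sample" | nonzero_oos_kib
```

On a node it could be fed from the same endpoint used above, e.g. curl -s http://127.0.0.1:9942/metrics | nonzero_oos_kib.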

And finally, grep for it throughout debugfs:

for i in $(seq 1 5); do echo ${i}; ssh linstor-node${i} grep -nri 'oos:' /sys/kernel/debug/drbd/ | grep 3800; done | grep -v 'volume="0"} 0'
...
/sys/kernel/debug/drbd/resources/pm-f25a57b2/connections/phx1n3.hosting.namecheap.net/0/proc_drbd:2:    ns:2900 nr:0 dw:788936076 dr:1013249328 al:1597 bm:617725 lo:0 pe:[0;0] ua:0 ap:[0;0] ep:1 wo:2 oos:3800
...
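To avoid matching 3800 by eye, the comparison between the debugfs counter and the exported metric can be scripted. This is a rough sketch under assumptions: extract_oos and oos_matches_metric are hypothetical names, the input is assumed to look like the proc_drbd line above, and the factor of 1024 is taken from the observation about src/plugin/prometheus.rs.

```shell
#!/bin/sh
# Hypothetical helper: print the number following "oos:" for each
# proc_drbd-style line on stdin (lines without an oos field print nothing).
extract_oos() {
  sed -n 's/.*[[:space:]]oos:\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if the metric value in bytes ($1)
# equals the debugfs oos counter ($2) multiplied by 1024.
oos_matches_metric() {
  [ "$1" -eq $(( $2 * 1024 )) ]
}

# Example with the values from this report:
line='    ns:2900 nr:0 dw:788936076 dr:1013249328 al:1597 bm:617725 lo:0 pe:[0;0] ua:0 ap:[0;0] ep:1 wo:2 oos:3800'
oos=$(printf '%s\n' "$line" | extract_oos)
if oos_matches_metric 3891200 "$oos"; then
  echo "metric matches debugfs oos: ${oos}"
fi
```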

Now we can see that the same value, 3800, shows up for resource pm-f25a57b2. So why is the oos value still non-zero? Is it safe for us to leave it non-zero? Are the DRBD resources still in sync?
