CA-391651: Fix spike in derived RRD metrics #6086

last-genius · 2024-10-28T13:10:28Z

xcp-rrdd used to ignore timestamps of measurement times provided by the RRDD plugins, and always calculated change in all values against a single timestamp captured at the time of reading on the server. This can potentially trigger a case where the time that passed between measurements is not equal to the time that passed between readings (or other measurements), and values would be calculated incorrectly. This can be triggered on a host with 100% CPU load, where cpu0-P0 derived metric, for example, would often be > 1.0.

This PR reworks xcp-rrdd and the plugin infrastructure to calculate values based on the actual length of time passed since the last value for this particular metric was recorded. This required extensive architectural changes, explained in detail in commits themselves - so best reviewed by commit.

I've tested this over the weekend, and haven't registered any spikes in metrics on a fully-loaded host. Will run BST+BVT as well, since RRDD is at the center of a lot of other things and is quite fragile.

This enables extending the type without causing performance issues, and should reduce the work for the garbage collector. Signed-off-by: Edwin Török <edwin.torok@cloud.com>

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

Instead of getting one timestamp for all collectors, get them closer to the actual measurement time. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

…them Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

Instead of timestamps taken at read time in xcp-rrdd, propagate timestamps taken by plugins (or collectors in xcp-rrdd itself) and use these when updating values. Also process datasources without flattening them all into one list. This allows to process datasources with the same timestamp (provided by the plugin or dom0 collector in xcp_rrdd) at once. Co-authored-by: Edwin Török <edwin.torok@cloud.com> Signed-off-by: Edwin Török <edwin.torok@cloud.com> Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

RRDs can have different owners, and new ones can show up without new domids being involved (SRs, VMs are also owner types). Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

ds_update* functions previously relied on being called for the entirety of datasources of a certain RRD. Instead, operate on chunks of datasources provided by the same plugin. These chunks are not contiguous (more detailed explanation in the comments in the code), so indices to rrd_dss and the associated structures need to be carried with each datasource. Also turns the value*transform tuple into a type for more explicit accesses. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius · 2024-10-28T13:27:06Z

NOTE: There's some additional complexity in converting these timestamps to use ptime and mtime for even greater precision, so I've opted to do this separately later on

ocaml/libs/xapi-rrd/unix/rrd_unix.ml

psafont · 2024-10-28T15:31:56Z

ocaml/xcp-rrdd/lib/transport/base/rrd_protocol_v2.ml


-  let timestamp cs = Cstruct.BE.get_uint64 cs timestamp_start
+  let timestamp cs =
+    Int64.float_of_bits (Cstruct.BE.get_uint64 cs timestamp_start)


Thinking about this... I don't think we can get away with the breaking change, there are python-based (GFS2) and C-based plugins (not sure which ones they are) :(

After a discussion with the storage team, the timestamp-responsible part is in rrdd.py that we control, we'd just need to get a script to restart their metrics collection after update, so this should be do-able

Looking at this later it looks like we actually broke the C plugins completely: xapi-project/rrd-client-lib#15

And storing a double in big-endian is very awkward, I'm not aware of any protocol that would do it that way...

See the linked PR on how complicated it is to actually fill in the right values: you have to get a double in C, take its bits as int64, then convert endianness (without truncating the microseconds).

Would've been a lot easier to just change this to read a double directly, but I guess we're stuck with the awkward protocol now :(

We could try to change the protocol again and drop the bigendian conversion, but we should first add some (XenRT) tests for this area to check we don't break something again.

For now I think we'll fix the C plugin to match the protocol expected here.

last-genius · 2024-10-29T08:21:31Z

BST+BVT has passed (XenRT run #207011), but there are necessary additions to be made, keeping as a draft. On the other hand, it's sad that at least some of the tests weren't exercising the plugins this must have broken by changing the protocol....

…and archiving This ensures that the interval between the current timestamp and previous update is calculated correctly. Keep the same external interface through rrdd_http_handler by adding an 'internal' parameter to xml exporting functions. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

Add an "unsafe" version of rrd_add_ds to avoid checking if the datasource already exists twice. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

…m explicitly If RRD's datasource disappears and is not reported by the plugin (or collected by xcp-rrdd), then it's not going to be refreshed, keeping its old value and timestamp. Previously, this was handled nicely because the whole RRD was updated at once, and updates were defaulted to VT_Unknown, Identity before being assigned actually collected values. After RRD collection was changed to process datasources in chunks per source, we need to explicitly find the ones that weren't updated on this iteration, and explicitly reset them. (This is not just a theoretical exercise, it can happen when a VIF gets unplugged or destroyed, for example, with its stats not being reset and continuing to display old values) Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

Instead of writing Int64 (truncated from timestamp floats) into the memory-mapped files, keep the precision of the timestamp. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius · 2024-10-29T14:33:03Z

Since I changed the RRDD plugin protocol, added changes to rrdd.py. Also made internal parameter explicit as requested.

After some investigation and a discussion, it was decided against moving to ptime/mtime altogether. This would be quite complex and would involve a lot of refactoring (again) for no immediate benefit, since calculations are done on floating point values.

I've also opened a PR for rrd-client-lib: xapi-project/rrd-client-lib#14, this needs to be coordinated all together.

RRDD, aside from the OCaml library, exposes two interfaces. This changes the Python one. The C one lives in rrd-client-lib and needs to be changed at the same time and coordinated during updating. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

psafont · 2024-11-06T11:04:01Z

ocaml/libs/xapi-rrd/lib/rrd.ml

-      tag "lastupdate"
-        (data (Printf.sprintf "%Ld" (Int64.of_float rrd.last_updated)))
-        output ;
+      tag "lastupdate" (data (Utils.f_to_s rrd.last_updated)) output ;


I don't think this change is explained in the commit message, could you add that this adds decimals to the output format (and this is why the external interface needs to be maintained, which is mentioned in the commit message)

psafont · 2024-11-06T11:17:53Z

I think this is a very good advancement in fixing the metrics here. I have two remarks

I doubt this can be backported as-is, it's quite substantial. We will need to make a decision to try to create a minimal change to backport this, have an alternative approach that somewhat works around this issue, or leave it unfixed.
I'd like to see the changes on the .spec repos that minimize breakage for the non-ocaml plugins

lindig · 2024-11-06T14:57:17Z

I would not worry about backports. We lived with the problematic behavior for so long that that I don't think a risky backport can be justified just to get better stats.

robhoes

This is an impressive set of changes. I've reviewed commit-by-commit and every step makes sense (though I haven't checked all code in detail). I assume it is the case that RRD plugins do not need any code change, as long as they are using the updated library (be it OCaml, python or C). There are also no changed external interfaces.

The main upgrade case is that all plugins need to be restarted as soon as xcp-rrdd has restarted to the new version (on a host). Before a restart, xcp-rrdd will interpret int64s as doubles, which means that it won't necessarily fail, but all data would be completely wrong? Then to upgrade, we need to 1) stop xcp-rrdd, 2) install the new xcp-rrdd and libraries, 3) restart all plugins and 4) start xcp-rrdd. Correct?

lindig · 2024-11-08T09:09:31Z

This set of changes highlights the difficult to advance the RRD protocol because it is not properly versioned, which is a birth defect and unrelated to these changes.

last-genius · 2024-11-08T09:32:22Z

The main upgrade case is that all plugins need to be restarted as soon as xcp-rrdd has restarted to the new version (on a host). Before a restart, xcp-rrdd will interpret int64s as doubles, which means that it won't necessarily fail, but all data would be completely wrong? Then to upgrade, we need to 1) stop xcp-rrdd, 2) install the new xcp-rrdd and libraries, 3) restart all plugins and 4) start xcp-rrdd. Correct?

Yes!
Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change. If a new xcp-rrdd reads old plugins, then yes, the data will be slightly wrong. I've opened a draft PR in xapi.spec that modifies the update guidance as suggested here.

robhoes · 2024-11-08T10:25:54Z

Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change.

Ah right, that simplifies the story a little, as we can update and restart plugins first.

psafont · 2024-11-08T10:25:57Z

Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change.

Because of this, plugins can be updated before xcp-rrdd is without causing any breakage

last-genius · 2024-11-19T11:03:30Z

I think this should be good to go, see the open xapi.spec PR for how the updates are handled.

last-genius · 2024-12-04T10:26:07Z

xapi.spec PR for restarting the dependent plugins was merged, so merging this as well 🤞

The PR xapi-project/xen-api#6086 changed the timestamp of rrd data from `int64` to `double`. As a result, the timestamp in RRD messages sent from `pvsproxy` to `xcp-rrdd` is inconsistent between the two formats, causing parsing failures and resulting in missing data. Another PR xapi-project#14 attempted to address the issue by introducing a new `get_timestamp` function that generates the timestamp as a `double`. However, applying `htonll` to this value unintentionally converted the double-precision timestamp to an `int64_t`, preserving only the integer part and discarding the fraction. Since `xcp-rrdd` requires the timestamp in IEEE 754 double format, this `int64_t` value could not be parsed correctly. The solution is to returns the timestamp as an `uint64_t` containing the IEEE 754 bit pattern of the double value. Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>

The PR xapi-project/xen-api#6086 changed the timestamp of rrd data from `int64` to `double`. As a result, the timestamp in RRD messages sent from `pvsproxy` to `xcp-rrdd` is inconsistent between the two formats, causing parsing failures and resulting in missing data. Another PR xapi-project#14 attempted to address the issue by introducing a new get_timestamp function that generates the timestamp as a double. However, applying htonll to this value unintentionally causes implicit conversion from the double-precision to int64_t, which preserves only the integer part and discards the fraction. Since xcp-rrdd requires the timestamp in IEEE 754 double format, this int64_t value could not be parsed correctly. The solution is to returns the timestamp as an `uint64_t` containing the IEEE 754 bit pattern of the double value. Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>

edwintorok and others added 7 commits October 23, 2024 14:47

rrdd: avoid constructing intermediate lists, use Seq

f0b6322

This enables extending the type without causing performance issues, and should reduce the work for the garbage collector. Signed-off-by: Edwin Török <edwin.torok@cloud.com>

CA-391651 - rrd: Remove deprecated member of rra struct

8b2b2d2

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

CA-391651: Make timestamps of data collectors in xcp-rrdd independent

cacf52a

Instead of getting one timestamp for all collectors, get them closer to the actual measurement time. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

CA-391651: rrdd_server - read plugins' timestamps, don't just ignore …

a7bc62d

…them Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

CA-391651: Rename 'new_domid' parameter to 'new_rrd'

0a6daf2

RRDs can have different owners, and new ones can show up without new domids being involved (SRs, VMs are also owner types). Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius force-pushed the private/asultanov/rrd-timestamp branch from b292aca to fc8b081 Compare October 28, 2024 13:12

psafont reviewed Oct 28, 2024

View reviewed changes

ocaml/libs/xapi-rrd/unix/rrd_unix.ml Outdated Show resolved Hide resolved

psafont reviewed Oct 28, 2024

View reviewed changes

Andrii Sultanov added 4 commits October 29, 2024 14:23

CA-391651: rrd - don't iterate over lists needlessly

7052ddd

Add an "unsafe" version of rrd_add_ds to avoid checking if the datasource already exists twice. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

CA-391651 - rrd protocol: Stop truncating timestamps to seconds

8d7c057

Instead of writing Int64 (truncated from timestamp floats) into the memory-mapped files, keep the precision of the timestamp. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius force-pushed the private/asultanov/rrd-timestamp branch 2 times, most recently from 7203fe7 to b5aaf44 Compare October 29, 2024 14:32

Andrii Sultanov added 3 commits October 29, 2024 14:35

CA-391651 rrd: Don't truncate timestamps when calculating values

b3da1c9

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

CA-391651: Update RRD tests to the new interfaces

76e317c

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius force-pushed the private/asultanov/rrd-timestamp branch from b5aaf44 to 76e317c Compare October 29, 2024 14:35

CA-391651 - docs: Update RRD design pages

1d7fe35

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>

last-genius marked this pull request as ready for review October 29, 2024 15:14

psafont reviewed Nov 6, 2024

View reviewed changes

robhoes approved these changes Nov 7, 2024

View reviewed changes

lindig approved these changes Nov 11, 2024

View reviewed changes

last-genius added this pull request to the merge queue Dec 4, 2024

Merged via the queue into xapi-project:master with commit 8971c8d Dec 4, 2024
15 checks passed

BengangY mentioned this pull request Aug 1, 2025

CA-414341: PVS RRD graph does not show xapi-project/rrd-client-lib#15

Merged

CA-391651: Fix spike in derived RRD metrics #6086

CA-391651: Fix spike in derived RRD metrics #6086

Uh oh!

Conversation

last-genius commented Oct 28, 2024

Uh oh!

last-genius commented Oct 28, 2024

Uh oh!

Uh oh!

psafont Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

last-genius Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

edwintorok Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

edwintorok Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

last-genius commented Oct 29, 2024

Uh oh!

last-genius commented Oct 29, 2024

Uh oh!

psafont Nov 6, 2024

Choose a reason for hiding this comment

Uh oh!

psafont commented Nov 6, 2024

Uh oh!

lindig commented Nov 6, 2024

Uh oh!

robhoes left a comment

Choose a reason for hiding this comment

Uh oh!

lindig commented Nov 8, 2024

Uh oh!

last-genius commented Nov 8, 2024

Uh oh!

robhoes commented Nov 8, 2024

Uh oh!

psafont commented Nov 8, 2024

Uh oh!

last-genius commented Nov 19, 2024

Uh oh!

last-genius commented Dec 4, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants