Skip to content

Conversation

last-genius
Copy link
Contributor

xcp-rrdd used to ignore timestamps of measurement times provided by the RRDD plugins, and always calculated change in all values against a single timestamp captured at the time of reading on the server. This can potentially trigger a case where the time that passed between measurements is not equal to the time that passed between readings (or other measurements), and values would be calculated incorrectly. This can be triggered on a host with 100% CPU load, where cpu0-P0 derived metric, for example, would often be > 1.0.

This PR reworks xcp-rrdd and the plugin infrastructure to calculate values based on the actual length of time passed since the last value for this particular metric was recorded. This required extensive architectural changes, explained in detail in commits themselves - so best reviewed by commit.

I've tested this over the weekend, and haven't registered any spikes in metrics on a fully-loaded host. Will run BST+BVT as well, since RRDD is at the center of a lot of other things and is quite fragile.

edwintorok and others added 7 commits October 23, 2024 14:47
This enables extending the type without causing performance issues,
and should reduce the work for the garbage collector.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Instead of getting one timestamp for all collectors, get them closer to the
actual measurement time.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
…them

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Instead of timestamps taken at read time in xcp-rrdd, propagate timestamps
taken by plugins (or collectors in xcp-rrdd itself) and use these when
updating values.

Also process datasources without flattening them all into one list.

This allows to process datasources with the same timestamp (provided by the
plugin or dom0 collector in xcp_rrdd) at once.

Co-authored-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
RRDs can have different owners, and new ones can show up without new domids
being involved (SRs, VMs are also owner types).

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
ds_update* functions previously relied on being called for the entirety of
datasources of a certain RRD. Instead, operate on chunks of datasources
provided by the same plugin. These chunks are not contiguous (more detailed
explanation in the comments in the code), so indices to rrd_dss and the
associated structures need to be carried with each datasource.

Also turns the value*transform tuple into a type for more explicit accesses.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
@last-genius last-genius force-pushed the private/asultanov/rrd-timestamp branch from b292aca to fc8b081 Compare October 28, 2024 13:12
@last-genius
Copy link
Contributor Author

NOTE: There's some additional complexity in converting these timestamps to use ptime and mtime for even greater precision, so I've opted to do this separately later on


let timestamp cs = Cstruct.BE.get_uint64 cs timestamp_start
let timestamp cs =
Int64.float_of_bits (Cstruct.BE.get_uint64 cs timestamp_start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this... I don't think we can get away with the breaking change, there are python-based (GFS2) and C-based plugins (not sure which ones they are) :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a discussion with the storage team, the timestamp-responsible part is in rrdd.py that we control, we'd just need to get a script to restart their metrics collection after update, so this should be do-able

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this later it looks like we actually broke the C plugins completely: xapi-project/rrd-client-lib#15

And storing a double in big-endian is very awkward, I'm not aware of any protocol that would do it that way...

See the linked PR on how complicated it is to actually fill in the right values: you have to get a double in C, take its bits as int64, then convert endianness (without truncating the microseconds).

Would've been a lot easier to just change this to read a double directly, but I guess we're stuck with the awkward protocol now :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could try to change the protocol again and drop the bigendian conversion, but we should first add some (XenRT) tests for this area to check we don't break something again.

For now I think we'll fix the C plugin to match the protocol expected here.

@last-genius
Copy link
Contributor Author

BST+BVT has passed (XenRT run #207011), but there are necessary additions to be made, keeping as a draft. On the other hand, it's sad that at least some of the tests weren't exercising the plugins this must have broken by changing the protocol....

Andrii Sultanov added 4 commits October 29, 2024 14:23
…and archiving

This ensures that the interval between the current timestamp and previous
update is calculated correctly.

Keep the same external interface through rrdd_http_handler by adding an
'internal' parameter to xml exporting functions.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Add an "unsafe" version of rrd_add_ds to avoid checking if the datasource
already exists twice.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
…m explicitly

If RRD's datasource disappears and is not reported by the plugin (or collected
by xcp-rrdd), then it's not going to be refreshed, keeping its old value and
timestamp. Previously, this was handled nicely because the whole RRD was
updated at once, and updates were defaulted to VT_Unknown, Identity before
being assigned actually collected values. After RRD collection was changed to
process datasources in chunks per source, we need to explicitly find the ones
that weren't updated on this iteration, and explicitly reset them.

(This is not just a theoretical exercise, it can happen when a VIF gets
unplugged or destroyed, for example, with its stats not being reset and
continuing to display old values)

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Instead of writing Int64 (truncated from timestamp floats) into the
memory-mapped files, keep the precision of the timestamp.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
@last-genius last-genius force-pushed the private/asultanov/rrd-timestamp branch 2 times, most recently from 7203fe7 to b5aaf44 Compare October 29, 2024 14:32
@last-genius
Copy link
Contributor Author

Since I changed the RRDD plugin protocol, added changes to rrdd.py. Also made internal parameter explicit as requested.

After some investigation and a discussion, it was decided against moving to ptime/mtime altogether. This would be quite complex and would involve a lot of refactoring (again) for no immediate benefit, since calculations are done on floating point values.

I've also opened a PR for rrd-client-lib: xapi-project/rrd-client-lib#14, this needs to be coordinated all together.

Andrii Sultanov added 3 commits October 29, 2024 14:35
RRDD, aside from the OCaml library, exposes two interfaces. This changes the
Python one. The C one lives in rrd-client-lib and needs to be changed at the
same time and coordinated during updating.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
@last-genius last-genius force-pushed the private/asultanov/rrd-timestamp branch from b5aaf44 to 76e317c Compare October 29, 2024 14:35
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
@last-genius last-genius marked this pull request as ready for review October 29, 2024 15:14
tag "lastupdate"
(data (Printf.sprintf "%Ld" (Int64.of_float rrd.last_updated)))
output ;
tag "lastupdate" (data (Utils.f_to_s rrd.last_updated)) output ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is explained in the commit message, could you add that this adds decimals to the output format (and this is why the external interface needs to be maintained, which is mentioned in the commit message)

@psafont
Copy link
Member

psafont commented Nov 6, 2024

I think this is a very good advancement in fixing the metrics here. I have two remarks

  • I doubt this can be backported as-is, it's quite substantial. We will need to make a decision to try to create a minimal change to backport this, have an alternative approach that somewhat works around this issue, or leave it unfixed.
  • I'd like to see the changes on the .spec repos that minimize breakage for the non-ocaml plugins

@lindig
Copy link
Contributor

lindig commented Nov 6, 2024

I would not worry about backports. We lived with the problematic behavior for so long that that I don't think a risky backport can be justified just to get better stats.

Copy link
Member

@robhoes robhoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an impressive set of changes. I've reviewed commit-by-commit and every step makes sense (though I haven't checked all code in detail). I assume it is the case that RRD plugins do not need any code change, as long as they are using the updated library (be it OCaml, python or C). There are also no changed external interfaces.

The main upgrade case is that all plugins need to be restarted as soon as xcp-rrdd has restarted to the new version (on a host). Before a restart, xcp-rrdd will interpret int64s as doubles, which means that it won't necessarily fail, but all data would be completely wrong? Then to upgrade, we need to 1) stop xcp-rrdd, 2) install the new xcp-rrdd and libraries, 3) restart all plugins and 4) start xcp-rrdd. Correct?

@lindig
Copy link
Contributor

lindig commented Nov 8, 2024

This set of changes highlights the difficult to advance the RRD protocol because it is not properly versioned, which is a birth defect and unrelated to these changes.

@last-genius
Copy link
Contributor Author

The main upgrade case is that all plugins need to be restarted as soon as xcp-rrdd has restarted to the new version (on a host). Before a restart, xcp-rrdd will interpret int64s as doubles, which means that it won't necessarily fail, but all data would be completely wrong? Then to upgrade, we need to 1) stop xcp-rrdd, 2) install the new xcp-rrdd and libraries, 3) restart all plugins and 4) start xcp-rrdd. Correct?

Yes!
Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change. If a new xcp-rrdd reads old plugins, then yes, the data will be slightly wrong. I've opened a draft PR in xapi.spec that modifies the update guidance as suggested here.

@robhoes
Copy link
Member

robhoes commented Nov 8, 2024

Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change.

Ah right, that simplifies the story a little, as we can update and restart plugins first.

@psafont
Copy link
Member

psafont commented Nov 8, 2024

Before these changes, xcp-rrdd ignored timestamps provided by the plugins altogether, so if an old xcp-rrdd reads new plugins, nothing should break or change.

Because of this, plugins can be updated before xcp-rrdd is without causing any breakage

@last-genius
Copy link
Contributor Author

I think this should be good to go, see the open xapi.spec PR for how the updates are handled.

@last-genius last-genius added this pull request to the merge queue Dec 4, 2024
@last-genius
Copy link
Contributor Author

xapi.spec PR for restarting the dependent plugins was merged, so merging this as well 🤞

Merged via the queue into xapi-project:master with commit 8971c8d Dec 4, 2024
15 checks passed
BengangY added a commit to BengangY/rrd-client-lib that referenced this pull request Aug 1, 2025
The PR xapi-project/xen-api#6086 changed the
timestamp of rrd data from `int64` to `double`. As a result, the timestamp
in RRD messages sent from `pvsproxy` to `xcp-rrdd` is inconsistent between
the two formats, causing parsing failures and resulting in missing data.

Another PR xapi-project#14
attempted to address the issue by introducing a new `get_timestamp`
function that generates the timestamp as a `double`. However, applying
`htonll` to this value unintentionally converted the double-precision
timestamp to an `int64_t`, preserving only the integer part and discarding
the fraction. Since `xcp-rrdd` requires the timestamp in IEEE 754 double
format, this `int64_t` value could not be parsed correctly.

The solution is to returns the timestamp as an `uint64_t` containing the
IEEE 754 bit pattern of the double value.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
BengangY added a commit to BengangY/rrd-client-lib that referenced this pull request Aug 5, 2025
The PR xapi-project/xen-api#6086 changed the
timestamp of rrd data from `int64` to `double`. As a result, the timestamp
in RRD messages sent from `pvsproxy` to `xcp-rrdd` is inconsistent between
the two formats, causing parsing failures and resulting in missing data.

Another PR xapi-project#14
attempted to address the issue by introducing a new `get_timestamp`
function that generates the timestamp as a `double`. However, applying
`htonll` to this value unintentionally converted the double-precision
timestamp to an `int64_t`, preserving only the integer part and discarding
the fraction. Since `xcp-rrdd` requires the timestamp in IEEE 754 double
format, this `int64_t` value could not be parsed correctly.

The solution is to returns the timestamp as an `uint64_t` containing the
IEEE 754 bit pattern of the double value.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
BengangY added a commit to BengangY/rrd-client-lib that referenced this pull request Aug 5, 2025
The PR xapi-project/xen-api#6086 changed the
timestamp of rrd data from `int64` to `double`. As a result, the timestamp
in RRD messages sent from `pvsproxy` to `xcp-rrdd` is inconsistent between
the two formats, causing parsing failures and resulting in missing data.

Another PR xapi-project#14
attempted to address the issue by introducing a new get_timestamp function
that generates the timestamp as a double. However, applying htonll to this
value unintentionally causes implicit conversion from the double-precision
to int64_t, which preserves only the integer part and discards the
fraction. Since xcp-rrdd requires the timestamp in IEEE 754 double format,
this int64_t value could not be parsed correctly.

The solution is to returns the timestamp as an `uint64_t` containing the
IEEE 754 bit pattern of the double value.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants