Description
Was getting some clarification from @bnaecker about metrics types and he mentioned this issue with cumulative metrics:
> `timestamp` exists for every `Measurement`, and it's the time the sample was taken. `start_time` exists for `Cumulative` and `Histogram` types, and is the zero-point or reference.
>
> One thing to keep in mind is that the same timeseries can have measurements with different `start_time`s. E.g., if a service restarts, that `start_time` will be reset. I don't know if that matters much now, but it will at some point.
From an end-user point of view, a cumulative metric that resets in this way is wrong, or at least misleading: it looks like a sawtooth when it should be a monotonically increasing line. One can imagine various ways of fixing this:
- In the database
- In Nexus
- On the client
The latter two options are not fully general because, in order to get the correct offset for data after restart `N`, you need to know the last data point before each of restarts `1..N` so you can add them all up. That means that if you're looking at data in a certain window of time, you also need to pull data from outside that window to do the correction. For this reason, the database solution is likely best. It would avoid post-hoc corrections: fetching data in a given window would simply give you the right data for that window.
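To make the offset problem concrete, here is a minimal sketch of the correction in Python. The `Sample` type and the `stitch` function are hypothetical names made up for illustration (only `timestamp` and `start_time` come from the description above), and this is not how the database, Nexus, or any client currently works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    """Hypothetical cumulative sample, not a real type from the codebase."""
    start_time: datetime  # zero-point of the counter; changes on restart
    timestamp: datetime   # when the sample was taken
    value: int            # count accumulated since start_time


def stitch(samples):
    """Return (timestamp, corrected_value) pairs ordered by timestamp.

    Each time start_time changes, the last raw value seen under the
    previous start_time is added to a running offset.
    """
    out = []
    offset = 0
    prev = None
    for s in sorted(samples, key=lambda s: s.timestamp):
        if prev is not None and s.start_time != prev.start_time:
            offset += prev.value  # carry over the total from before the restart
        out.append((s.timestamp, offset + s.value))
        prev = s
    return out


t0 = datetime(2023, 1, 1)


def at(minutes):
    return t0 + timedelta(minutes=minutes)


samples = [
    Sample(at(0), at(1), 5),
    Sample(at(0), at(2), 9),
    Sample(at(3), at(4), 2),  # first restart: start_time changed
    Sample(at(3), at(5), 6),
    Sample(at(6), at(7), 1),  # second restart
]

full = [v for _, v in stitch(samples)]        # [5, 9, 11, 15, 16]
window = [v for _, v in stitch(samples[2:])]  # [2, 6, 7]
```

Stitching only the samples inside the window gives `[2, 6, 7]`, while the same three points in the full series are `[11, 15, 16]`. The difference is the last value from before the first restart, which lies outside the window. That is the extra data a Nexus-side or client-side fix would have to fetch.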
The only downsides I can think of to the DB approach are:
- We have to do some work we haven't already done (obviously)
- The telemetry data stored in the DB would lose some information, namely the fact that these restarts in data collection occurred. In my view, however, if this information is important to keep around, a cumulative metric is not the right place to store it.