Idiamond/prometheus and statsd #113
Conversation
Force-pushed from a4088c0 to b3301d7
@idiamond-stripe wow, thanks, this looks great. I'm going to focus on trying to get #109 merged and then review this with a little more time. It all looks good. The only thing I'd like to check is the names of metrics that are published/tracked. Is there any chance you could add a
Thanks- I've added a few comments throughout. I quite like the model of defining a bunch of vars for the various metrics, makes it easier to see a list of everything that's tracked.
Biggest issue is likely to be the removal of the telemetry decorator type, but hopefully that's either easily replaced with gRPC middleware or can be added back.
credentialTimings := metrics.GetOrRegisterTimer("credentialsHandler", metrics.DefaultRegistry)
startTime := time.Now()
defer credentialTimings.UpdateSince(startTime)
timer := prometheus.NewTimer(handlerTimer.WithLabelValues("credentials"))
I'm not super familiar with the Prometheus client library: is it expected to create timers inline with function calls, or should the timer be constructed and stored as a field on the credentialsHandler struct?
I guess this is also true for the statsd timer below
It's fine to just create a timer inline. A timer
in the Prometheus library is just a single timing for a function call, so it makes sense to defer
the completion of it until the end of the fn.
pkg/server/telemetry.go
Outdated
timer := metrics.GetOrRegisterTimer("client.GetPodRole", metrics.DefaultRegistry)
startTime := time.Now()
defer timer.UpdateSince(startTime)
timer := prometheus.NewTimer(rpcTimer.With(prometheus.Labels{"rpc": "GetPodRole", "type": "client"}))
Really sorry but in #109 I removed the telemetry client/server interfaces. Mostly that was to make it easier to move stuff but it always niggled me that it wasn't the right/best way to do this.
If there's a more idiomatic way that'd be great to incorporate- perhaps middleware?
I've created another issue #114 also to track this and make sure it's done pre-v3 release.
I've now merged the mega #109 PR which changed a bunch of stuff around. Unfortunately that breaks a bunch of the changes in this PR. I'll see if I can help work through the pain but not sure how much time I'll get over the next few days. Really sorry to create more work for you @idiamond-stripe so will try and help as much as I can. The other changes should still stand but potentially just need moving to different files.
@Sambooo is there any chance this is something you could help go through and update to bring in line with the new handler and testing stuff we've done please?
@pingles, sorry, I haven't had a chance to work on this more. I'll set
aside some time this week to try and rebase it on top of your changes :)
…On Mon, Jul 16, 2018 at 8:50 AM Paul Ingles wrote:
That'd be awesome, thanks @idiamond-stripe. @Sambooo hold off on picking this up for now then please
Force-pushed from f90cd84 to 3907edf
- Replace go-metrics Prometheus metrics with the Prometheus Go client library
- Replace go-metrics statsd with a client that does not perform in-memory aggregation
- Add additional logging as to the duration of requests
Force-pushed from 3907edf to 3d120a2
Alright @pingles, rebased on top of your changes and went ahead and added a
That wasn't too bad, and hopefully everything here makes sense. Wasn't quite sure how to discuss the RPC metrics in the documentation, so if you have any insight that would be great!
Thanks so much for your hard work to rebase. I've left a few comments inline; some are questions more than suggested changes :)
Really appreciate all this!
### Prometheus
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
Should this be renamed to credential_fetch_errors_total to indicate it's an accumulating counter?
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
- `credential_encode_error` - Number of errors encoding credentials for a pod
Same as above, should this be credential_encode_errors_total? Or is it not an accumulating counter?
prometheus.CounterOpts{
	Namespace: "kiam",
	Subsystem: "sts",
	Name:      "error_issuing_count",
To match the other metrics could we rename to issuing_errors or issuing_errors_total please?
- `cache_miss_total` - Number of cache misses to the metadata cache
- `error_issuing_count` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assume_role_executing_total` - Number of assume role calls currently executing
I'm not sure whether we should use the _total suffix for a gauge? Maybe rename to assumerole_current or assumerole_current_operations?
I'm going to close this PR in favour of #131 which picks up your contribution with a few changes. Hope that's ok. Thanks again for all your hard work @idiamond-stripe!
This change replaces go-metrics with the standard Prometheus library and emits StatsD timing metrics for each request, rather than aggregating them in memory.

This PR creates metrics.go files at the package level for each package it monitors and then emits metrics in the centralized prometheus package. This also creates a statsd package that can emit metrics to a StatsD sink on demand.

Avoiding in-memory metric aggregation means aggregation can be performed in the metrics engine, like SignalFx or Datadog or Graphite. This allows users to more easily monitor the p99s as they'd like and determine when individual requests endure latency spikes, rather than simply seeing the view in aggregate.

go-metrics metrics that way

Resolves #35