This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Idiamond/prometheus and statsd #113

Conversation

@idiamond-stripe (Contributor) commented Jul 7, 2018

This change replaces go-metrics with the standard Prometheus client library and emits StatsD timing metrics for each request, rather than aggregating them in memory.

This PR creates metrics.go files at the package level for each package it monitors and then emits metrics via the centralized prometheus package. It also creates a statsd package that can emit metrics to a StatsD sink on demand.

Avoiding in-memory metric aggregation means aggregation can be performed in a metrics backend such as SignalFx, Datadog, or Graphite. This allows users to more easily monitor p99s as they'd like and to see when individual requests hit latency spikes, rather than only seeing an aggregate view.
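For illustration, here's a minimal sketch of what emitting a single unaggregated per-request timing looks like over the StatsD line protocol. The names `recordTiming` and `statsdAddr` are illustrative, not the actual API in this PR:

```go
package statsd

import (
	"fmt"
	"net"
	"time"
)

// recordTiming sends one timing sample in StatsD line format ("name:value|ms")
// over UDP, leaving all aggregation to the metrics backend.
func recordTiming(statsdAddr, name string, d time.Duration) error {
	conn, err := net.Dial("udp", statsdAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "%s:%d|ms", name, int64(d/time.Millisecond))
	return err
}
```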

  • Replace go-metrics Prometheus metrics with the Prometheus go client library and emit all previous go-metrics metrics that way
  • Replace go-metrics StatsD with a client that does not perform in-memory aggregation and emit timing metrics that way
  • Add additional logging of request durations

Resolves #35

@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from a4088c0 to b3301d7 on July 7, 2018 04:04
@pingles (Contributor) commented Jul 9, 2018

@idiamond-stripe wow, thanks- this looks great.

I'm going to focus on trying to get #109 merged and then review this with a little more time. It all looks good- only thing I'd like to check is the names of metrics that are published/tracked.

Is there any chance you could add a doc/telemetry.md file that has a table of the different metric names and what they measure please? Would be super useful- both for this and afterwards for people operating kiam.

@pingles (Contributor) left a comment

Thanks- I've added a few comments throughout. I quite like the model of defining a bunch of vars for the various metrics; it makes it easier to see a list of everything that's tracked.

The biggest issue is likely to be the removal of the telemetry decorator type, but hopefully that's either easily replaced with gRPC middleware or can be added back.

credentialTimings := metrics.GetOrRegisterTimer("credentialsHandler", metrics.DefaultRegistry)
startTime := time.Now()
defer credentialTimings.UpdateSince(startTime)
timer := prometheus.NewTimer(handlerTimer.WithLabelValues("credentials"))
Contributor

I'm not super familiar with the prometheus client library- is it expected to create timers inline with function calls or should the timer be constructed and stored as a field on the credentialsHandler struct?

I guess this is also true for the statsd timer below

Contributor Author

It's fine to just create a timer inline. A timer in the Prometheus library is just a single timing for a function call, so it makes sense to defer the completion of it until the end of the fn.
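For reference, a minimal sketch of that inline pattern, assuming a package-level HistogramVec; the handler struct, metric name, and handler signature here are illustrative rather than the PR's exact code:

```go
package metadata

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// handlerTimer is defined and registered once at the package level;
// each call then observes a single duration against it.
var handlerTimer = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "kiam",
	Subsystem: "metadata",
	Name:      "handler_duration_seconds",
	Help:      "Bucketed histogram of handler timings",
}, []string{"handler"})

func init() {
	prometheus.MustRegister(handlerTimer)
}

type credentialsHandler struct{}

func (h *credentialsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Created inline: the timer just captures a start time and an Observer,
	// and the deferred ObserveDuration records the elapsed time (in seconds)
	// when the handler returns.
	timer := prometheus.NewTimer(handlerTimer.WithLabelValues("credentials"))
	defer timer.ObserveDuration()
	// ... handle the request ...
}
```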

timer := metrics.GetOrRegisterTimer("client.GetPodRole", metrics.DefaultRegistry)
startTime := time.Now()
defer timer.UpdateSince(startTime)
timer := prometheus.NewTimer(rpcTimer.With(prometheus.Labels{"rpc": "GetPodRole", "type": "client"}))
Contributor

Really sorry but in #109 I removed the telemetry client/server interfaces. Mostly that was to make it easier to move stuff but it always niggled me that it wasn't the right/best way to do this.

If there's a more idiomatic way that'd be great to incorporate- perhaps middleware?

I've created another issue #114 also to track this and make sure it's done pre-v3 release.

@pingles (Contributor) commented Jul 12, 2018

I've now merged the mega #109 PR, which changed a bunch of stuff around. Unfortunately that breaks a bunch of the changes in this PR. I'll see if I can help work through the pain, but I'm not sure how much time I'll get over the next few days. Really sorry to create more work for you @idiamond-stripe, so I'll try and help as much as I can.

The pkg/server/telemetry stuff I think could be replaced with #114 via some gRPC middleware rather than wrapping interfaces.

Other changes should still stand but potentially just need moving to different files.
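As a rough sketch of what the gRPC-middleware approach mentioned above could look like (the interceptor name and metric wiring are hypothetical, not code from #114), a unary server interceptor can record one timing per RPC instead of wrapping the server interfaces:

```go
package server

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// timingInterceptor records one histogram observation per RPC, labelled by
// the full gRPC method name, without decorating the server implementation.
func timingInterceptor(rpcTimer *prometheus.HistogramVec) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		timer := prometheus.NewTimer(rpcTimer.WithLabelValues(info.FullMethod))
		defer timer.ObserveDuration()
		return handler(ctx, req)
	}
}
```

Wiring it in would then just be a matter of passing grpc.UnaryInterceptor(timingInterceptor(rpcTimer)) when constructing the gRPC server.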

@pingles (Contributor) commented Jul 16, 2018

@Sambooo is there any chance this is something you could help go through and update to bring in line with the new handler and testing stuff we've done please?

@idiamond-stripe (Contributor, Author) commented Jul 16, 2018 via email

@pingles (Contributor) commented Jul 16, 2018

That'd be awesome, thanks @idiamond-stripe.

@Sambooo hold off on picking this up for now then please

@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from f90cd84 to 3907edf on July 21, 2018 20:17
- Replace go-metrics Prometheus metrics with the Prometheus go client library
- Replace go-metrics statsd with a client that does not perform in-memory aggregation
- Add additional logging as to the duration of requests
@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from 3907edf to 3d120a2 on July 21, 2018 20:40
@idiamond-stripe (Contributor, Author) commented

Alright @pingles, rebased on top of your changes and went ahead and added a docs/metrics.md as you suggested. :)

That wasn't too bad, and hopefully everything here makes sense. I wasn't quite sure how to discuss the RPC metrics in the documentation, so if you have any insight that would be great!

@pingles (Contributor) left a comment

Thanks so much for your hard work on the rebase. I've left a few comments inline; some are questions more than suggested changes :)

Really appreciate all this!

### Prometheus
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
Contributor

Should this be renamed to credential_fetch_errors_total to indicate it's an accumulating counter?

#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
- `credential_encode_error` - Number of errors encoding credentials for a pod
Contributor

Same as above, should this be credential_encode_errors_total? Or is it not an accumulating counter?

prometheus.CounterOpts{
    Namespace: "kiam",
    Subsystem: "sts",
    Name:      "error_issuing_count",
Contributor

To match the other metrics could we rename to issuing_errors or issuing_errors_total please?

- `cache_miss_total` - Number of cache misses to the metadata cache
- `error_issuing_count` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assume_role_executing_total` - Number of assume role calls currently executing
Contributor

I'm not sure whether we should use the _total suffix for a gauge? Maybe rename to assumerole_current or assumerole_current_operations?
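For context, a small sketch of the naming convention being suggested in these comments: counters that only accumulate take a `_total` suffix, while a gauge tracking in-flight work does not. The variable names and Help strings below are illustrative, not the PR's exact definitions:

```go
package sts

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter: only ever increases, so the _total suffix fits.
	issuingErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "kiam",
		Subsystem: "sts",
		Name:      "issuing_errors_total",
		Help:      "Number of errors issuing credentials",
	})

	// Gauge: rises and falls with in-flight calls, so no _total suffix.
	assumeRoleExecuting = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "kiam",
		Subsystem: "sts",
		Name:      "assumerole_current",
		Help:      "Number of assume role calls currently executing",
	})
)

func init() {
	prometheus.MustRegister(issuingErrors, assumeRoleExecuting)
}
```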

@pingles (Contributor) commented Aug 2, 2018

I'm going to close this PR in favour of #131 which picks up your contribution with a few changes. Hope that's ok. Thanks again for all your hard work @idiamond-stripe!

@pingles closed this Aug 2, 2018