This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Idiamond/prometheus and statsd #113

Conversation

@idiamond-stripe (Contributor) commented Jul 7, 2018

This change replaces go-metrics with the standard Prometheus client library and emits StatsD timing metrics for each request, rather than aggregating them in memory.

This PR creates metrics.go files at the package level for each package it monitors and then emits metrics via the centralized prometheus package. It also creates a statsd package that can emit metrics to a StatsD sink on demand.

Avoiding in-memory metric aggregation means aggregation can be performed in a metrics backend such as SignalFx, Datadog, or Graphite. This allows users to more easily monitor p99s as they'd like and to see when individual requests hit latency spikes, rather than only seeing an aggregate view.
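For illustration, here's a minimal sketch of what emitting a single unaggregated per-request timing looks like over the StatsD line protocol. The names `recordTiming` and `statsdAddr` are illustrative, not the actual API in this PR:

```go
package statsd

import (
	"fmt"
	"net"
	"time"
)

// recordTiming sends one timing sample in StatsD line format ("name:value|ms")
// over UDP, leaving all aggregation to the metrics backend.
func recordTiming(statsdAddr, name string, d time.Duration) error {
	conn, err := net.Dial("udp", statsdAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "%s:%d|ms", name, int64(d/time.Millisecond))
	return err
}
```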

  • Replace go-metrics Prometheus metrics with the Prometheus go client library and emit all previous go-metrics metrics that way
  • Replace go-metrics StatsD with a client that does not perform in-memory aggregation and emit timing metrics that way
  • Add additional logging of request durations

Resolves #35

@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from a4088c0 to b3301d7 on July 7, 2018 04:04
@pingles (Contributor) commented Jul 9, 2018

@idiamond-stripe wow, thanks- this looks great.

I'm going to focus on trying to get #109 merged and then review this with a little more time. It all looks good- only thing I'd like to check is the names of metrics that are published/tracked.

Is there any chance you could add a doc/telemetry.md file that has a table of the different metric names and what they measure please? Would be super useful- both for this and afterwards for people operating kiam.

@pingles (Contributor) left a comment

Thanks- I've added a few comments throughout. I quite like the model of defining a bunch of vars for the various metrics; it makes it easier to see a list of everything that's tracked.

The biggest issue is likely to be the removal of the telemetry decorator type, but hopefully that's either easily replaced with gRPC middleware or can be added back.

credentialTimings := metrics.GetOrRegisterTimer("credentialsHandler", metrics.DefaultRegistry)
startTime := time.Now()
defer credentialTimings.UpdateSince(startTime)
timer := prometheus.NewTimer(handlerTimer.WithLabelValues("credentials"))
Contributor

I'm not super familiar with the prometheus client library- is it expected to create timers inline with function calls or should the timer be constructed and stored as a field on the credentialsHandler struct?

I guess this is also true for the statsd timer below

Contributor Author

It's fine to just create a timer inline. A timer in the Prometheus library is just a single timing for a function call, so it makes sense to defer the completion of it until the end of the fn.
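For reference, a minimal sketch of that inline pattern, assuming a package-level HistogramVec; the handler struct, metric name, and handler signature here are illustrative rather than the PR's exact code:

```go
package metadata

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// handlerTimer is defined and registered once at the package level;
// each call then observes a single duration against it.
var handlerTimer = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "kiam",
	Subsystem: "metadata",
	Name:      "handler_duration_seconds",
	Help:      "Bucketed histogram of handler timings",
}, []string{"handler"})

func init() {
	prometheus.MustRegister(handlerTimer)
}

type credentialsHandler struct{}

func (h *credentialsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Created inline: the timer just captures a start time and an Observer,
	// and the deferred ObserveDuration records the elapsed time (in seconds)
	// when the handler returns.
	timer := prometheus.NewTimer(handlerTimer.WithLabelValues("credentials"))
	defer timer.ObserveDuration()
	// ... handle the request ...
}
```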

timer := metrics.GetOrRegisterTimer("client.GetPodRole", metrics.DefaultRegistry)
startTime := time.Now()
defer timer.UpdateSince(startTime)
timer := prometheus.NewTimer(rpcTimer.With(prometheus.Labels{"rpc": "GetPodRole", "type": "client"}))
Contributor

Really sorry but in #109 I removed the telemetry client/server interfaces. Mostly that was to make it easier to move stuff but it always niggled me that it wasn't the right/best way to do this.

If there's a more idiomatic way that'd be great to incorporate- perhaps middleware?

I've created another issue #114 also to track this and make sure it's done pre-v3 release.

@pingles (Contributor) commented Jul 12, 2018

I've now merged the mega #109 PR, which changed a bunch of stuff around. Unfortunately that breaks a bunch of the changes in this PR. I'll see if I can help work through the pain, but I'm not sure how much time I'll get over the next few days. Really sorry to create more work for you @idiamond-stripe, so I'll try and help as much as I can.

The pkg/server/telemetry stuff I think could be replaced with #114 via some gRPC middleware rather than wrapping interfaces.

Other changes should still stand but potentially just need moving to different files.
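As a rough sketch of what the gRPC-middleware approach mentioned above could look like (the interceptor name and metric wiring are hypothetical, not code from #114), a unary server interceptor can record one timing per RPC instead of wrapping the server interfaces:

```go
package server

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// timingInterceptor records one histogram observation per RPC, labelled by
// the full gRPC method name, without decorating the server implementation.
func timingInterceptor(rpcTimer *prometheus.HistogramVec) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		timer := prometheus.NewTimer(rpcTimer.WithLabelValues(info.FullMethod))
		defer timer.ObserveDuration()
		return handler(ctx, req)
	}
}
```

Wiring it in would then just be a matter of passing grpc.UnaryInterceptor(timingInterceptor(rpcTimer)) when constructing the gRPC server.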

@pingles (Contributor) commented Jul 16, 2018

@Sambooo is there any chance this is something you could help go through and update to bring in line with the new handler and testing stuff we've done please?

@idiamond-stripe (Contributor, Author) commented Jul 16, 2018 via email

@pingles (Contributor) commented Jul 16, 2018

That'd be awesome, thanks @idiamond-stripe.

@Sambooo hold off on picking this up for now then please

@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from f90cd84 to 3907edf on July 21, 2018 20:17
- Replace go-metrics Prometheus metrics with the Prometheus go client library
- Replace go-metrics statsd with a client that does not perform in-memory aggregation
- Add additional logging as to the duration of requests
@idiamond-stripe force-pushed the idiamond/prometheus-and-statsd branch from 3907edf to 3d120a2 on July 21, 2018 20:40
@idiamond-stripe (Contributor, Author) commented

Alright @pingles, rebased on top of your changes and went ahead and added a docs/metrics.md as you suggested. :)

That wasn't too bad, and hopefully everything here makes sense. I wasn't quite sure how to discuss the RPC metrics in the documentation, so if you have any insight that would be great!

@pingles (Contributor) left a comment

Thanks so much for your hard work on the rebase. I've left a few comments inline; some are questions more than suggested changes :)

Really appreciate all this!

### Prometheus
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
Contributor

Should this be renamed to credential_fetch_errors_total to indicate it's an accumulating counter?

#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
- `credential_encode_error` - Number of errors encoding credentials for a pod
Contributor

Same as above, should this be credential_encode_errors_total? Or is it not an accumulating counter?

prometheus.CounterOpts{
    Namespace: "kiam",
    Subsystem: "sts",
    Name:      "error_issuing_count",
Contributor

To match the other metrics could we rename to issuing_errors or issuing_errors_total please?

- `cache_miss_total` - Number of cache misses to the metadata cache
- `error_issuing_count` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assume_role_executing_total` - Number of assume role calls currently executing
Contributor

I'm not sure whether we should use the _total suffix for a gauge? Maybe rename to assumerole_current or assumerole_current_operations?
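For context, a small sketch of the naming convention being suggested in these comments: counters that only accumulate take a `_total` suffix, while a gauge tracking in-flight work does not. The variable names and Help strings below are illustrative, not the PR's exact definitions:

```go
package sts

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter: only ever increases, so the _total suffix fits.
	issuingErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "kiam",
		Subsystem: "sts",
		Name:      "issuing_errors_total",
		Help:      "Number of errors issuing credentials",
	})

	// Gauge: rises and falls with in-flight calls, so no _total suffix.
	assumeRoleExecuting = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "kiam",
		Subsystem: "sts",
		Name:      "assumerole_current",
		Help:      "Number of assume role calls currently executing",
	})
)

func init() {
	prometheus.MustRegister(issuingErrors, assumeRoleExecuting)
}
```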

@pingles (Contributor) commented Aug 2, 2018

I'm going to close this PR in favour of #131 which picks up your contribution with a few changes. Hope that's ok. Thanks again for all your hard work @idiamond-stripe!

@pingles closed this Aug 2, 2018