Idiamond/prometheus and statsd #113

Add metrics docs
idiamond-stripe committed Jul 21, 2018
commit 6751d633c7dd3b32212190b102a60417de730655
59 changes: 59 additions & 0 deletions docs/metrics.md
@@ -0,0 +1,59 @@
# Metrics

Kiam exports both Prometheus and StatsD metrics to determine the health of the
system, check the timing of each RPC call, and monitor the size of the
credentials cache. By default, Prometheus metrics are exported on
`localhost:9620` and StatsD metrics are sent to `127.0.0.1:8125`. StatsD
metrics are not aggregated and are flushed every 100ms.

## Metrics configuration

- The `statsd` flag controls the address to which StatsD metrics are sent. This
is by default `127.0.0.1:8125`. If this is blank, StatsD metrics will be
silenced.
- The `statsd-prefix` flag controls the prefix that will be prepended to
Kiam's StatsD metric names. This is by default `kiam`.
- The `statsd-interval` flag controls how frequently the in-memory metrics
buffer will be flushed to the specified StatsD endpoint. Metrics are
not aggregated in this buffer and the raw counts will be flushed to the
underlying StatsD sink. This is by default `100ms`.
- The `prometheus-listen-addr` flag controls the address on which Kiam exposes
its Prometheus endpoint. This is by default `localhost:9620`. The metrics
themselves can be accessed at `<prometheus-listen-addr>/metrics`.
- The `prometheus-sync-interval` flag controls how frequently Prometheus
metrics should be updated. This is by default `5s`.
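
As a quick sanity check, the Prometheus endpoint can be fetched over plain
HTTP. The sketch below assumes the default `prometheus-listen-addr` of
`localhost:9620` and simply prints whatever the `/metrics` endpoint returns.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumes the default prometheus-listen-addr of localhost:9620.
	resp, err := http.Get("http://localhost:9620/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status: %s\n%s", resp.Status, body)
}
```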

## Emitted Metrics

### Prometheus
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
> **Contributor comment:** Should this be renamed to `credential_fetch_errors_total` to indicate it's an accumulating counter?

- `credential_encode_error` - Number of errors encoding credentials for a pod
> **Contributor comment:** Same as above, should this be `credential_encode_errors_total`? Or is it not an accumulating counter?

- `find_role_error_total` - Number of errors finding the role for a pod
- `empty_role_total` - Number of empty roles returned
- `success_total` - Number of successful responses from a handler
- `responses_total` - Responses from mocked out metadata handlers
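
The counters and histograms above follow the standard Prometheus Go client
pattern, as in `pkg/k8s/metrics.go` shown further down. Below is a minimal
sketch of how two of the metadata metrics might be declared; the subsystem
label `metadata` and the variable names are illustrative assumptions, not
necessarily what Kiam uses:

```go
package metadata

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter: errors fetching credentials for a pod.
	credentialFetchErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "kiam",
		Subsystem: "metadata", // assumed subsystem name
		Name:      "credential_fetch_error",
		Help:      "Number of errors fetching the credentials for a pod",
	})

	// Histogram: handler timings, labelled ("tagged") by handler name.
	handlerLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "kiam",
		Subsystem: "metadata", // assumed subsystem name
		Name:      "handler_latency_milliseconds",
		Help:      "Bucketed histogram of handler timings",
	}, []string{"handler"})
)

func init() {
	prometheus.MustRegister(credentialFetchErrors, handlerLatency)
}
```

A handler would then call, for example,
`handlerLatency.WithLabelValues("credentials").Observe(float64(elapsed.Milliseconds()))`
on success and `credentialFetchErrors.Inc()` on failure.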

#### STS Subsystem
- `cache_hit_total` - Number of cache hits to the metadata cache
- `cache_miss_total` - Number of cache misses to the metadata cache
- `error_issuing_count` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assume_role_executing_total` - Number of assume role calls currently executing
> **Contributor comment:** I'm not sure whether we should use the `_total` suffix for a gauge? Maybe rename to `assumerole_current` or `assumerole_current_operations`?
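
As the comment notes, `assume_role_executing_total` tracks calls currently in
flight, so it behaves as a gauge rather than an accumulating counter. A hedged
sketch of the usual pattern (package, subsystem, and function names are
illustrative):

```go
package sts

import "github.com/prometheus/client_golang/prometheus"

// Gauge: how many AssumeRole calls are executing right now.
var assumeRoleExecuting = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "kiam",
	Subsystem: "sts", // assumed subsystem name
	Name:      "assume_role_executing_total",
	Help:      "Number of assume role calls currently executing",
})

func init() {
	prometheus.MustRegister(assumeRoleExecuting)
}

// trackAssumeRole is a hypothetical wrapper: increment on entry,
// decrement when the underlying call returns.
func trackAssumeRole(call func() error) error {
	assumeRoleExecuting.Inc()
	defer assumeRoleExecuting.Dec()
	return call()
}
```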


#### K8s Subsystem
- `dropped_pods_total` - Number of dropped pods because of full cache

### StatsD Timing metrics
- `gateway.rpc.GetRole` - Observed client side latency of GetRole RPC
- `gateway.rpc.GetCredentials` - Observed client side latency of GetCredentials RPC
- `server.rpc.GetRoleCredentials` - Observed server side latency of GetRoleCredentials RPC
- `server.rpc.IsAllowedAssumeRole` - Observed server side latency of IsAllowedAssumeRole RPC
- `server.rpc.GetHealth` - Observed server side latency of GetHealth RPC
- `server.rpc.GetPodRole` - Observed server side latency of GetPodRole RPC
- `handler.role_name` - Observed latency of role_name handler
- `handler.health` - Observed latency of health handler
- `handler.credentials` - Observed latency of credentials handler
- `aws.assume_role` - Observed latency of aws assume role request
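
With the default `statsd-prefix` of `kiam`, these presumably reach the StatsD
sink under the prefixed names, e.g. `kiam.gateway.rpc.GetRole`, as raw
(unaggregated) timing samples emitted at each flush interval.
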
2 changes: 1 addition & 1 deletion pkg/k8s/metrics.go
@@ -10,7 +10,7 @@ var (
Namespace: "kiam",
Subsystem: "k8s",
Name: "dropped_pods_total",
Help: "Dropped pods because of full cache",
Help: "Number of dropped pods because of full cache",
},
)
)