This repository has been archived by the owner on Mar 5, 2024. It is now read-only.
Idiamond/prometheus and statsd #113
Closed
idiamond-stripe wants to merge 4 commits into uswitch:master from idiamond-stripe:idiamond/prometheus-and-statsd
Add metrics docs
commit 6751d633c7dd3b32212190b102a60417de730655
@@ -0,0 +1,59 @@
# Metrics

Kiam exports both Prometheus and StatsD metrics that can be used to determine
the health of the system, check the timing of each RPC call, and monitor the
size of the credentials cache. By default, Prometheus metrics are exported on
`localhost:9620` and StatsD metrics are sent to `127.0.0.1:8125`. StatsD
metrics are not aggregated; the raw values are flushed every 100ms.

## Metrics configuration

- The `statsd` flag controls the address to which StatsD metrics are sent. The
default is `127.0.0.1:8125`. If this is blank, StatsD metrics will be
silenced.
- The `statsd-prefix` flag controls the prefix that will be prepended to
Kiam's StatsD metrics. The default is `kiam`.
- The `statsd-interval` flag controls how frequently the in-memory metrics
buffer will be flushed to the specified StatsD endpoint. Metrics are
not aggregated in this buffer; the raw counts are flushed to the
underlying StatsD sink. The default is `100ms`.
- The `prometheus-listen-addr` flag controls the address on which Kiam serves
its Prometheus endpoint. The default is `localhost:9620`. The metrics
themselves can be accessed at `<prometheus-listen-addr>/metrics`.
- The `prometheus-sync-interval` flag controls how frequently Prometheus
metrics should be updated. The default is `5s`.
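
To make the interaction between `statsd-prefix` and the unaggregated buffer concrete, here is a minimal Python sketch. It is not Kiam's implementation: the `TimingBuffer` class, the `send` helper, and the use of `.` to join the prefix to the metric name are all assumptions made for illustration. It uses the common StatsD timing line format (`<name>:<value>|ms`) and keeps every sample as its own line instead of combining them.

```python
import socket


class TimingBuffer:
    """Hypothetical in-memory buffer that keeps raw timing samples (no aggregation)."""

    def __init__(self, prefix="kiam"):
        self.prefix = prefix
        self.samples = []

    def record(self, name, millis):
        # Each observation is stored as its own StatsD timing line.
        # Joining prefix and name with "." is an assumption.
        self.samples.append(f"{self.prefix}.{name}:{millis}|ms")

    def flush(self):
        # Return every buffered line, newline-separated, and empty the buffer.
        payload = "\n".join(self.samples)
        self.samples = []
        return payload


def send(payload, address="127.0.0.1:8125"):
    """Send one flushed payload over UDP; a blank address silences output."""
    if not address or not payload:
        return False
    host, port = address.rsplit(":", 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, int(port)))
    return True


buf = TimingBuffer()
buf.record("gateway.rpc.GetRole", 12)
buf.record("gateway.rpc.GetRole", 15)
payload = buf.flush()  # two separate lines, one per observation
```

In this sketch a caller would invoke `send(buf.flush())` once per flush interval. Because nothing is aggregated, the receiving StatsD sink sees every individual observation.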

## Emitted Metrics

### Prometheus
#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_error` - Number of errors fetching the credentials for a pod
- `credential_encode_error` - Number of errors encoding credentials for a pod
Review comment: Same as above, should this be
- `find_role_error_total` - Number of errors finding the role for a pod
- `empty_role_total` - Number of empty roles returned
- `success_total` - Number of successful responses from a handler
- `responses_total` - Responses from mocked out metadata handlers

#### STS Subsystem
- `cache_hit_total` - Number of cache hits to the metadata cache
- `cache_miss_total` - Number of cache misses to the metadata cache
- `error_issuing_count` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assume_role_executing_total` - Number of assume role calls currently executing
Review comment: I'm not sure whether we should use the

#### K8s Subsystem
- `dropped_pods_total` - Number of dropped pods because of full cache
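
As an illustration of how these counters might be consumed, the following Python sketch parses Prometheus text exposition output and derives a cache hit ratio from a hit counter and a miss counter. The sample text and the `example_` metric names are hypothetical. The full names Kiam exposes (including any namespace or subsystem prefix) are not shown in this document, so check the actual output of `<prometheus-listen-addr>/metrics` before relying on any name.

```python
def parse_samples(text):
    """Parse Prometheus text exposition lines into {metric_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumes no trailing timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical scrape output; these are not Kiam's real metric names.
SAMPLE = """\
# HELP example_cache_hit_total Number of cache hits
# TYPE example_cache_hit_total counter
example_cache_hit_total 90
# HELP example_cache_miss_total Number of cache misses
# TYPE example_cache_miss_total counter
example_cache_miss_total 10
"""

samples = parse_samples(SAMPLE)
hits = samples["example_cache_hit_total"]
misses = samples["example_cache_miss_total"]
hit_ratio = hits / (hits + misses)
```

In practice a Prometheus server would scrape the endpoint and compute such ratios with queries. The parser here is only meant to show the shape of the data.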

### StatsD Timing metrics
- `gateway.rpc.GetRole` - Observed client side latency of GetRole RPC
- `gateway.rpc.GetCredentials` - Observed client side latency of GetCredentials RPC
- `server.rpc.GetRoleCredentials` - Observed server side latency of GetRoleCredentials RPC
- `server.rpc.IsAllowedAssumeRole` - Observed server side latency of IsAllowedAssumeRole RPC
- `server.rpc.GetHealth` - Observed server side latency of GetHealth RPC
- `server.rpc.GetPodRole` - Observed server side latency of GetPodRole RPC
- `handler.role_name` - Observed latency of role_name handler
- `handler.health` - Observed latency of health handler
- `handler.credentials` - Observed latency of credentials handler
- `aws.assume_role` - Observed latency of AWS assume role request
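
Because the timings are sent unaggregated, whatever receives them has to do any summarizing. The Python sketch below groups raw timing lines by metric name and reports a count, mean, and maximum. The `parse_timing` and `summarize` helpers are hypothetical, and the sample lines assume the default `kiam` prefix joined to the metric name with a `.`.

```python
from collections import defaultdict


def parse_timing(line):
    """Split '<name>:<value>|ms' into (name, milliseconds); return None for other types."""
    name, _, rest = line.rpartition(":")
    value, _, kind = rest.partition("|")
    if kind != "ms":
        # Counters, gauges, and sampled timings are ignored in this sketch.
        return None
    return name, float(value)


def summarize(lines):
    """Group raw timing samples by metric name and report count, mean, and max."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_timing(line)
        if parsed is not None:
            grouped[parsed[0]].append(parsed[1])
    return {
        name: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
        for name, v in grouped.items()
    }


# Hypothetical samples as they might arrive at a StatsD sink.
lines = [
    "kiam.server.rpc.GetHealth:2|ms",
    "kiam.server.rpc.GetHealth:4|ms",
    "kiam.aws.assume_role:120|ms",
]
summary = summarize(lines)
```

A real StatsD server typically performs this aggregation itself (including percentiles) at its own flush interval.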
Review comment: Should this be renamed to `credential_fetch_errors_total` to indicate it's an accumulating counter?