This repository has been archived by the owner on Mar 5, 2024. It is now read-only.
Idiamond metrics #131
Merged
Commits (9):
- 3d120a2 Add prometheus metrics and replace statsd (idiamond-stripe)
- 6751d63 Add metrics docs (idiamond-stripe)
- acebdfe Rerun dep ensure (idiamond-stripe)
- 43dfe68 Remove test from bad rebase (idiamond-stripe)
- 3cfbdca rename some prometheus metrics (pingles)
- 92dfdfe add documentation for grpc server metrics (pingles)
- e82f7cc grpc client metrics (pingles)
- d742ced grpc interceptors must be chained (pingles)
- b025114 clarify statsd metric export (pingles)
@@ -0,0 +1,72 @@
# Metrics

Kiam exports both Prometheus and StatsD metrics to determine the health of the
system, check the timing of each RPC call, and monitor the size of the
credentials cache. By default, Prometheus metrics are exported on
`localhost:9620` and StatsD metrics are sent to `127.0.0.1:8125`. StatsD
metrics are not aggregated; they are flushed every 100ms.

## Metrics configuration

- The `statsd` flag controls the address to which StatsD metrics are sent. The
  default is `127.0.0.1:8125`. If this is blank, StatsD metrics will be
  silenced.
- The `statsd-prefix` flag controls the initial prefix that will be prepended to
  Kiam's StatsD metrics. The default is `kiam`.
- The `statsd-interval` flag controls how frequently the in-memory metrics
  buffer will be flushed to the specified StatsD endpoint. Metrics are
  not aggregated in this buffer and the raw counts will be flushed to the
  underlying StatsD sink. The default is `100ms`.
- The `prometheus-listen-addr` flag controls the address on which Kiam creates
  a Prometheus endpoint. The default is `localhost:9620`. The metrics
  themselves can be accessed at `<prometheus-listen-addr>/metrics`.
- The `prometheus-sync-interval` flag controls how frequently Prometheus
  metrics should be updated. The default is `5s`.

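To check that the Prometheus endpoint is up, the metrics can be fetched over HTTP. The following is a minimal Go sketch, not part of Kiam itself. It assumes the endpoint is served over plain `http` on the default `localhost:9620` address; the `metricsURL` helper is a hypothetical name introduced for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// metricsURL builds the scrape URL for a given prometheus-listen-addr value.
// Assumption: the endpoint is served over plain HTTP.
func metricsURL(listenAddr string) string {
	return "http://" + listenAddr + "/metrics"
}

func main() {
	// A short timeout keeps the check from hanging if Kiam is not running.
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get(metricsURL("localhost:9620"))
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}

	// The body is in the Prometheus text exposition format.
	fmt.Printf("%s", body)
}
```

If `prometheus-listen-addr` is set to a different value, pass that value to `metricsURL` instead.
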
## Emitted Metrics

### Prometheus

#### Metadata Subsystem
- `handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
- `credential_fetch_errors_total` - Number of errors fetching the credentials for a pod
- `credential_encode_errors_total` - Number of errors encoding credentials for a pod
- `find_role_errors_total` - Number of errors finding the role for a pod
- `empty_role_total` - Number of empty roles returned
- `success_total` - Number of successful responses from a handler
- `responses_total` - Responses from mocked out metadata handlers

#### STS Subsystem
- `cache_hit_total` - Number of cache hits to the metadata cache
- `cache_miss_total` - Number of cache misses to the metadata cache
- `issuing_errors_total` - Number of errors issuing credentials
- `assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
- `assumerole_current` - Number of assume role calls currently executing

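The two cache counters above can be combined into a hit ratio. The following Go sketch shows the arithmetic only; the `hitRatio` helper and the sample values are hypothetical, and how the counter values are obtained (for example from a scrape of the `/metrics` endpoint) is left out.

```go
package main

import "fmt"

// hitRatio returns hits / (hits + misses).
// It returns 0 when no lookups have been recorded, to avoid dividing by zero.
func hitRatio(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}

func main() {
	// Hypothetical readings of cache_hit_total and cache_miss_total.
	hits, misses := 3.0, 1.0
	fmt.Printf("cache hit ratio: %.2f\n", hitRatio(hits, misses))
}
```

Because both metrics are counters that only increase, a ratio over a recent window is usually more informative than one over the raw totals.
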
#### K8s Subsystem
- `dropped_pods_total` - Number of dropped pods because of full buffer

#### gRPC Server (Kiam Server)
- `grpc_server_handled_total` - Total number of RPCs completed on the server, regardless of success or failure.
- `grpc_server_msg_received_total` - Total number of RPC stream messages received on the server.
- `grpc_server_msg_sent_total` - Total number of gRPC stream messages sent by the server.
- `grpc_server_started_total` - Total number of RPCs started on the server.

#### gRPC Client (Kiam Agent)
- `grpc_client_handled_total` - Total number of RPCs completed by the client, regardless of success or failure.
- `grpc_client_msg_received_total` - Total number of RPC stream messages received by the client.
- `grpc_client_msg_sent_total` - Total number of gRPC stream messages sent by the client.
- `grpc_client_started_total` - Total number of RPCs started on the client.

### StatsD Timing metrics
- `gateway.rpc.GetRole` - Observed client-side latency of GetRole RPC
- `gateway.rpc.GetCredentials` - Observed client-side latency of GetCredentials RPC
- `server.rpc.GetRoleCredentials` - Observed server-side latency of GetRoleCredentials RPC
- `server.rpc.IsAllowedAssumeRole` - Observed server-side latency of IsAllowedAssumeRole RPC
- `server.rpc.GetHealth` - Observed server-side latency of GetHealth RPC
- `server.rpc.GetPodRole` - Observed server-side latency of GetPodRole RPC
- `handler.role_name` - Observed latency of role_name handler
- `handler.health` - Observed latency of health handler
- `handler.credentials` - Observed latency of credentials handler
- `aws.assume_role` - Observed latency of AWS assume role request
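
To illustrate how the `statsd-prefix` value combines with the timing names above, the following Go sketch formats a single sample in the generic StatsD timer format (`<name>:<value>|ms`). This is an assumption for illustration: it supposes the prefix and name are joined with a dot and that samples use the plain `ms` type. The StatsD client used by Kiam may emit a different shape, and `timingLine` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// timingLine formats one timer sample as "<prefix>.<name>:<milliseconds>|ms".
// Assumption: prefix and metric name are joined with a single dot.
func timingLine(prefix, name string, d time.Duration) string {
	return fmt.Sprintf("%s.%s:%d|ms", prefix, name, d.Milliseconds())
}

func main() {
	// A hypothetical 12ms GetRole call with the default "kiam" prefix.
	fmt.Println(timingLine("kiam", "gateway.rpc.GetRole", 12*time.Millisecond))
}
```
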

Review comment:

> are not aggregated and flushed every 100ms.

I think that should be "are not aggregated and are flushed every 100ms".