Skip to content

Conversation

@bavarianbidi
Copy link
Contributor

@bavarianbidi bavarianbidi commented Feb 21, 2024

This PR add couple of new metrics to count interactions with the github API and external provider binaries.

i've followed the recommendation of the prometheus-folks when it came to metrics for this kind of problem: https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-for-binary-outcomes-exposing-errors-and-totals

provider binary example metrics:

# HELP garm_runner_operations_total Total number of instance operation attempts
# TYPE garm_runner_operations_total counter
garm_runner_operation_total{operation="CreateInstance",provider="kubernetes_external"} 2
garm_runner_operation_total{operation="DeleteInstance",provider="kubernetes_external"} 2

github api call example metrics:

# HELP garm_github_errors_total Total number of failed github operation attempts
# TYPE garm_github_errors_total counter
garm_github_errors_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
# HELP garm_github_operations_total Total number of github operation attempts
# TYPE garm_github_operations_total counter
garm_github_operations_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunnerApplicationDownloads",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunners",scope="Organization"} 1
garm_github_operations_total{operation="RemoveRunner",scope="Organization"} 4

towards #181

TODO:

  • document new metrics
  • discuss golangci.yml with metrics linter --> will do another PR with a first golangci.yaml for this project
  • do a second round and search for all github api calls

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
introduce metrics counter for github api calls

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
metricsLabelEnterpriseScope, // label: scope
).Inc()
if resp != nil && resp.StatusCode == http.StatusUnauthorized {
return nil, nil, fmt.Errorf("failed to get JIT config: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no related to this change.

Note to self: This needs to wrap runnerErrors.ErrUnauthorized

@gabriel-samfira
Copy link
Member

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely 😁

@bavarianbidi
Copy link
Contributor Author

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely 😁

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service.
But as you said, something totally different and unrelated to this PR.

Will convert the PR to draft to finalize it tomorrow morning (especially the second round of looking for external api-calls).

@bavarianbidi bavarianbidi marked this pull request as draft February 21, 2024 18:38
@gabriel-samfira
Copy link
Member

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service.
But as you said, something totally different and unrelated to this PR.

definitely something for future releases, not even for the next one. otel allows us to trace calls throughout the codebase and allows us to add context to those traces, like what we're calling functions with, what error they return. With that info, we can then generate detailed graphs of how a call traveled throughout the code base. Further more, this can be correlated even with external providers.

But again, this is for much further into the future. For now, and especially in this PR, metrics are the focus.

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
@bavarianbidi bavarianbidi marked this pull request as ready for review February 22, 2024 04:58
@bavarianbidi
Copy link
Contributor Author

I've found three other API calls i missed in the first round - but for now it looks good - feel free to merge whenever you want

@gabriel-samfira gabriel-samfira merged commit dd6f1e4 into cloudbase:main Feb 22, 2024
bavarianbidi pushed a commit to mercedes-benz/garm that referenced this pull request May 22, 2024
extend metrics for github and provider executions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants