
Additional exported metrics for the metrics apiserver #4460

Closed
BojanZelic opened this issue Apr 17, 2023 · 14 comments · Fixed by #4982
Labels: feature, needs-discussion

Comments

@BojanZelic
Contributor

BojanZelic commented Apr 17, 2023

Proposal

Provide exported metrics similar to those the Kubernetes metrics-server has that are useful for tracking SLOs for KEDA (e.g. latency, or % of successful requests, etc.).

For example:

apiserver_request_total{group="external.metrics.k8s.io"}
apiserver_request_slo_duration_seconds{group="external.metrics.k8s.io"}

Use-Case

I would like to get the number of requests to the KEDA API server that have failed; currently this isn't possible. With metrics-server, I can do something like:

(
  sum(rate(apiserver_request_total{group="metrics.k8s.io", code=~"(5..|429)"}[5m])) +
  sum(rate(apiserver_request_terminations_total{group="metrics.k8s.io"}[5m]))
)

This works because apiserver_request_total{group="metrics.k8s.io"} is exposed by metrics-server, which uses the apiserver package from Kubernetes.

KEDA could expose similar metrics, either by using the apiserver package from Kubernetes or by implementing something similar.

Is this a feature you are interested in implementing yourself?

Yes

Anything else?

The aggregation API from Kubernetes tracks failed requests to external.metrics.k8s.io via apiserver_request_terminations_total; it does not track the total number of requests, though. There's currently no way to get the total number of requests to external.metrics.k8s.io, since the KEDA metrics apiserver doesn't expose this information.

@BojanZelic added the feature-request and needs-discussion labels on Apr 17, 2023
@JorTurFer
Member

Hi @BojanZelic,
It totally makes sense; I'll assign the issue to you (based on your response).
One question about this: will you implement something similar in the operator too?

@BojanZelic
Contributor Author

I was going to implement it for the API server only, because the operator already exposes Prometheus metrics for reconciliation time, errors, number of reconciliations, etc.:

controller_runtime_reconcile_time_seconds_sum
controller_runtime_reconcile_time_seconds_count
controller_runtime_reconcile_time_seconds_bucket
controller_runtime_reconcile_total
workqueue_work_duration_seconds_bucket

But the API server lacks observability for verifying correct responses & response times.

@JorTurFer
Member

The metrics server should also have the same metrics, because it's built on top of the operator framework.

@BojanZelic
Contributor Author

It looks like the operator-framework-provided metrics exist for the KEDA API server under port 8080, while the keda_metrics_adapter_* metrics are under port 9022.

That's great for figuring out whether the MetricsScaledObjectReconciler is encountering errors.

But what I want to know is whether requests are succeeding or failing from the perspective of the HPA controller, e.g.:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/mynamespace/mymetric"

& figure out what percentage of those requests fail;

So the concept doesn't apply to the operator, unless you mean the gRPC communication between the metrics server & the operator; in that case, yes, I could add those same metrics there.

@JorTurFer
Member

Okay... it's complicated...
We deprecated the metrics server's Prometheus server on port 9022 in KEDA v2.9, so in theory we should remove it now (as part of the next release). It's true that the metrics server exposes two Prometheus servers, but all new metrics should be added to the runtime server (port 8080), because the other will be removed (the keda_metrics_adapter_* metrics are deprecated).
So you can add the new metrics that you proposed to the metrics server, but in the runtime server.

From the operator PoV, I guess we could expose similar metrics based on the gRPC requests. WDYT?

@JorTurFer
Member

I plan to remove them during this week, so if you wait until then, the metrics server will have only one Prometheus server, making things easier xD

@BojanZelic
Contributor Author

Makes sense! Thanks for the explanation. The new metrics can be added to :8080.

As for gRPC metrics, it seems like we're already using https://github.com/grpc-ecosystem/go-grpc-middleware, so wiring in the Prometheus middleware should be straightforward and would give us gRPC Prometheus metrics: https://github.com/grpc-ecosystem/go-grpc-middleware/tree/main/providers/prometheus
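A rough sketch of what that wiring could look like (not KEDA's actual code; the helper name and server construction here are assumptions):

package metricsexample

import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// newInstrumentedGRPCServer shows how the provider's server metrics could be
// plugged in via interceptors and registered with an existing registry.
func newInstrumentedGRPCServer(reg prometheus.Registerer) *grpc.Server {
	srvMetrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(), // latency histograms, not just counters
	)
	reg.MustRegister(srvMetrics)

	server := grpc.NewServer(
		grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
		grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
	)
	// After the gRPC services are registered on `server`,
	// srvMetrics.InitializeMetrics(server) can pre-populate the per-method
	// series with zero values.
	return server
}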

@JorTurFer
Member

#3972

@tomkerkhove added the feature label and removed the feature-request label on Apr 26, 2023
@zroubalik
Member

@BojanZelic FYI, the deprecated Prometheus server from the metrics server has been removed.

I don't follow the part about gRPC; what kind of metrics do you want to expose there? Thanks!

@JorTurFer
Member

We have removed the second Prometheus server, but the new metrics can still be added to the controller-runtime Prometheus registry (exposed on port 8080 in the metrics server).

@BojanZelic
Contributor Author

So, an update:

After some digging, I discovered that there's actually another, undocumented metrics endpoint that already exposes these metrics without any code change. It needs to be accessed with an allowed role & I updated the docs here: kedacore/keda-docs#1222

I ended up not needing the gRPC metrics, as they weren't useful and would essentially report latency similar to apiserver_request_sli_duration_seconds_bucket.

I tried to combine the metrics so that they're exposed via the controller-runtime port (8080) and endpoint, but was unsuccessful: there's no easy way to combine the two registries, since both libraries alter global variables, use internal packages that make it difficult to change implementation details, and both register the same collectors (e.g. collectors.NewProcessCollector & collectors.NewGoCollector()).
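(For what it's worth, the duplicate default collectors can in principle be unregistered from one side; a minimal sketch against the controller-runtime registry, with assumed import aliases and helper name:)

package metricsexample

import (
	"github.com/prometheus/client_golang/prometheus/collectors"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// unregisterDuplicateCollectors drops the pre-registered Go and process
// collectors from the controller-runtime registry. Unregister matches
// collectors by their descriptors, so fresh instances are enough here.
func unregisterDuplicateCollectors() {
	ctrlmetrics.Registry.Unregister(collectors.NewGoCollector())
	ctrlmetrics.Registry.Unregister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
}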

@JorTurFer
Member

Nice research!!!
I thought that we weren't using k8s.io/apiserver, but it looks like we do. I'm not sure if we can handle it from KEDA directly, but maybe we can do something like custom-metrics-apiserver does: https://github.com/kubernetes-sigs/custom-metrics-apiserver/blob/master/pkg/apiserver/metrics/metrics.go#L37-L40

Exposing two metrics servers doesn't make sense IMHO, and we should choose one. If there isn't any way to merge them, maybe we have to open issues upstream to allow this somehow. @zroubalik?

@BojanZelic
Contributor Author

BojanZelic commented Sep 17, 2023

That makes sense on exposing only one port. I took another stab at it, and I was able to get it working when exposing via the 6443 port, but not the controller-runtime port (8080); the abstractions that the k8s.io/apiserver library puts in place with its Prometheus wrappers make that difficult.

Because the Prometheus registries conflict, we can unregister the Go & process collectors from controller-runtime and expose both registries on the 6443 port.

Something like this; see: https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/server/config.go#L83

And then metrics-server just calls ServeHTTP on both handlers:
https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/server/config.go#L117-L118
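A rough sketch of that pattern, serving both registries from one endpoint (the handler and variable names here are assumptions, not KEDA's actual code):

package metricsexample

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"k8s.io/component-base/metrics/legacyregistry"
)

// combinedMetricsHandler writes the k8s.io/apiserver metrics (kept in the
// component-base legacy registry) and the controller-runtime registry output
// into a single response, the same trick metrics-server uses. It assumes the
// duplicate Go/process collectors were already unregistered from one side.
func combinedMetricsHandler(ctrlRegistry prometheus.Gatherer) http.Handler {
	ctrlHandler := promhttp.HandlerFor(ctrlRegistry, promhttp.HandlerOpts{})
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		legacyregistry.Handler().ServeHTTP(w, req)
		ctrlHandler.ServeHTTP(w, req)
	})
}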

The only caveat here is that this endpoint requires authentication and needs an RBAC rule for access; see kedacore/keda-docs#1222 (comment).

Got it to work under the existing :8080 port

What do you think?

@BojanZelic
Contributor Author

@JorTurFer I created a PR to consolidate the metrics to one port; PTAL #4982
