Remove old ingress-rules metrics for prometheus scraping #11047

Open
SilentEntity opened this issue Mar 1, 2024 · 7 comments
Labels: help wanted · needs-kind · needs-priority · needs-triage

@SilentEntity

What happened:

After an ingress rule is updated, the Ingress controller keeps exposing metrics for the old rules in addition to the new ones. This increases cardinality and produces useless data (for rules that were removed) every time Prometheus scrapes the pod.

What you expected to happen:

Once rules are updated or removed, the metrics for the old label sets should be removed as well, which reduces cardinality and avoids exposing useless data for removed or updated rules.
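
Below is a minimal sketch (not the actual ingress-nginx code) of how the controller could drop the stale series when an Ingress is deleted, assuming the collector keeps its histograms in client_golang vectors and a reasonably recent client_golang (v1.12+, which provides DeletePartialMatch) is in use; the metric and label names here are illustrative, not necessarily the controller's real ones:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative vector; the real controller labels its request histogram differently.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "nginx_ingress_controller_request_duration_seconds",
		Help: "Request processing time in seconds.",
	},
	[]string{"namespace", "ingress", "path"},
)

// onIngressDeleted is a hypothetical hook that would run when an Ingress (or
// one of its rules) is removed. DeletePartialMatch deletes every stored label
// combination matching the given subset, so all series belonging to the
// deleted object disappear from the next scrape. It returns the number of
// series removed.
func onIngressDeleted(namespace, ingress string) int {
	return requestDuration.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   ingress,
	})
}
```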

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

Kubernetes version (use kubectl version): Not relevant

Environment:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release): not relevant

  • Kernel (e.g. uname -a): not relevant

  • Install tools: EKS, AKS and bare metal

    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:

    • kubectl version
    • kubectl get nodes -o wide
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
    • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
    • If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
    • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances

How to reproduce this issue:

Add 100 rules, then update them or reduce them to 10. The Ingress controller will keep exposing metrics for both the old and the new rules.

Increase in cardinality:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
3048 nginx_ingress_controller_request_duration_seconds_bucket
2988 nginx_ingress_controller_response_duration_seconds_bucket
2988 nginx_ingress_controller_connect_duration_seconds_bucket
2820 nginx_ingress_controller_header_duration_seconds_bucket
2794 nginx_ingress_controller_response_size_bucket
2794 nginx_ingress_controller_request_size_bucket
2032 nginx_ingress_controller_bytes_sent_bucket
 254 nginx_ingress_controller_response_size_sum
 254 nginx_ingress_controller_response_size_count
 254 nginx_ingress_controller_requests
 254 nginx_ingress_controller_request_size_sum
 254 nginx_ingress_controller_request_size_count
 254 nginx_ingress_controller_request_duration_seconds_sum
 254 nginx_ingress_controller_request_duration_seconds_count
 254 nginx_ingress_controller_bytes_sent_sum
 254 nginx_ingress_controller_bytes_sent_count
 249 nginx_ingress_controller_response_duration_seconds_sum
 249 nginx_ingress_controller_response_duration_seconds_count
 249 nginx_ingress_controller_connect_duration_seconds_sum
 249 nginx_ingress_controller_connect_duration_seconds_count
 235 nginx_ingress_controller_header_duration_seconds_sum
 235 nginx_ingress_controller_header_duration_seconds_count

After you restart the pod:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
 288 nginx_ingress_controller_response_duration_seconds_bucket
 288 nginx_ingress_controller_request_duration_seconds_bucket
 288 nginx_ingress_controller_header_duration_seconds_bucket
 288 nginx_ingress_controller_connect_duration_seconds_bucket
 264 nginx_ingress_controller_response_size_bucket
 264 nginx_ingress_controller_request_size_bucket
 192 nginx_ingress_controller_bytes_sent_bucket
  24 nginx_ingress_controller_response_size_sum
  24 nginx_ingress_controller_response_size_count
  24 nginx_ingress_controller_response_duration_seconds_sum
  24 nginx_ingress_controller_response_duration_seconds_count
  24 nginx_ingress_controller_requests
  24 nginx_ingress_controller_request_size_sum
  24 nginx_ingress_controller_request_size_count
  24 nginx_ingress_controller_request_duration_seconds_sum
  24 nginx_ingress_controller_request_duration_seconds_count
  24 nginx_ingress_controller_header_duration_seconds_sum
  24 nginx_ingress_controller_header_duration_seconds_count
  24 nginx_ingress_controller_connect_duration_seconds_sum
  24 nginx_ingress_controller_connect_duration_seconds_count
  24 nginx_ingress_controller_bytes_sent_sum
  24 nginx_ingress_controller_bytes_sent_count
  21 nginx_ingress_controller_ingress_upstream_latency_seconds
  19 nginx_ingress_controller_orphan_ingress
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_sum
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_count

Anything else we need to know:
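
For context (this is not ingress-nginx code), the drop after a restart is simply the in-memory registry starting empty again: client_golang vectors keep every label combination that has ever been observed until it is explicitly deleted. A tiny standalone demonstration, with made-up metric and label names:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	requests := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "demo_requests_total", Help: "Requests per path."},
		[]string{"path"},
	)

	// Simulate traffic for 100 paths, as if 100 ingress rules existed.
	for i := 0; i < 100; i++ {
		requests.WithLabelValues(fmt.Sprintf("/rule-%d", i)).Inc()
	}
	fmt.Println(testutil.CollectAndCount(requests)) // 100 series

	// "Removing" 90 rules without deleting their series changes nothing:
	// the collector keeps exposing all 100 label sets on every scrape.
	fmt.Println(testutil.CollectAndCount(requests)) // still 100 series

	// Explicit deletion is what actually drops the stale series.
	for i := 10; i < 100; i++ {
		requests.DeleteLabelValues(fmt.Sprintf("/rule-%d", i))
	}
	fmt.Println(testutil.CollectAndCount(requests)) // 10 series
}
```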

@SilentEntity added the kind/bug label on Mar 1, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage and needs-priority labels on Mar 1, 2024
@longwuyuan (Contributor)

/help

@SilentEntity thanks for reporting this.

  • Yes, you are right and this has been going on for a long time
  • Another typical example is that an expired cert will continue showing up, even after the related ingress is deleted
  • But personally I am waiting for clarity from someone on the time-series aspect of this data. The context is that the old rule metrics, and the metrics from a deleted ingress's cert, are time-series data that a user may still want to view in Grafana (or query from raw Prometheus) in the future

So I don't think this is a bug unless we can discuss and triage it as one. Let's wait for expert comments and opinions.

/assign

@k8s-ci-robot (Contributor)

@longwuyuan:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label on Mar 1, 2024
@longwuyuan (Contributor)

/remove-kind bug

@k8s-ci-robot added the needs-kind label and removed the kind/bug label on Mar 1, 2024

github-actions bot commented Apr 1, 2024

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any questions or want to request prioritization, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions bot added the lifecycle/frozen label on Apr 1, 2024
@SilentEntity (Author)

In any case, old or expired metrics data won't be present in a new pod (when scaling out) or in a restarted pod, which creates discrepancies in the metrics and Grafana dashboards.

@github-actions bot removed the lifecycle/frozen label on Apr 4, 2024
@jakuboskera

+1
