[Spanmetrics Connector] sometimes counter-type metrics grow exponentially. #33136
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I resolved this issue by removing the consumer- and producer-related dimensions. I don't know why this issue happens, so I decided to change it like this: for SERVER span metrics, keep the server-related dimensions only; for everything else, drop all of them.
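For reference, dropping those dimensions in the spanmetrics connector would look roughly like the sketch below. This is a minimal, assumed example, not the reporter's actual configuration; the attribute names stand in for "server-related dimensions".

```yaml
# Hypothetical sketch: only server-side attributes are declared as
# dimensions, so consumer/producer attributes never become metric labels.
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
```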
That did not resolve the issue; it still occurs.
I tried this, but it still occurs.
Could you give me any advice? If you have any additional questions or suspicions, please let me know. Or is there something I'm setting up incorrectly?
@pingping95 You added … Updating your PromQL with …
@Frapschen Thanks for replying! I didn't add the cluster and service_name labels to by() for security reasons.
It seems that the series with http.method of put and get are the strange ones.
Hi, if you could provide us a list of ~50 metrics from the normal case (before your spike) and ~50 during the spike, we could help you figure out what the origin of the spike is and whether the span metrics connector is involved in the problem. We would want the metric name and the full set of labels for each metric. You can anonymize this data however you see fit, but keeping the label keys the same would really help.
@ankitpatel96 Hi, thanks for the help! Thanks to your pointers I found something new: the counter-type metrics do not actually grow exponentially. When I remove the rate() function, the calls_total metric looks like it drops all at once. Is the spanmetrics connector a stateful component? I use a load-balancing OTel Collector in front of the spanmetrics connector, and it routes by service name.
The times when the metrics look strange and the times when the Collector running the spanmetrics connector is restarted are exactly the same (8h, 17h). Do I need to run the spanmetrics connector with a PVC? In any case, I found the cause thanks to your help. Thank you.
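For context, here is a minimal sketch of the two-layer layout described above; the endpoints, service names, and backends are assumptions, not the reporter's setup. The point it illustrates is that the spanmetrics connector keeps its cumulative counters only in process memory, so restarting the tier-2 pod starts every series from zero again, which rate() then interprets as a counter reset.

```yaml
# Tier 1 (assumed): routes spans by service name so that all spans of a
# given service reach the same tier-2 collector.
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  loadbalancing:
    routing_key: "service"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-spanmetrics.observability.svc.cluster.local  # hypothetical headless service
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# Tier 2 (assumed): runs the spanmetrics connector; its counters live in
# memory only, so a pod restart resets every cumulative series to zero.
receivers:
  otlp:
    protocols:
      grpc: {}
connectors:
  spanmetrics: {}
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push  # hypothetical backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```

As far as I know, the connector does not persist this state to disk, so a PVC alone would not change the behaviour; Prometheus's rate() is designed to tolerate counter resets, as long as each replica writes to its own distinct series.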
I'm going to try adding a new label that increases cardinality (for example, pod_id on the calls_total metric) to check whether the issue still exists, and I'll also try restarting the collector pod. If the same phenomenon does not occur after the restart, I think I can conclude that the issue is not a problem with the spanmetrics connector, but rather the absence of a cardinality-increasing label in the exporter component. https://grafana.com/docs/grafana-cloud/monitor-applications/application-observability/setup/scaling/
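A hedged sketch of what that could look like on the tier running the spanmetrics connector, building on the tier-2 sketch above: a resource attribute carrying the pod name (fed from the Kubernetes downward API) plus resource-to-telemetry conversion on the exporter so it ends up as a metric label. The attribute key, environment variable, and endpoint are assumptions.

```yaml
# Hypothetical: stamp each collector replica's metrics with its own pod
# name so that two replicas never write to the same series.
# (spanmetrics connector and traces pipeline as in the tier-2 sketch above)
processors:
  resource:
    attributes:
      - key: collector.pod.name
        value: ${env:POD_NAME}   # exposed via the Kubernetes downward API
        action: insert
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push  # hypothetical backend
    resource_to_telemetry_conversion:
      enabled: true   # copies resource attributes onto each metric as labels
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      processors: [resource]
      exporters: [prometheusremotewrite]
```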
I'll close this issue. If it occurs again, I'll reopen it. Thanks.
Component(s)
connector/spanmetrics
What happened?
Description
Sometimes counter-type metrics grow exponentially.
This has been happening for about 2 months now.
I would run a heap dump if it were a memory leak, but it is the metric values that are growing exponentially, so I don't know how to debug it.
This didn't happen when I was using Tempo's Metrics Generator. However, to create RED metrics with Tempo's Metrics Generator I couldn't sample in the Collector, so I moved to the OpenTelemetry Collector.
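For what it's worth, the pattern that allows this in the Collector is to fan the same spans into two pipelines: one unsampled pipeline that feeds the spanmetrics connector, and one sampled pipeline that exports traces. The sketch below is an assumed illustration (sampling policy and endpoints are placeholders), not the configuration used in this issue.

```yaml
# Hypothetical illustration: RED metrics are generated from every span,
# while only a sample of traces is exported to the backend.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  tail_sampling:
    policies:
      - name: keep-10-percent          # placeholder policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
connectors:
  spanmetrics: {}
exporters:
  otlp/traces:
    endpoint: tempo.example.com:4317   # hypothetical trace backend
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push  # hypothetical metrics backend
service:
  pipelines:
    traces/unsampled:
      receivers: [otlp]
      exporters: [spanmetrics]         # metrics are built from all spans
    traces/sampled:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/traces]         # the backend only sees a sample
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```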
Steps to Reproduce
It doesn't happen in the development environment.
The issue occurs on collectors with some traffic, such as the production environment.
Because of the metric_expiration config, it returns to normal after 5 minutes, but this is only a temporary measure and doesn't help much.
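A minimal sketch of where that expiration is configured, assuming the connector's metrics_expiration option is the setting being referred to:

```yaml
connectors:
  spanmetrics:
    # Assumed to be the option referenced above: series that receive no
    # new spans within this window are dropped from the connector's cache.
    metrics_expiration: 5m
```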
Expected Result
The counter-type metrics should not spike.
Actual Result
Collector version
v0.99
Environment information
Environment
OS: Amazon Linux 2 (AWS EKS)
Compiler (if manually compiled): I don't know
Architecture
Two-layer Collector setup (a load-balancing tier in front of the tier that runs the spanmetrics connector)
OpenTelemetry Collector configuration
Log output
Additional context
I'm constantly trying to disable spanmetrics settings one by one to figure out what's causing the problem.
So far, nothing has worked.