Googlemanagedprometheus exporter randomly falls into an infinite error state #31507
Comments
That error usually means you either have multiple collectors trying to write the same set of metrics, or that the metrics being sent contain duplicates. Are you able to reproduce this with a single collector?
I have only one instance of this collector running at a time. We are also using labels with a Namespace and a Pod name (Deployment), which are de facto unique. E.g. let's take this error: And as described in the topic, we are just starting this one Pod and sometimes it works correctly all the time, and sometimes it prints errors all the time. We are just building our own image with Maybe there is a problem in our configuration, but to me it looks like wrong behavior of the collector (exporter?), especially since a smaller timeout increases the probability of the problem on a Pod. Do you have any tips on how to look for the duplicated data point? The metrics in the error logs look fine on their own.
Can you turn off sampling in the logging exporter, and export to both GMP and logging?
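For reference, a minimal sketch of that kind of dual-export setup, assuming the standard `logging` exporter options in v0.95.0 (`sampling_thereafter: 1` effectively disables sampling) and hypothetical `prometheus`/`batch` component names:

```yaml
exporters:
  googlemanagedprometheus: {}
  logging:
    verbosity: detailed
    # log every batch instead of sampling, so duplicated points are visible
    sampling_initial: 1
    sampling_thereafter: 1

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # hypothetical receiver name
      processors: [batch]
      exporters: [googlemanagedprometheus, logging]
```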
I assume you are using the downward API to set the pod name, then?
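For context, the usual downward-API pattern for exposing the pod name to the collector looks like the sketch below (the `POD_NAME` variable name is hypothetical); the collector config can then reference it as `${env:POD_NAME}`:

```yaml
# Pod spec excerpt: expose the pod's own name via the Kubernetes downward API
containers:
  - name: otel-collector
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```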
Yes. In the original post I simplified my collector configuration - actually we are using 2 receivers and pipelines - one for internal otel metrics and one for metrics from our application. I didn't find it relevant, but now it looks like it may be important. In the original config we have an interval of 60s for app metrics and 30s for otel metrics - and we get the duplicate errors once a minute. So, analyzing those times, it looks like the problem appears when we scrape app and otel metrics at the same time. Maybe some pods didn't have the problem because there was a random offset between the app and otel metric scrapes? With the logging exporter there is a large number of logs, but I am adding a segment of it here: And here is the config for it:
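The actual configuration was attached as a collapsed block; purely as an illustration of the two-pipeline layout described above (job names, ports, and component names are hypothetical, not the reporter's real config), such a setup might look like:

```yaml
receivers:
  prometheus/app:
    config:
      scrape_configs:
        - job_name: app
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8080"]
  prometheus/otel:
    config:
      scrape_configs:
        - job_name: otel-internal
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:8888"]   # collector's own telemetry endpoint

processors:
  batch:
    timeout: 5s

exporters:
  googlemanagedprometheus: {}

service:
  pipelines:
    metrics/app:
      receivers: [prometheus/app]
      processors: [batch]
      exporters: [googlemanagedprometheus]
    metrics/otel:
      receivers: [prometheus/otel]
      processors: [batch]
      exporters: [googlemanagedprometheus]
```

With a 60s and a 30s interval, the two scrapes align once per minute and both pipelines flush into the exporter at nearly the same time, which matches the once-a-minute pattern of the errors.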
I guess some `scrape_offset` parameter for the `prometheusreceiver` would be a workaround for this problem, but I don't see anything like that.
Usually the error message gives you a particular metric + labels that failed. If you can find what the logging exporter prints just before the export for that metric, that might point to how it ended up duplicated. If that doesn't give you enough info, you can paste it here and I might be able to figure out why it resulted in the error. If that still doesn't work, and you are able to get the OTLP in JSON using the JSON exporter, I can actually replay it using our testing framework and figure out why it isn't working.
I don't see anything unusual there.
Just before that there is a different metric, `otel_internal_scrape_series_added`. And earlier there is `otel_internal_otelcol_process_runtime_total_sys_memory`, where we can see that there is only 1 data point. And as you can see in my description in the thread - each minute there is a different metric in the error log. How can I use the JSON exporter? I don't see it in the repo.
Ah, sorry. It's called the file exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/fileexporter I'll try to reproduce it with your config above.
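A minimal sketch of dumping the OTLP payload as JSON alongside the normal export, assuming the file exporter's `format: json` option (the path and receiver name are hypothetical):

```yaml
exporters:
  file:
    path: /tmp/otlp-metrics.json
    format: json   # write the OTLP data as JSON for offline replay

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # hypothetical receiver name
      exporters: [file, googlemanagedprometheus]
```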
I'm sending 10 MB of metrics from the problematic pod:
I've run the first 22 batches of metrics through the replay mechanism, grouped by timestamp, which covers the first 90 seconds. I haven't been able to produce any errors: GoogleCloudPlatform/opentelemetry-operations-go#809 It looks like the errors occur every minute, so I should have found one by now. Do you have any of the error logs from during that run?
Here are the logs from this exact Pod since startup. At the beginning there is its configuration.
I see that the first timestamp in the metrics is 15:43:36, the next ones are 15:44:04 (app metrics) and 15:44:06 (otel internal metrics), and the first error log is at 15:44:08, so I think so. We are running the collector in a Pod in Kubernetes with CPU 100m-300m, so it runs on only 1 core. I'm not sure whether that influences the behavior of the 2 pipelines running at the same time, as the difference is 2 seconds and the batch processor waits 5 seconds.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
exporter/googlemanagedprometheus
What happened?
Description
Sometimes, when a pod in GKE running the OpenTelemetry Collector starts up, it reports the error "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric." every subsequent minute (the scrape interval is 30s). After restarting the pod the problem disappears. After some more restarts, the problem happens again.
It looks like all the metrics are sent properly to Google Monitoring, but every minute additional duplicated data points are added to the batch, which causes the errors.
Steps to Reproduce
Create a pod in Google Kubernetes Engine running the OpenTelemetry Collector with a config similar to ours. If the problem does not occur, delete the pod and recreate it. Repeat until you see consistent error logs.
Expected Result
If there is a problem with saving a data point to Google Monitoring that causes a duplicated data point to be sent the next minute, it should not repeat indefinitely every minute.
Actual Result
The error with a duplicated data point puts the OpenTelemetry exporter into an infinite error state, which is fixed only when the pod is deleted.
Collector version
v0.95.0
Environment information
Environment
Google Kubernetes Engine
Base image: ubi9/ubi
Compiler (if manually compiled): go 1.21.7
OpenTelemetry Collector configuration
Log output
Additional context
I made some additional tests and it looks like the googlemanagedprometheus timeout could be related to the problem.
With a 10s timeout, I got 5 pods with errors out of 12 pods started.
With a 15s timeout, I got 1 pod with errors out of 20 pods started.
So maybe there is a problem with the export timeout, but this behavior with infinite errors does not look correct.
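For reference, the timeout varied in these tests is the exporter-level timeout; assuming the standard exporterhelper `timeout` option on the googlemanagedprometheus exporter, the two variants tested would be:

```yaml
exporters:
  googlemanagedprometheus:
    # 10s reproduced errors on 5 of 12 pods; 15s on 1 of 20
    timeout: 15s
```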
Histogram for 10s timeout:
Histogram for 15s timeout:
Almost 2 hours of errors later (the same pod):
A blue rectangle means a new Pod started. A red rectangle means the error described in this issue.
All pods are exactly the same, just with different names with random suffixes.