-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[exporter/prometheusremotewrite] "target" metric causes out-of-order samples errors on load balanced collector deployment #12300
Comments
We're also seeing this issue when using the prometheusreceiver with prometheusremotewriteexporter. It seems that the prometheusreceiver converts scrape target label as a datapoint attribute instead of as resource attribute (which seems more appropriate imo), so collectors with the same prometheusreceiver config will have the same labelset on the |
What resource attributes are attached to your OTLP metrics when you send them? The set of resource attributes should uniquely identify the sender, otherwise there would be collisions even if there was one collector per-sender. If resource attributes uniquely identifies the sender, the current timestamp logic should work. If it sends batch 1, and then batch 2, the timestamp of batch 1 will be before batch 2, since the datapoint timestamps of points in batch 1 are all before those in batch 2. There is always the chance that the two batches will be reordered in transit, but there is that chance for any metric.
Can you expand on this? Labels on the target_info metric should be converted to resource attributes. |
also cc @Aneurysm9 |
Here's my test collector config:
I looked at the metrics in fileexporter_output and saw something like the following
Version: 0.52.0 Expected: the static target's label (pod: testpod) should be a resource metric attribute. This means that when the metric is exported by the prometheusremotewriteexporter, the label that helps distinguish the target (ie pod) will not be added as a label to the Essentially, because the |
I think we are probably doing the right thing in the prometheus receiver... The |
opt-out for generating target_info also works since prometheusreceiver already generates |
There appears to be a "single-writer principle" violation happening. I understand the desire to opt-out, but can we somehow address the underlying problem as well? |
we have a setup where the otlp sender pod is responsible for always sending a timeseries to the same collector pod in a k8s replicaset (using consistent hashing), therefore our otlp sender doesn't append its own info like the sender's k8s pod name as a resource attribute. There can be hundreds of otlp senders. If we added pod name it would increase the cardinality of all senders' timeseries 2x when senders are redeployed, therefore we do not want to add it to every time series, but this unfortunately means the target info metric from all the senders are indistinguishable. Another option for us is to only append a unique label to only the target metric, but this would be against the spec |
This is kind of similar to #11870. There seem to be use cases where for one reason or another, the sender's unique ID isn't necessary or desirable as a resource attribute. But not having unique sender ID would always result in the target metric from multiple senders conflicting, whether it's multiple senders sending to one collector, or each sender sending to a different collector but the collectors all writing to the same Prometheus. Hence the need to either disable target metric, or allow a label to be added to only the target metric and not others |
Is that actually the case? Won't it only increase the cardinality of the target_info series? I wouldn't expect that series to represent a meaningful portion of overall cardinality in most cases. |
I'm not convinced that the implementation of (I couldn't find where the specification talks about this detail, does it? Forgive me if the following is incorrect.) From inspection, the Prometheus receiver inserts the
On the surface, it appears that removing the If I were asked to design this feature today, I would probably remove it from That way, OTLP consumers will not see "job" and "instance" attributes repeated in every metric exported by a Prometheus receiver. Is there an issue or design discussion somewhere we can refer to for more information on this topic? |
Job and instance are converted to
That is currently the case, since job and instance are converted to See prom->otlp and otlp->prom for details. It was proposed in open-telemetry/opentelemetry-specification#2381. |
Thank you @dashpole. Sorry for mis-reading the code. I reviewed prom->otlp and otlp->prom specs and remembered 2381 (however, will add that the string "target_info" does not appear in the spec today?). @hardproblems it sounds like you are looking for "Safe attribute removal" in the collector, but I'm still a bit confused. If job and instance are not unique in your setup, is it also true that your application metrics have no overlapping attribute sets? You can avoid a single-writer violation by avoiding overlap, but this means you're maintaining uniqueness across senders. Is this true? If you've got not overlapping application metrics, then "job" and "instance" are not necessarily identifying, so they can be safely removed without special aggregation. This will explain why you only see errors for In the hypothetical safe-attribute-removal processor documented in open-telemetry/opentelemetry-specification#1297, under these circumstances with |
Correct, we use consistent hashing to ensure a unique timeseries is only ever sent to one collector so we only see out of order samples on the Safe removal of attributes is a possible alternative, if we knew ahead of time which resource attributes there are. This is the case for us today so I can give that a try. Will there never be any cases in the future where resource metrics are dynamically created (maybe derived from spans) where the attributes are not known at configuration time? |
Yes, I even managed to confuse myself on that front: #12079 (comment). I'll update the spec to reference target_info (rather than an info-typed metric in the |
Sending to a unique collector does not necessarily avoid a single-writer violation. Since your metrics are exiting the collector through PRW, I assume they end up in a database where you will ask PromQL queries? Single-writer violations can still show up at this point. You are either vulnerable to this, or you have somehow ensured that your timeseries do not overlap aside from For example, Process A writes counter values [10, 11, 12, 13, 14], and Process B writes counter values [0, 1, 2, 3, 4]. It's possible for these to be interleaved in the storage layer so that the database contains [0, 10, 1, 11, 2, 12, 3, 13, 4, 14] if they actually use the same attributes (and you will not be able to correctly compute the rate from this data). |
oh i meant that we're using consistent hashing upstream of the collector to maintain the uniqueness guarantee. A time series (unique combinations of metric name and label values)'s data points from multiple senders will only get sent to one collector. An example of a metric sender is a pod in a replicated load-balanced service that want to report the latency of requests that it served. We don't care about the sender's identity, only the overall percentile / histogram of request latencies see by all senders. The metric sender will use consistent hashing to send each unique time series to a single collector pod so they are aggregated before being send to Prometheus. However, opentelemetry-go sdk adds resource attributes like |
I don't think you've explained how this avoids the single-writer violation. It sounds like you might have invalid timeseries data, if multiple collectors receive overlapping data and write it to the same Prometheus remote-write backend. Prometheus cannot correctly compute rates from interleaved timeseries (actually overlapping).
I understand now, and I support the option to not generate target_info as a simple workaround. It sounds like you could add a resource processor and drop all the superfluous attributes, at which point the target_info would contain precisely Stepping back, I filed the issue about safe attribute removal because I absolutely think the collector should support what you're trying to do, i.e, when you send overlapping metrics on-purpose. It's not a trivial task, however--you end up making tough decisions about how to handle late arriving data. In a Statsd metrics system, it would be commonplace to do the kind of aggregation you are describing, but that relies on a feature of the Statsd protocol: points are not timestamped. They arrive "not late" by definition. We can impose a condition similar to statsd that data is "not late" by definition in order to simplify the processor needed to correctly, safely remove attributes from metrics data in a collector. |
Would you please link to the doc on single-writer principle? My understanding of that is a unique time series should only be sent by a single writer, which we can guarantee in our environment. |
I don't want to make it sound like you're doing something wrong, I was just having trouble understanding how your series are unique when |
application pods --> metric aggregator pods --> per-pod otel-collector is a very high level description of our setup. |
That makes sense. Thank you for being specific and bearing with me. We've fully explained why only the Definitely, the intention behind a "safe attribute removal processor" is to support what you're doing in the aggregator pods as an otel-collector pipeline instead. Thanks! |
Question around this bug, I think we're encountering it, in the process of upgrading to the latest contrib. Has this potential to negative impact other metrics? Cause we've noticed that things like counters in prometheus invariably seem to be 10% off so I'm wondering if they're getting swallowed as a result of this. |
@mrblonde91 No, this shouldn't impact any other metrics. |
Describe the bug
We operate a cluster of multiple opentelemetry-collector configured to be load balanced and autoscaled. We recently updated from 0.35 to 0.49. This update caused our backend to start complaining about out-of-order samples on the "target" metric:
We think the issue stems from how the timestamp of the "target" metric is generated. The timestamp of the sample is generated from the batch of metrics and because multiple samples are generated concurrently in different instance of opentelemetry, they may not align and end up in the backend with the timestamp out of order. In addition, because every instance of opentelemetry emits the metric with the exact same set of labels, it's causing metric collisions.
Steps to reproduce
What did you expect to see?
We wish to have an additionnal label just for the "target" metric. This label would be allow to have a separate serie for each instance of opentelemetry-collector.
external_labels
is not appropriate for our use-case because it affect all metrics and causes duplication issues, cardinality explosion and generally a big mess.What did you see instead?
Many errors in the logs of opentelemetry-collector and multiple dropped samples of the "target" metric because of an out-of-order samples error.
What version did you use?
0.49
What config did you use?
Environment
deployed in AWS EKS
remotewrite endpoint is AWS Amazon Managed Service for Prometheus
Additional context
The "target" metric has been added in the following pr: #8493. There is an interesting comment about the choice of the timestamp of the sample.
I am willing to work on a pr.
The text was updated successfully, but these errors were encountered: