[connector/spanmetricsconnector] Generated counter drops then disappears #33421
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have a few questions that might help us track down this issue: is there any chance your collector is restarting at these points? Are you running just one collector, or many in a gateway mode?
I'm running the collector as a deployment and have tried both 1 and 3 replicas. The collector did not restart; I had to terminate the pods to keep the metrics exporting.
I see... honestly, at this point I don't quite know what would cause it to eventually stop emitting metrics at all - that's the symptom that is really throwing me for a loop. Are you still having these problems? Can you try increasing resource_metrics_cache_size? The thought is that this might prevent evictions, which might in turn prevent the resets. Other things that might help us track down this problem: what is the count of unique series within calls_total over time? Are the resets happening for series that the TSDB has already seen, or are there entirely new series?
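For reference, resource_metrics_cache_size is set on the spanmetrics connector itself. A minimal sketch of raising it; the value here is illustrative, not a tuned recommendation from this thread:

connectors:
  spanmetrics:
    # Bounds how many resource-level metric groups the connector keeps.
    # When the cache fills up, older entries are evicted and their counters
    # start again from zero, which shows up as resets in the TSDB.
    resource_metrics_cache_size: 20000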
Thanks for your update. Did you try changing the cache size? I'm honestly a little stumped - any ideas @portertech @Frapschen?
With the current config the connector will permanently cache every series it sees and send them all during each flush, even the ones where nothing has changed. So eventually the payload flushed to the exporter keeps growing.
Possible things that could help are:
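One knob in this direction is the connector's metrics_expiration setting, which evicts series that stop receiving spans so they are no longer re-sent on every flush. A minimal sketch with illustrative values (the poster's manifest further down also sets both fields):

connectors:
  spanmetrics:
    # How often accumulated metrics are flushed to the metrics pipeline.
    metrics_flush_interval: 15s
    # Series that receive no new spans within this window are dropped from
    # the cache instead of being flushed forever. The duration is illustrative.
    metrics_expiration: 30m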
@duc12597 Have you tried switching
We will consider this option. As of now the collector has been running for 2 weeks without any errors, although there are still counter fluctuations; I'm not sure whether that is thanks to any changes on our side. I will close this issue for now and re-open it if the problem resurfaces. This is my complete collector manifest:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: 03-sink-metric-prometheus
spec:
image: mirror.gcr.io/otel/opentelemetry-collector-contrib:0.102.0
replicas: 5
nodeSelector:
mycompany.com/service: observability
kubernetes.io/arch: amd64
tolerations:
- effect: NoSchedule
key: mycompany.com/service
value: observability
operator: Equal
config: |
receivers:
prometheus:
config:
scrape_configs:
- job_name: 03-sink-metric-prometheus
scrape_interval: 10s
static_configs:
- targets: ['127.0.0.1:8888']
kafka/traces:
protocol_version: 3.3.1
brokers:
- b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
- b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
auth:
tls:
insecure: true
topic: otlp_spans
group_id: 03-sink-metric-prometheus
kafka/metrics:
protocol_version: 3.3.1
brokers:
- b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
- b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
auth:
tls:
insecure: true
topic: otlp_metrics
group_id: 03-sink-metric-prometheus
processors:
filter:
error_mode: ignore
metrics:
datapoint:
- 'IsMatch(attributes["http.target"], ".*.(css|js)")'
transform:
error_mode: ignore
metric_statements:
- context: datapoint
statements:
# reduce the cardinality of metrics with params
- replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
connectors:
spanmetrics:
dimensions:
- name: http.method
- name: http.target
- name: http.status_code
- name: host.name
- name: myCustomLabel
exclude_dimensions:
- span.kind
- span.name
- status.code
exemplars:
enabled: true
metrics_flush_interval: 15s
metrics_expiration: 1h
resource_metrics_key_attributes:
- service.name
- telemetry.sdk.language
- telemetry.sdk.name
resource_metrics_cache_size: 10000
exporters:
debug:
prometheusremotewrite:
endpoint: http://mimir-nginx/api/v1/push
send_metadata: true
service:
telemetry:
metrics:
address: 127.0.0.1:8888
level: detailed
extensions:
- sigv4auth
pipelines:
traces:
receivers:
- kafka/traces
processors: []
exporters:
- spanmetrics
metrics:
receivers:
- kafka/metrics
- prometheus
- spanmetrics
processors:
- filter
- transform
exporters:
- debug
- prometheusremotewrite
env:
- name: GOMEMLIMIT
value: 1640MiB # 80% of resources.limits.memory
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
      memory: 2Gi
@duc12597 Sorry for pinging you; there is a related issue about this counter fluctuation. Please see #34126 (comment) for the fix.
If I understand correctly, this will add a UUID as a label for every metric generated by each collector pod. Will this explode the cardinality? Why does a UUID solve the fluctuation? Can you give an example config? Thanks a ton. |
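For illustration only, and as a guess at the approach behind #34126 rather than a confirmed excerpt from it: one common way to keep replicas from writing to the same series is to stamp a per-pod attribute onto the generated metrics, for example with the resource processor and the pod name from the Kubernetes downward API. The attribute key collector.pod.name, the processor name resource/collector-id, and the POD_NAME variable below are assumptions for this sketch.

# In the collector config: add a stable per-replica attribute.
processors:
  resource/collector-id:
    attributes:
      - key: collector.pod.name   # hypothetical key; any unique-per-replica label works
        value: ${env:POD_NAME}
        action: insert

# In the OpenTelemetryCollector spec, alongside the existing env entries:
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name

The processor would then be added to the metrics pipeline so it runs on the spanmetrics output. Cardinality grows by at most a factor of the replica count, since each existing series gains one label value per pod rather than a free-form UUID per write.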
Component(s)
connector/spanmetrics
What happened?
Description
Our collector receives OTLP traces from Kafka, converts them into metrics, and exports them to a TSDB. After a certain period of collector uptime (24-48 hours), the generated calls_total counter suffers a significant drop in value. Eventually no more metrics are exported at all.
Steps to Reproduce
Follow the collector configuration below.
Expected Result
The calls_total counter is ever-increasing.
Actual Result
The calls_total counter drops and then disappears.
Collector version
v0.101.0
Environment information
Environment
AWS EKS 1.24
OpenTelemetry Collector configuration
Log output
Additional context