Update to Otel-Collector 0.93.0 and adjust resource settings to compensate changes #771

Closed
a-thaler opened this issue Feb 5, 2024 · 1 comment
Labels: area/metrics (MetricPipeline), area/traces (TracePipeline), kind/feature
Milestone: 1.9.0

a-thaler commented Feb 5, 2024

Description

The newest otel-collector version (0.93.0) comes with some important changes. The goal of this ticket is to update to that version and adjust the resource settings to compensate for the changes.

Actions included in the update:

  • update otel-collector to 0.93.0
  • understand and apply the memory-limiter changes -> different retry handling on the agent side (see the configuration sketch after this list)
    • run performance tests for the metric agent and come up with optimized numbers
  • the memory ballast extension is deprecated; understand and decide on alternatives
  • while doing trace performance tests, re-evaluate the memory request for the trace collector
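
A minimal sketch, assuming the standard memory_limiter processor fields, of the agent-side configuration these actions would need to tune; the interval and percentages below are illustrative placeholders, not the optimized numbers this ticket asks for:

```yaml
# Sketch only: check_interval and the percentages are placeholders; the real
# values should come out of the metric-agent performance tests.
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  memory_limiter:
    check_interval: 1s          # how often the limiter samples memory usage
    limit_percentage: 75        # hard limit as a percentage of the available memory
    spike_limit_percentage: 15  # spike headroom; soft limit = limit - spike limit
  batch: {}

exporters:
  otlp:
    endpoint: metrics-gateway:4317   # hypothetical gateway endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```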

Outcome

  • The new memory_limiter extension introduced with OpenTelemetry Collector release 0.93 will potentially replace the existing memory_limiter processor, but it is not available yet and is still under development; the current implementation of the memory_limiter extension is just a copy of the memory_limiter processor (see here)

  • The recent OpenTelemetry Collector 0.93 release delivers a fix for the OTLP gRPC receiver returning the wrong error: in back-pressure scenarios such as high memory usage, when the exporter cannot export telemetry data (too many requests or backend outages), the error was wrongly propagated back to the sender as a permanent error, which caused data to be dropped without retrying. See the logs below for the bug in the previous version and for the same test on 0.93, run with simulated backend outages against a MetricPipeline with the Metric Agent setup

OTEL Col v0.92

{"level":"error","ts":1706703769.5337207,"caller":"exporterhelper/retry_sender.go:102","msg":"Exporting failed. The error is not retryable. Dropping data.","kind":"exporter","data_type":"metrics","name":"otlp","error":"Permanent error: rpc error: code = Unknown desc = data refused due to high memory usage","dropped_items":1024,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\tgo.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/retry_sender.go:102\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send\n\tgo.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/metrics.go:170\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).consume\n\tgo.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/queue_sender.go:123\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue[...]).Consume\n\tgo.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/bounded_memory_queue.go:57\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start.func1\n\tgo.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/consumers.go:43"}

OTEL Col v0.93

{"level":"info","ts":1706705051.4944198,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = data refused due to high memory usage","interval":"42.120435243s"}

In this test, backend outages create back pressure on the Metric Gateway side, which is propagated back to the Metric Agent as a non-permanent error, so the Metric Agent queues the exports and retries after the configured intervals.
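
For reference, the retry behaviour shown above is driven by the standard exporterhelper settings on the agent's OTLP exporter; a minimal sketch with illustrative values (not the values used in the actual pipeline configuration):

```yaml
# Sketch only: intervals and queue size are illustrative; the endpoint is hypothetical.
exporters:
  otlp:
    endpoint: metrics-gateway:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # first delay after a retryable (non-permanent) error
      max_interval: 30s       # upper bound for the exponential backoff interval
      max_elapsed_time: 300s  # stop retrying (and drop) after this total time
    sending_queue:
      enabled: true
      queue_size: 512         # batches buffered while the gateway refuses data
```

With the 0.93 fix, the gateway's refusal surfaces as a retryable gRPC Unavailable error, so these retry settings take effect instead of the data being dropped as a permanent error.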

Conclusion

  • The new memory_limiter extension is not ready yet, so no action is required here. However, this extension will most probably replace the existing processor implementation, therefore we should watch its development and start adopting the new extension in our setup as early as possible.
  • The bug fix Ensure OTLP receiver handles consume errors correctly open-telemetry/opentelemetry-collector#4335 solves the problem of dropped telemetry data when an export fails because the next receiver in the chain refuses it. This impacts the MetricPipeline components. First tests show that the bug is fixed and that the propagated error results in an export retry. However, the Metric Agent backpressure test with this fix shows a different memory usage characteristic than the previous version and results in an OOM. The MetricPipeline (and potentially the TracePipeline) requires new tests and new memory_limiter processor adjustments (see the resource sketch after this list).
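
A sketch of the kind of resource adjustment the OOM observation implies; the memory values are placeholders and have to be replaced by the numbers derived from the new performance test runs:

```yaml
# Sketch only: placeholder values, assuming a Kubernetes container spec for the
# metric agent; the real requests/limits come from the new performance tests.
resources:
  requests:
    memory: 64Mi    # hypothetical baseline request
  limits:
    memory: 1Gi     # hypothetical hard limit; the memory_limiter percentages
                    # and GOMEMLIMIT should be derived from this value
```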
@hisarbalik (Contributor) commented:

Missing part: GOMEMLIMIT and the PoC with the TracePipeline are documented.
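
For context on the GOMEMLIMIT piece: a minimal sketch, assuming a Kubernetes deployment of the collector, of how GOMEMLIMIT typically replaces the deprecated memory ballast extension (the value below is illustrative, usually set to roughly 80% of the container memory limit):

```yaml
# Sketch only: the container name and values are hypothetical.
containers:
  - name: metric-agent
    env:
      - name: GOMEMLIMIT
        value: "800MiB"   # soft memory limit for the Go runtime, replacing the ballast extension
    resources:
      limits:
        memory: 1Gi
```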

a-thaler added this to the 1.9.0 milestone Feb 21, 2024