[connector/routing] Outage of one endpoint blocks entire pipeline #31775
Comments
I have tried a few different configurations, all with the same outcome: telemetry is blocked if one endpoint is down. Let me flesh out our environment. All of our logs get shipped to Loki by default, but some of the data also needs to be shipped to additional endpoints (Kafka topics and Azure Event Hubs). Call our tenants foo, bar, and baz. Our collector setup is the following:
So all our logs are shipped to one endpoint, then forwarded to a routing stage, before being split off into backend-specific collectors. The problem is that the unavailability of any single backend collector blocks the entire telemetry pipeline. I have tried a few different concepts for the routing stage: routing to tenant-specific pipelines, and routing to backend-specific pipelines. Example configs for each case are below.
# backend-specific routing
connectors:
  routing:
    default_pipelines:
      - logs/loki
    error_mode: ignore
    table:
      - pipelines:
          - logs/eventhub
          - logs/kafka/foo
          - logs/loki
        statement: route() where attributes["service_component"] == "foo"
      - pipelines:
          - logs/eventhub
          - logs/kafka/bar
          - logs/loki
        statement: route() where attributes["service_component"] == "bar"
      - pipelines:
          - logs/eventhub
          - logs/kafka/baz
          - logs/loki
        statement: route() where attributes["service_component"] == "baz"
exporters:
  otlp/eventhub:
    endpoint: otel-eventhub-distributor-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/foo:
    endpoint: otel-kafka-foo-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/bar:
    endpoint: otel-kafka-bar-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/baz:
    endpoint: otel-kafka-baz-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/loki:
    endpoint: otel-backend-loki-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/incoming:
      exporters:
        - routing
      processors:
        - memory_limiter
      receivers:
        - otlp
    logs/eventhub:
      exporters:
        - otlp/eventhub
      receivers:
        - routing
    logs/kafka/foo:
      exporters:
        - otlp/kafka/foo
      receivers:
        - routing
    logs/kafka/bar:
      exporters:
        - otlp/kafka/bar
      receivers:
        - routing
    logs/kafka/baz:
      exporters:
        - otlp/kafka/baz
      receivers:
        - routing
    logs/loki:
      exporters:
        - otlp/loki
      receivers:
        - routing

And the tenant-specific routing:

# tenant-specific routing
connectors:
  routing:
    default_pipelines:
      - logs/default
    error_mode: ignore
    table:
      - pipelines:
          - logs/foo
        statement: route() where attributes["service_component"] == "foo"
      - pipelines:
          - logs/bar
        statement: route() where attributes["service_component"] == "bar"
      - pipelines:
          - logs/baz
        statement: route() where attributes["service_component"] == "baz"
exporters:
  otlp/eventhub:
    endpoint: otel-eventhub-distributor-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/foo:
    endpoint: otel-kafka-foo-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/bar:
    endpoint: otel-kafka-bar-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/baz:
    endpoint: otel-kafka-baz-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/loki:
    endpoint: otel-backend-loki-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/incoming:
      exporters:
        - routing
      processors:
        - memory_limiter
      receivers:
        - otlp
    logs/foo:
      exporters:
        - otlp/eventhub
        - otlp/kafka/foo
        - otlp/loki
      receivers:
        - routing
    logs/bar:
      exporters:
        - otlp/eventhub
        - otlp/kafka/bar
        - otlp/loki
      receivers:
        - routing
    logs/baz:
      exporters:
        - otlp/eventhub
        - otlp/kafka/baz
        - otlp/loki
      receivers:
        - routing
    logs/default:
      exporters:
        - otlp/loki
      receivers:
        - routing

Both setups are vulnerable if one of the otlp exporters cannot ship data.
As suggested by @jpkrohling, I tested using the forward connector and filtering. This didn't work either (same behaviour: one dead pipeline kills them all). I think it's quite hard to decouple pipelines in the collector; the coupling seems to be baked in at a very low level. The fanout consumer is used whenever a receiver or exporter is shared by multiple pipelines, and it runs synchronously, so one failure blocks everything. I think this is the root cause; it's not actually specific to the routing connector. Can we use the exporter helper to specify whether a given exporter should be considered "blocking" or not? Or would making the fanoutconsumer asynchronous help?
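A sketch of what that forward + filter shape might look like, shown here for a single tenant (the filter condition and pipeline names are assumptions, not the exact config that was tested):

# sketch: forward connector + per-tenant filter (one tenant shown; condition is an assumption)
connectors:
  forward:
processors:
  filter/foo:
    error_mode: ignore
    logs:
      log_record:
        - 'attributes["service_component"] != "foo"'  # drop everything that is not tenant foo
service:
  pipelines:
    logs/incoming:
      receivers:
        - otlp
      exporters:
        - forward
    logs/foo:
      receivers:
        - forward
      processors:
        - filter/foo
      exporters:
        - otlp/kafka/foo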
Having thought some more about this, here's where I am:
So I think the work here needs to happen in the exporter helper, and we need to optionally shard retries and the sending queue by incoming context.
I think that's what we are going towards. See open-telemetry/opentelemetry-collector#8122
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners directly.
I'm closing this for now, as we seem to agree that this should be handled at the exporter helper.
Component(s)
connector/routing
What happened?
Description
We have a use case where we want to route telemetry to different collectors based on a resource attribute. For this, we use the routing connector. We have observed that if one of the endpoints is unavailable, the entire pipeline will be blocked.
Steps to Reproduce
A very minimal example:
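Roughly, assuming two tenants foo and bar routed to two otlp exporters (endpoint names and the attribute are illustrative):

# sketch: minimal routing setup with two otlp endpoints (names are illustrative)
connectors:
  routing:
    default_pipelines:
      - logs/foo
    error_mode: ignore
    table:
      - pipelines:
          - logs/bar
        statement: route() where attributes["service_component"] == "bar"
exporters:
  otlp/foo:
    endpoint: collector-foo:4317
    tls:
      insecure: true
  otlp/bar:
    endpoint: collector-bar:4317
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/in:
      receivers:
        - otlp
      exporters:
        - routing
    logs/foo:
      receivers:
        - routing
      exporters:
        - otlp/foo
    logs/bar:
      receivers:
        - routing
      exporters:
        - otlp/bar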
If either the otlp/bar or otlp/foo endpoint is down, no data will be received on the other endpoint. Effectively, one endpoint outage can cause the entire pipeline to go dark.
Expected Result
I would expect that the routing connector should forward data to all healthy pipelines, and not block all routing in case of one unhealthy pipeline.
Actual Result
A single unhealthy pipeline blocks delivery of all telemetry.
Collector version
0.95.0 (custom build)
Environment information
Environment
Kubernetes 1.28
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response