[connector/routing] Outage of one endpoint blocks entire pipeline #31775

Closed
verejoel opened this issue Mar 15, 2024 · 7 comments
Assignee: jpkrohling
Labels: bug (Something isn't working), connector/routing, Stale

Comments

@verejoel

Component(s)

connector/routing

What happened?

Description

We have a use case where we want to route telemetry to different collectors based on a resource attribute. For this, we use the routing connector. We have observed that if one of the endpoints is unavailable, the entire pipeline will be blocked.

Steps to Reproduce

A very minimal example:

    connectors:
      routing:
        default_pipelines:
        - logs/foo
        error_mode: ignore
        table:
        - pipelines:
          - logs/bar
          statement: route() where attributes["service_component"] == "bar"
    
    exporters:
      otlp/foo:
        endpoint: otel-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/bar:
        endpoint: otel-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true

    service:
      pipelines:
        logs/foo:
          exporters:
          - otlp/foo
          receivers:
          - routing

        logs/bar:
          exporters:
          - otlp/bar
          receivers:
          - routing

If either the otlp/foo or otlp/bar endpoint is down, no data is delivered to the other, healthy endpoint either. Effectively, a single endpoint outage can cause the entire pipeline to go dark.

Expected Result

I would expect the routing connector to keep forwarding data to all healthy pipelines rather than block all routing because of one unhealthy pipeline.

Actual Result

A single unhealthy pipeline blocks delivery of all telemetry.
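
Note that the repro above disables the sending_queue on both exporters while retry_on_failure is left at its defaults, so each failed export is retried synchronously on the pipeline's consumer call and back-pressure propagates all the way up to the receiver. As far as I understand the exporter helper, leaving the queue enabled (the default) makes the handoff asynchronous: a dead endpoint backs up its own queue and eventually refuses new data with a queue-full error instead of holding the caller for the whole retry window. A sketch of the same exporter with the queue enabled, values illustrative:

    exporters:
      otlp/bar:
        endpoint: otel-bar-collector:4317
        tls:
          insecure: true
        # default-on queue: the pipeline hands data off and returns immediately;
        # a down endpoint backs up here instead of blocking the caller
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 1000
        # bound how long failed batches are retried before being given up on
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s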

Collector version

0.95.0 (custom build)

Environment information

Environment

Kubernetes 1.28

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

verejoel added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Mar 15, 2024

github-actions bot (Contributor)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling removed the needs triage (New item requiring triage) label on Mar 15, 2024
jpkrohling self-assigned this on Mar 15, 2024

@verejoel (Author)

I have tried a few different configurations, all with the same outcome -> telemetry is blocked if one endpoint is down. Let me flesh out our environment.

All of our logs get shipped to Loki by default, but some of the data needs to also be shipped to additional endpoints (Kafka topics and Azure Eventhub). Call our tenants foo, bar, and baz, and our endpoints loki, kafka/foo, kafka/bar, kafka/baz, and kafka/eventhub. Note that each "endpoint" is actually an OTel collector.

Our collector setup is the following:

ingress gateway -> router -> backend-specific collectors

So all our logs are shipped to a single ingress endpoint, then forwarded to a routing stage, before being split off into backend-specific collectors.

The problem we have is that the unavailability of any single one of our backend collectors blocks the entire telemetry pipeline.

I have tried a few different concepts for the routing stage. I have tried routing to tenant-specific pipelines, and routing to backend-specific pipelines. Examples of the config for each case below:

    # backend-specific routing
    connectors:
      routing:
        default_pipelines:
        - logs/loki
        error_mode: ignore
        table:
        - pipelines:
          - logs/eventhub
          - logs/kafka/foo
          - logs/loki
          statement: route() where attributes["service_component"] == "foo"
        - pipelines:
          - logs/eventhub
          - logs/kafka/bar
          - logs/loki
          statement: route() where attributes["service_component"] == "bar"
        - pipelines:
          - logs/eventhub
          - logs/kafka/baz
          - logs/loki
          statement: route() where attributes["service_component"] == "baz"
    exporters:
      otlp/eventhub:
        endpoint: otel-eventhub-distributor-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/foo:
        endpoint: otel-kafka-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/bar:
        endpoint: otel-kafka-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/baz:
        endpoint: otel-kafka-baz-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/loki:
        endpoint: otel-backend-loki-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    service:
      pipelines:
        logs/incoming:
          exporters:
          - routing
          processors:
          - memory_limiter
          receivers:
          - otlp
        logs/eventhub:
          exporters:
          - otlp/eventhub
          receivers:
          - routing
        logs/kafka/foo:
          exporters:
          - otlp/kafka/foo
          receivers:
          - routing
        logs/kafka/bar:
          exporters:
          - otlp/kafka/bar
          receivers:
          - routing
        logs/kafka/baz:
          exporters:
          - otlp/kafka/baz
          receivers:
          - routing
        logs/loki:
          exporters:
          - otlp/loki
          receivers:
          - routing

And tenant-specific routing:

    # tenant-specific routing
    connectors:
      routing:
        default_pipelines:
        - logs/default
        error_mode: ignore
        table:
        - pipelines:
          - logs/foo
          statement: route() where attributes["service_component"] == "foo"
        - pipelines:
          - logs/bar
          statement: route() where attributes["service_component"] == "bar"
        - pipelines:
          - logs/baz
          statement: route() where attributes["service_component"] == "baz"
    exporters:
      otlp/eventhub:
        endpoint: otel-eventhub-distributor-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/foo:
        endpoint: otel-kafka-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/bar:
        endpoint: otel-kafka-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/baz:
        endpoint: otel-kafka-baz-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/loki:
        endpoint: otel-backend-loki-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    service:
      pipelines:
        logs/incoming:
          exporters:
          - routing
          processors:
          - memory_limiter
          receivers:
          - otlp
        logs/foo:
          exporters:
          - otlp/eventhub
          - otlp/kafka/foo
          - otlp/loki
          receivers:
          - routing
        logs/bar:
          exporters:
          - otlp/eventhub
          - otlp/kafka/bar
          - otlp/loki
          receivers:
          - routing
        logs/baz:
          exporters:
          - otlp/eventhub
          - otlp/kafka/baz
          - otlp/loki
          receivers:
          - routing
        logs/default:
          exporters:
          - otlp/loki
          receivers:
          - routing

Both situations are vulnerable in case one of the otlp exporters cannot ship data.

@verejoel (Author)

As suggested by @jpkrohling, I tested using the forward connector and filtering (sketch below). That didn't work either (same behaviour: one dead pipeline kills them all).

I think it's quite hard to decouple pipelines in the collector; the coupling seems to be baked in at a very low level…

The fanout consumer seems to be used whenever receivers or exporters are shared across multiple pipelines, and it runs synchronously, so one failure blocks everything. I think this is the root cause; it's not actually specific to the routing connector.

Can we use the exporter helper to specify whether a given exporter should be considered "blocking" or not? Or would making the fanoutconsumer asynchronous help?
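
For reference, the forward + filter variant was wired roughly like this (a sketch of the idea rather than the exact config I ran; the filter/foo processor and pipeline names are illustrative, based on the configs above):

    connectors:
      forward:

    processors:
      filter/foo:
        error_mode: ignore
        logs:
          log_record:
          # drop everything that is not destined for the foo backend
          - resource.attributes["service_component"] != "foo"

    service:
      pipelines:
        logs/incoming:
          exporters:
          - forward
          receivers:
          - otlp
        logs/kafka/foo:
          exporters:
          - otlp/kafka/foo
          processors:
          - filter/foo
          receivers:
          - forward
        # the loki pipeline has no filter and receives everything
        logs/loki:
          exporters:
          - otlp/loki
          receivers:
          - forward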

@verejoel (Author)

Having thought some more about this, here's where I am:

  • it is correct that backpressure should be applied to the entire pipeline in case one endpoint is down
  • it would be useful to be able to prioritize telemetry within a pipeline and apply backpressure to specific tenants rather than to all telemetry
  • currently we could do this with separate contexts per tenant; however, right now the sending_queue seems to be the blocker, as it is not sharded by incoming context

So I think the work here needs to happen in the exporter helper, and we need to optionally shard retries / sending queue by incoming context.
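
Until something like that exists in the exporter helper, the closest config-level approximation I can see is giving each tenant its own copy of the shared exporters, since the sending queue and retry state live per exporter instance. A sketch for the tenant-specific layout above (the otlp/eventhub/* names are hypothetical; both point at the same collector):

    exporters:
      # one exporter instance per tenant -> one queue / retry state per tenant,
      # so a backlog for foo no longer holds up bar
      otlp/eventhub/foo:
        endpoint: otel-eventhub-distributor-collector:4317
        tls:
          insecure: true
      otlp/eventhub/bar:
        endpoint: otel-eventhub-distributor-collector:4317
        tls:
          insecure: true

    service:
      pipelines:
        logs/foo:
          exporters:
          - otlp/eventhub/foo
          - otlp/kafka/foo
          - otlp/loki
          receivers:
          - routing
        logs/bar:
          exporters:
          - otlp/eventhub/bar
          - otlp/kafka/bar
          - otlp/loki
          receivers:
          - routing

This only helps if the sending_queue stays enabled on those exporters, of course.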

@jpkrohling (Member)

> So I think the work here needs to happen in the exporter helper

I think that's what we are going towards. See open-telemetry/opentelemetry-collector#8122

github-actions bot commented Jul 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jul 1, 2024

@jpkrohling (Member)

I'm closing this for now, as we seem to agree that this should be handled at the exporter helper.
