
loadbalancing: Collector fails to start if k8s_resolver encounters issues with watch/list endpoints #33804

Open
khyatigandhi0612 opened this issue Jun 28, 2024 · 2 comments
Labels: bug (Something isn't working), exporter/loadbalancing


Component(s)

exporter/loadbalancing

What happened?

Description

The loadbalancing exporter in the OpenTelemetry Collector Contrib package prevents the collector from starting when the k8s resolver cannot watch/list the target service's endpoints, and the resolver continuously logs the same errors.
This can occur in several scenarios.
For example:
Missing Role/RoleBinding: The collector pod's service account lacks the Role or RoleBinding required to access the Kubernetes API resources (endpoints).
Incorrect Service Name: The k8s resolver configuration within the loadbalancing exporter specifies a service name that does not exist.
In both cases, the k8s resolver fails to retrieve the target endpoints for trace export, and the collector fails to start.
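
For the missing-permissions case, here is a minimal sketch of the RBAC objects the k8s resolver appears to need. The object names are hypothetical; the namespaces and service account follow the configuration and log output in this report (service tailsampling-svc in namespace tailsampler, service account my-opentelemetry-collector in namespace default) and may need adjusting for other setups:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-lb-endpoints-reader   # hypothetical name
  namespace: tailsampler           # namespace of the target service
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch"]  # list/watch match the failures in the log output below
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-lb-endpoints-reader   # hypothetical name
  namespace: tailsampler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-lb-endpoints-reader
subjects:
- kind: ServiceAccount
  name: my-opentelemetry-collector # service account from the log output
  namespace: default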

Steps to Reproduce

Deploy an OpenTelemetry collector with the loadbalancing exporter configured to use the k8s resolver.
Option 1: Missing Permissions:
Do not assign any Role or RoleBinding to the collector pod's service account.
Option 2: Incorrect Service Name:
Configure the k8s resolver in the loadbalancing exporter with a non-existent service name (see the minimal snippet after these steps).
Start the collector deployment.
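
For Option 2, a minimal exporter configuration along the following lines is enough to reproduce the failure; the service name below is hypothetical and intentionally does not resolve to any Kubernetes Service:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: nonexistent-svc.some-namespace  # hypothetical, non-existent service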

Expected Result

The OpenTelemetry collector should start successfully even if the k8s resolver initially fails to retrieve the target endpoints due to missing permissions or an incorrect service name. The collector should keep retrying the Kubernetes API in the background for trace export, and the other pipelines should continue to function as expected.

Actual Result

The collector fails to start, so it is also unavailable for exporting the other telemetry data in its pipelines.

Collector version

v0.95.0

Environment information

Kubernetes cluster

OpenTelemetry Collector configuration

exporters:
  debug: {}
  loadbalancing:
    protocol:
      otlp:
        timeout: 10s
        endpoint: localhost
        tls:
          insecure: true
    resolver:
      k8s:
        service: tailsampling-svc.tailsampler
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  batch: {}
  memory_limiter:
    check_interval: 10s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:14250
      thrift_compact:
        endpoint: ${env:MY_POD_IP}:6831
      thrift_http:
        endpoint: ${env:MY_POD_IP}:14268
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: ${env:MY_POD_IP}:9411
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - debug
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    traces:
      exporters:
      - debug
      - loadbalancing
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp

Log output

2024-06-28T09:37:04.914Z	info	service@v0.103.0/service.go:115	Setting up own telemetry...
2024-06-28T09:37:04.914Z	info	service@v0.103.0/telemetry.go:96	Serving metrics	{"address": ":8888", "level": "Normal"}
2024-06-28T09:37:04.914Z	info	exporter@v0.103.0/exporter.go:280	Development component. May change in the future.	{"kind": "exporter", "data_type": "logs", "name": "debug"}
2024-06-28T09:37:04.914Z	info	exporter@v0.103.0/exporter.go:280	Development component. May change in the future.	{"kind": "exporter", "data_type": "traces", "name": "debug"}
2024-06-28T09:37:04.915Z	info	memorylimiter/memorylimiter.go:160	Using percentage memory limiter	{"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "total_memory_mib": 15976, "limit_percentage": 80, "spike_limit_percentage": 25}
2024-06-28T09:37:04.915Z	info	memorylimiter/memorylimiter.go:77	Memory limiter configured	{"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "limit_mib": 12781, "spike_limit_mib": 3994, "check_interval": 10}
2024-06-28T09:37:04.915Z	warn	jaegerreceiver@v0.103.0/factory.go:49	jaeger receiver will deprecate Thrift-gen and replace it with Proto-gen to be compatbible to jaeger 1.42.0 and higher. See https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/18485 for more details.	{"kind": "receiver", "name": "jaeger", "data_type": "traces"}
2024-06-28T09:37:04.915Z	info	service@v0.103.0/service.go:182	Starting otelcol-k8s...	{"Version": "0.103.1", "NumCPU": 10}
2024-06-28T09:37:04.915Z	info	extensions/extensions.go:34	Starting extensions...
2024-06-28T09:37:04.915Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "health_check"}
2024-06-28T09:37:04.915Z	info	healthcheckextension@v0.103.0/healthcheckextension.go:32	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"10.1.1.32:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-06-28T09:37:04.915Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "health_check"}
2024-06-28T09:37:04.915Z	info	otlpreceiver@v0.103.0/otlp.go:102	Starting GRPC server	{"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "10.1.1.32:4317"}
2024-06-28T09:37:04.915Z	info	otlpreceiver@v0.103.0/otlp.go:152	Starting HTTP server	{"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "10.1.1.32:4318"}
W0628 09:37:04.918760       1 reflector.go:539] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
E0628 09:37:04.918786       1 reflector.go:147] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
W0628 09:37:06.037354       1 reflector.go:539] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
E0628 09:37:06.037425       1 reflector.go:147] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"

Additional context

No response

khyatigandhi0612 added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jun 28, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling self-assigned this on Jul 8, 2024
jpkrohling removed the needs triage (New item requiring triage) label on Jul 8, 2024

github-actions bot commented Sep 9, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 9, 2024
jpkrohling removed the Stale label on Sep 9, 2024