
[Tracing] received corrupt message of type InvalidContentType - When Collector is not in mesh, or has ports marked inbound skip #13427

Open
@jseiser

Description


What is the issue?

Linkerd breaks traces when sending to an OTLP collector: the collector is confused into attributing the traces to itself rather than to the originating pod.

Example: grafana/alloy#1336 (comment)

As a workaround, we first tried removing the collector from the mesh, but that breaks linkerd-proxy's ability to send traces. We then left the collector in the mesh but told it to skip the relevant ports inbound; that also breaks linkerd-proxy's ability to send traffic.
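For reference, the two workarounds were attempted roughly as follows (a sketch; these are Linkerd's standard pod-template annotations, applied to the Alloy workload in our setup):

```yaml
# Option 1: remove the collector from the mesh entirely
# (pod template annotation on the Alloy workload)
linkerd.io/inject: disabled

# Option 2: keep the collector meshed, but have the proxy
# skip the OTLP ports on the inbound side
config.linkerd.io/skip-inbound-ports: "4317,4318"
```

Both options result in the linkerd-proxy "Failed to connect" errors shown below.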

How can it be reproduced?

  1. AWS EKS Cluster
  2. Linkerd
  3. Grafana Alloy
  4. Configure Linkerd-jaeger to send traces to Grafana Alloy
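Step 4 above was done via the linkerd-jaeger extension's Helm values, roughly like this (a sketch; value names are assumed from the chart version we run, and the address/service-account values are from our environment — they match the proxy env vars shown further down):

```yaml
# linkerd-jaeger Helm values (sketch)
collector:
  enabled: false        # no bundled collector; Alloy receives the traces
jaeger:
  enabled: false        # no bundled Jaeger backend
webhook:
  collectorSvcAddr: alloy-cluster.grafana-alloy.svc.cluster.local:4317
  collectorSvcAccount: alloy
  collectorTraceProtocol: opentelemetry
```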

Logs, error output, etc

This happens with Alloy removed from the mesh, or with Alloy configured to skip the ports inbound.

{"timestamp":"2024-12-04T16:30:47.984933Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.7.129:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.202478Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.229660Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4318: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}

output of linkerd check -o short

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 24.11.4 but the latest edge version is 24.11.8
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 24.11.4 but the latest edge version is 24.11.8
    see https://linkerd.io/2/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-c9b6b96c-lwwv4 (edge-24.11.4)
	* linkerd-destination-c9b6b96c-m44vg (edge-24.11.4)
	* linkerd-destination-c9b6b96c-xm56z (edge-24.11.4)
	* linkerd-identity-7d9c687659-mftdm (edge-24.11.4)
	* linkerd-identity-7d9c687659-qcq5l (edge-24.11.4)
	* linkerd-identity-7d9c687659-sj5zp (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-c2q6z (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-cltg2 (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-nn9t7 (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
√ control plane proxies and cli versions match

linkerd-ha-checks
-----------------
√ multiple replicas of control plane pods

linkerd-extension-checks
------------------------
√ namespace configuration for extensions

linkerd-jaeger
--------------
√ linkerd-jaeger extension Namespace exists
√ jaeger extension pods are injected
√ jaeger injector pods are running
√ jaeger extension proxies are healthy
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
	* jaeger-injector-bdf688f96-ddcxf (edge-24.11.4)
	* jaeger-injector-bdf688f96-xf8hn (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
√ jaeger extension proxies and cli versions match

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-cfbfbfcbc-jr4fn (edge-24.11.4)
	* metrics-api-cfbfbfcbc-qbzgz (edge-24.11.4)
	* prometheus-5464dc854b-wl6w6 (edge-24.11.4)
	* tap-858b7b86d4-gl9mn (edge-24.11.4)
	* tap-858b7b86d4-vb8wz (edge-24.11.4)
	* tap-injector-d49bf4cfb-djfcv (edge-24.11.4)
	* tap-injector-d49bf4cfb-fv2zp (edge-24.11.4)
	* web-5dd7bf96db-dj55j (edge-24.11.4)
	* web-5dd7bf96db-qcg8m (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ viz extension self-check

Status check results are √

Environment

Kubernetes version - 1.30
Cluster Environment - EKS
Host OS - Bottlerocket

Possible solution

No response

Additional context

I'm not sure why port 4318 ever shows up in the logs; it's configured for 4317.

LINKERD2_PROXY_TRACE_COLLECTOR_SVC_ADDR:                   alloy-cluster.grafana-alloy.svc.cluster.local:4317
LINKERD2_PROXY_TRACE_PROTOCOL:                             opentelemetry
LINKERD2_PROXY_TRACE_SERVICE_NAME:                         linkerd-proxy
LINKERD2_PROXY_TRACE_COLLECTOR_SVC_NAME:                   alloy.grafana-alloy.serviceaccount.identity.linkerd.cluster.local
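For what it's worth, 4317 is the conventional OTLP/gRPC port and 4318 the OTLP/HTTP port, and an Alloy OTLP receiver typically listens on both. A minimal receiver block looks roughly like this (a sketch; the component labels and downstream `otelcol.processor.batch` wiring are placeholders, not our exact config):

```alloy
otelcol.receiver.otlp "default" {
  // OTLP over gRPC - the port linkerd-proxy is configured to use
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  // OTLP over HTTP - may explain why 4318 shows up as an endpoint
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}
```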

The pod it's failing to connect to is an Alloy pod.

❯ kubectl get pods -A -o wide | rg 10.1.16.177                                                
grafana-alloy               alloy-68fbb65465-wc6v5                                      3/3     Running     0             20h    10.1.16.177   i-03b207599fb2636db.us-gov-west-1.compute.internal   <none>           <none>

The linkerd-proxy on the Alloy pod logs this:

{"timestamp":"2024-12-04T17:01:18.536899Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:18.553171Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:19.037516Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:19.054832Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"} 

There are no errors on the Alloy pod itself.

Everything other than linkerd-proxy is still able to send traces without a problem, including pods that are fully in the mesh.

Would you like to work on fixing this bug?

None
