
Logs do not reach any endpoints if one ClusterOutput is configured with an invalid endpoint #2013

@adamkpickering

Describe the bug:

When one ClusterOutput has an invalid endpoint, all logs cease to reach their destinations, including logs destined for a valid endpoint configured in another ClusterOutput. The fluentbit Pods produce errors like the following:

[2025/04/14 21:37:36] [error] [upstream] connection #129 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [error] [upstream] connection #130 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [error] [upstream] connection #64 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666485.766626455.flb', retry in 46 seconds: task_id=281, input=tail.0 > output=forward.0 (out_id=0)
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666391.837929658.flb', retry in 147 seconds: task_id=170, input=tail.0 > output=forward.0 (out_id=0)
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666392.763873065.flb', retry in 23 seconds: task_id=172, input=tail.0 > output=forward.0 (out_id=0)
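
For reference, these errors can be pulled straight from the agent Pods. A minimal sketch, assuming the operator labels the fluentbit DaemonSet Pods with app.kubernetes.io/name=fluentbit (adjust the selector if your labels differ):

# List the fluentbit agent Pods created by the operator
kubectl get pods -n cattle-logging-system -l app.kubernetes.io/name=fluentbit

# Tail their logs and filter for the connection/flush errors shown above
kubectl logs -n cattle-logging-system -l app.kubernetes.io/name=fluentbit --tail=200 \
  | grep -E '\[error\]|\[ warn\]'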

Expected behaviour:

Logs destined for a valid endpoint configured in one ClusterOutput should still reach their destination even if another ClusterOutput is configured with an invalid endpoint.

Steps to reproduce the bug:

Install something that will act as a destination for the logs. I used Kibana and Elasticsearch, but my understanding is that it doesn't matter what you use, as long as you have a valid place to send and view logs:

  1. Install ECK operator as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/install-using-yaml-manifest-quickstart
  2. Deploy an Elasticsearch cluster as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elasticsearch-deployment-quickstart. I used the following manifest:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: cattle-logging-system
spec:
  version: 8.17.4
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
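
Before moving on, it is worth waiting for the cluster to become healthy. A minimal check via the ECK custom resource (the HEALTH column should eventually read green):

kubectl get elasticsearch quickstart -n cattle-logging-system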
  3. Deploy Kibana as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/kibana-instance-quickstart. I used the following manifest:
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: quickstart
  namespace: cattle-logging-system
spec:
  version: 8.17.4
  count: 1
  elasticsearchRef:
    name: quickstart
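
Kibana can be checked the same way, and optionally exposed locally. A sketch assuming the usual ECK Service naming convention quickstart-kb-http (this name may differ in other setups):

kubectl get kibana quickstart -n cattle-logging-system

# Optional: make the Kibana UI reachable on https://localhost:5601
kubectl port-forward -n cattle-logging-system service/quickstart-kb-http 5601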

The following is the part that is relevant to this issue:

  1. helm upgrade --install --wait --create-namespace --namespace cattle-logging-system logging-operator --version 4.10.0 oci://ghcr.io/kube-logging/helm-charts/logging-operator
  2. Apply the following Logging and FluentbitAgent:
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: rancher-logging-root
  namespace: cattle-logging-system
spec:
  controlNamespace: cattle-logging-system
  fluentd:
    disablePvc: true
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 15
      tcpSocket:
        port: 24240
    metrics:
      prometheusRules: false
      serviceMonitor: false
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: FluentbitAgent
metadata:
  name: rancher-logging-root
  namespace: cattle-logging-system
spec:
  metrics:
    prometheusRules: false
    serviceMonitor: false
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    value: "true"
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    value: "true"
  3. Apply the following ClusterOutput:
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: elasticsearch
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: quickstart-es-http.cattle-logging-system.svc.cluster.local
    port: 9200
    scheme: https
    ssl_verify: false
    ssl_version: TLSv1_2
    user: elastic
    password:
      valueFrom:
        secretKeyRef:
          name: quickstart-es-elastic-user
          key: elastic
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
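
Recent operator versions surface an active/problems summary on the ClusterOutput resource, which is useful later when the broken output is added (the exact columns may vary by version):

kubectl get clusteroutputs -n cattle-logging-system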
  4. Apply the following ClusterFlow:
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: elasticsearch
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
    - elasticsearch
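
The same check works for the flow; it should show up as active once the operator has reconciled it:

kubectl get clusterflows -n cattle-logging-system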
  5. Check that logs are coming through in Elasticsearch/Kibana.
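
One way to do this without touching Kibana is to port-forward the Elasticsearch Service referenced in the ClusterOutput above and list the indices. A sketch using the quickstart credentials (the index names depend on the output configuration; the point is that indices exist and their document counts grow while logs flow):

kubectl port-forward -n cattle-logging-system service/quickstart-es-http 9200 &

# Read the elastic user's password from the same Secret the ClusterOutput uses
PASSWORD=$(kubectl get secret quickstart-es-elastic-user -n cattle-logging-system \
  -o go-template='{{.data.elastic | base64decode}}')

curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cat/indices?v"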
  6. Apply the following ClusterOutput:
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: badelasticsearch
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: invalidaddress.cattle-logging-system.svc.cluster.local
    port: 9200
    scheme: https
    ssl_verify: false
    ssl_version: TLSv1_2
    user: elastic
    password:
      valueFrom:
        secretKeyRef:
          name: quickstart-es-elastic-user
          key: elastic
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
  7. Apply the following ClusterFlow:
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: badelasticsearch
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
    - badelasticsearch
  8. Note that logs are no longer coming through in Elasticsearch/Kibana, and that the fluentbit agent Pods now log errors like the ones shown above.
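
To confirm where the pipeline stalls, the fluentd side can be checked as well. A sketch assuming the default StatefulSet naming of <Logging name>-fluentd (the Pod name may differ):

# fluentbit -> fluentd forwarding errors (same label assumption as above)
kubectl logs -n cattle-logging-system -l app.kubernetes.io/name=fluentbit --tail=50

# fluentd's own view: expect retries/errors for the invalid elasticsearch host
kubectl logs -n cattle-logging-system rancher-logging-root-fluentd-0 --tail=50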

Additional context:

This was first noticed in the Rancher Logging helm chart, which repackages this project for easy use in Rancher. The reported issue is rancher/rancher#26771.

Environment details:

  • Kubernetes version (e.g. v1.15.2): v1.31.6
  • Cloud-provider/provisioner (e.g. AKS, GKE, EKS, PKE etc): k3s
  • logging-operator version (e.g. 2.1.1): 4.10.0
  • Install method (e.g. helm or static manifests): Rancher Logging
  • Logs from the misbehaving component (and any other relevant logs): see above
  • Resource definition (possibly in YAML format) that caused the issue, without sensitive data: see above

/kind bug
