Description
Describe the bug:
When one `ClusterOutput` has an invalid endpoint, all logs (including those destined for a valid endpoint configured in another `ClusterOutput`) cease to reach their destinations. The `fluentbit` Pods produce errors like the following:
```
[2025/04/14 21:37:36] [error] [upstream] connection #129 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [error] [upstream] connection #130 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [error] [upstream] connection #64 to tcp://10.43.190.83:24240 timed out after 10 seconds (connection timeout)
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666485.766626455.flb', retry in 46 seconds: task_id=281, input=tail.0 > output=forward.0 (out_id=0)
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [error] [output:forward:forward.0] no upstream connections available
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666391.837929658.flb', retry in 147 seconds: task_id=170, input=tail.0 > output=forward.0 (out_id=0)
[2025/04/14 21:37:36] [ warn] [engine] failed to flush chunk '1-1744666392.763873065.flb', retry in 23 seconds: task_id=172, input=tail.0 > output=forward.0 (out_id=0)
```
Expected behaviour:
Logs destined for valid endpoints configured in a `ClusterOutput` should reach their destination even if an invalid endpoint is configured in another `ClusterOutput`.
Steps to reproduce the bug:
- Install something that will act as a destination for the logs. I used Kibana and Elasticsearch, but my understanding is that it doesn't matter what you use, as long as you have a valid place to send and view logs:
- Install the ECK operator as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/install-using-yaml-manifest-quickstart
- Deploy an Elasticsearch cluster as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elasticsearch-deployment-quickstart. I used the following manifest:
```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: cattle-logging-system
spec:
  version: 8.17.4
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
```
- Deploy Kibana as per https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/kibana-instance-quickstart. I used the following manifest:
```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: quickstart
  namespace: cattle-logging-system
spec:
  version: 8.17.4
  count: 1
  elasticsearchRef:
    name: quickstart
```
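Before installing the operator, it helps to wait until both quickstart resources report healthy. A rough check (resource kinds and namespace match the manifests above):

```shell
# Elasticsearch should eventually report HEALTH=green and Kibana should become available.
kubectl get elasticsearch,kibana -n cattle-logging-system
```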
- Install the logging-operator. The following is the part that is relevant to this issue:
```shell
helm upgrade --install --wait --create-namespace --namespace cattle-logging-system logging-operator --version 4.10.0 oci://ghcr.io/kube-logging/helm-charts/logging-operator
```
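Optionally, confirm the operator is running before applying any logging resources (the deployment name is an assumption based on the release name above):

```shell
# Deployment name assumed from the "logging-operator" helm release above.
kubectl -n cattle-logging-system rollout status deployment/logging-operator
```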
- Apply the following `Logging` and `FluentbitAgent`:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: rancher-logging-root
  namespace: cattle-logging-system
spec:
  controlNamespace: cattle-logging-system
  fluentd:
    disablePvc: true
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 15
      tcpSocket:
        port: 24240
    metrics:
      prometheusRules: false
      serviceMonitor: false
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: FluentbitAgent
metadata:
  name: rancher-logging-root
  namespace: cattle-logging-system
spec:
  metrics:
    prometheusRules: false
    serviceMonitor: false
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    value: "true"
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    value: "true"
```
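After this, the operator should render a fluentd StatefulSet and a fluentbit DaemonSet. A rough way to confirm everything came up (exact workload names are derived from the resource name `rancher-logging-root`):

```shell
# All fluentd/fluentbit pods created by the operator should become Ready.
kubectl get statefulsets,daemonsets,pods -n cattle-logging-system
```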
- Apply the following `ClusterOutput`:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: elasticsearch
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: quickstart-es-http.cattle-logging-system.svc.cluster.local
    port: 9200
    scheme: https
    ssl_verify: false
    ssl_version: TLSv1_2
    user: elastic
    password:
      valueFrom:
        secretKeyRef:
          name: quickstart-es-elastic-user
          key: elastic
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
```
- Apply the following `ClusterFlow`:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: elasticsearch
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
  - elasticsearch
```
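Once applied, both resources should be picked up by the operator; recent operator versions expose ACTIVE/PROBLEMS columns, so a quick check looks roughly like this:

```shell
# Both resources should show ACTIVE=true and no PROBLEMS.
kubectl get clusteroutputs,clusterflows -n cattle-logging-system
```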
- Check that logs are coming through in Elasticsearch/Kibana.
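One way to check this from the command line, reusing the service and credentials referenced in the `ClusterOutput` above (index names depend on the output configuration, so this only lists what exists):

```shell
# Password for the elastic user, from the secret referenced in the ClusterOutput.
PASSWORD=$(kubectl get secret quickstart-es-elastic-user -n cattle-logging-system \
  -o go-template='{{.data.elastic | base64decode}}')

# Port-forward the Elasticsearch HTTP service and list indices; document counts
# should grow while logs are flowing.
kubectl port-forward -n cattle-logging-system service/quickstart-es-http 9200 &
curl -sk -u "elastic:${PASSWORD}" "https://localhost:9200/_cat/indices?v"
```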
- Apply the following `ClusterOutput`:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: badelasticsearch
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: invalidaddress.cattle-logging-system.svc.cluster.local
    port: 9200
    scheme: https
    ssl_verify: false
    ssl_version: TLSv1_2
    user: elastic
    password:
      valueFrom:
        secretKeyRef:
          name: quickstart-es-elastic-user
          key: elastic
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
```
- Apply the following `ClusterFlow`:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: badelasticsearch
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
  - badelasticsearch
```
- Note that logs are no longer coming through in Elasticsearch/Kibana, and that the `fluentbit` agent logs now show errors like the ones quoted above.
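The agent errors can be pulled directly from the DaemonSet; the name below assumes the operator's default `<name>-fluentbit` naming for the `FluentbitAgent` defined earlier:

```shell
# DaemonSet name assumed from the FluentbitAgent "rancher-logging-root" above.
kubectl logs -n cattle-logging-system daemonset/rancher-logging-root-fluentbit --tail=50
```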
Additional context:
This was first noticed in the Rancher Logging helm chart, which repackages this project for easy use in Rancher. The reported issue is rancher/rancher#26771.
Environment details:
- Kubernetes version (e.g. v1.15.2): v1.31.6
- Cloud-provider/provisioner (e.g. AKS, GKE, EKS, PKE etc): k3s
- logging-operator version (e.g. 2.1.1): 4.10.0
- Install method (e.g. helm or static manifests): Rancher Logging
- Logs from the misbehaving component (and any other relevant logs): see above
- Resource definition (possibly in YAML format) that caused the issue, without sensitive data: see above
/kind bug