We need to support exporting Harvester system logs outside the cluster, to allow collecting and analyzing those logs to aid in debugging and Harvester management.
- The user can aggregate Harvester logs in a centralized location
  - k8s cluster logs
  - host system logs (i.e. `/var/log`)
- The user can export the aggregated Harvester logs outside the cluster (e.g. to Rancher)
- This enhancement does not cover integration into the Harvester / Rancher UI (though ideally this will eventually be implemented as well)
Install `rancher-logging` as a `ManagedChart` in the Harvester installer to collect logs from the Harvester cluster.
To collect the host system logs on each node, we will patch and enable `rancher-logging`'s `rke2` logging source. This will deploy a DaemonSet to mount each node's `/var/log/journal`. Collecting all logs under `/var/log/journal` is too much, so we will add systemd filters to the deployed `fluent-bit` pod to only collect logs from the kernel and important services:
```
Systemd_Filter    _SYSTEMD_UNIT=rke2-server.service
Systemd_Filter    _SYSTEMD_UNIT=rke2-agent.service
Systemd_Filter    _SYSTEMD_UNIT=rancherd.service
Systemd_Filter    _SYSTEMD_UNIT=rancher-system-agent.service
Systemd_Filter    _SYSTEMD_UNIT=wicked.service
Systemd_Filter    _SYSTEMD_UNIT=iscsid.service
Systemd_Filter    _TRANSPORT=kernel
```
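For reference, a sketch of how these filters might sit inside the `fluent-bit` systemd input section. The surrounding section layout is an assumption for illustration; the actual configuration file is rendered by the logging operator:

```
# Hypothetical rendering of the systemd input section in the generated
# fluent-bit config; multiple Systemd_Filter entries are OR'd together.
[INPUT]
    Name              systemd
    Path              /var/log/journal
    Systemd_Filter    _SYSTEMD_UNIT=rke2-server.service
    Systemd_Filter    _SYSTEMD_UNIT=rke2-agent.service
    Systemd_Filter    _SYSTEMD_UNIT=rancherd.service
    Systemd_Filter    _SYSTEMD_UNIT=rancher-system-agent.service
    Systemd_Filter    _SYSTEMD_UNIT=wicked.service
    Systemd_Filter    _SYSTEMD_UNIT=iscsid.service
    Systemd_Filter    _TRANSPORT=kernel
```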
Users will be able to configure where to send the logs by applying `ClusterOutput` and `ClusterFlow` resources to the cluster, but by default none will be added.
Once installed, the cluster should have the pods below:

```
NAME                                                READY   STATUS      RESTARTS   AGE
rancher-logging-685cf9664-w4wl2                     1/1     Running     0          17m
rancher-logging-rke2-journald-aggregator-6zfsp      1/1     Running     0          17m
rancher-logging-root-fluentbit-hj72q                1/1     Running     0          15m
rancher-logging-root-fluentd-0                      2/2     Running     0          15m
rancher-logging-root-fluentd-configcheck-ac2d4553   0/1     Completed   0          17m
```
Currently, users need to manually check Harvester for failing pods or services and manually inspect logs using `kubectl`.
This enhancement will allow users to send their logs using any of the output plugins, and to view and filter the logs of the entire cluster without being limited to one pod container at a time.
Users will be able to write and apply their own custom `ClusterOutput`s and `ClusterFlow`s to send logs to the desired location. For example, to send logs to Graylog, you can use the following YAML files:
```yaml
# graylog-cluster-flow.yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: "all-logs-gelf-hs"
  namespace: "cattle-logging-system"
spec:
  globalOutputRefs:
    - "example-gelf-hs"
```
```yaml
# graylog-cluster-output.yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: "example-gelf-hs"
  namespace: "cattle-logging-system"
spec:
  gelf:
    host: "192.168.122.159"
    port: 12202
    protocol: "udp"
```
To verify that they were installed successfully, you can check them with `kubectl`:

```
>>> kubectl get clusteroutputs -n cattle-logging-system example-gelf-hs
NAME              ACTIVE   PROBLEMS
example-gelf-hs   true

>>> kubectl get clusterflows -n cattle-logging-system all-logs-gelf-hs
NAME               ACTIVE   PROBLEMS
all-logs-gelf-hs   true
```
Loki will not be installed to the cluster by default, but you can install it manually:

- Install the helm chart:

  ```
  helm install --repo https://grafana.github.io/helm-charts --values enhancements/20220525-system-logging/loki-values.yaml --version 2.7.0 --namespace cattle-logging-system loki-stack loki-stack
  ```

- Apply the `ClusterFlow` and `ClusterOutput`:

  ```
  kubectl apply -f enhancements/20220525-system-logging/loki.yaml
  ```
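The contents of `loki.yaml` are not reproduced here. As a sketch, assuming the logging operator's `loki` output plugin and the chart's default in-cluster service name, the manifest could look roughly like the following (the resource names and URL are assumptions for illustration only):

```yaml
# Hypothetical sketch of the Loki ClusterOutput/ClusterFlow pair; the
# actual definitions live in enhancements/20220525-system-logging/loki.yaml.
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: "loki-output"            # assumed name
  namespace: "cattle-logging-system"
spec:
  loki:
    # assumes the loki-stack chart's default service on Loki's default port
    url: http://loki-stack.cattle-logging-system.svc:3100
    configure_kubernetes_labels: true
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: "all-logs-loki"          # assumed name
  namespace: "cattle-logging-system"
spec:
  globalOutputRefs:
    - "loki-output"
```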
After the logging operator has had some time to load and apply the `ClusterFlow` and `ClusterOutput`, you will see the logs flowing into Loki.
To view the Loki UI, you can port forward to the `loki-stack-grafana` service. For example, to map `localhost:3000` to `loki-stack-grafana`'s port 80:

```
kubectl port-forward --namespace cattle-logging-system service/loki-stack-grafana 3000:80
```

The default username is `admin`. To get the password for the admin user, run the following command:

```
kubectl get secret --namespace cattle-logging-system loki-stack-grafana --output jsonpath="{.data.admin-password}" | base64 --decode
```
Once the UI is open, you can select the "Explore" tab on the left to go to a page where you can manually send queries to Loki. In the search bar you can enter queries to select the logs you are interested in; the query language is described in the Grafana Loki docs. For example, you can select all logs in the `cattle-logging-system` namespace, or select Harvester host machine logs from the `rke2-server.service` service.
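As a sketch, the two example queries above could be written in LogQL as follows. The exact stream label names depend on how the `ClusterFlow` and Loki output are configured, so treat these as assumptions:

```
# All logs from the cattle-logging-system namespace
# (assumes kubernetes labels are attached to the log streams)
{namespace="cattle-logging-system"}

# Host machine logs from the rke2-server systemd unit
# (assumes the journald field _SYSTEMD_UNIT is mapped to a stream label)
{_SYSTEMD_UNIT="rke2-server.service"}
```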
Due to limitations with enabling and disabling `ManagedChart`s, the logging feature will be enabled by default and cannot be disabled later by the user.
Logging is not backed by a PVC; users need to configure a log receiver (see instructions above) in order to view and store logs.
None.
The bulk of the logging functionality is handled by installing the `rancher-logging` chart to deploy the main logging components:

| Name | Purpose |
|---|---|
| logging-operator | manages the `ClusterFlow`s and `ClusterOutput`s defining log routes |
| fluentd | the central log aggregator which will forward logs to other log collectors |
| fluent-bit | collects the logs from the cluster pods |
| journald-aggregator | deploys a pod to collect the journal logs from each node |
The journald-aggregator is a `fluent-bit` pod which collects the node logs by mounting the host's `/var/log/journal` directory into the pod. Using the `systemd` input plugin, the logs can be filtered by the fields in the log entry (e.g. `_SYSTEMD_UNIT`, `_TRANSPORT`, etc.). Collecting all the logs from `/var/log/journal` is too much, so we only select logs from some important services: rke2-server, rke2-agent, rancherd, rancher-system-agent, wicked, iscsid, and kernel logs.
The logging feature is enabled by default; however, due to an issue with enabling and disabling `ManagedChart`s, it cannot be disabled.
- Install the Harvester cluster
- Check that the related pods are created:
  - `rancher-logging`
  - `rancher-logging-rke2-journald-aggregator`
  - `rancher-logging-root-fluentbit`
  - `rancher-logging-root-fluentd-0`
  - `rancher-logging-root-fluentd-configcheck`
- Set up a log server, e.g. Graylog, Splunk, Elasticsearch, or a webhook. Make sure it is network reachable from the Harvester cluster VIP.
- Configure a `ClusterFlow` and `ClusterOutput` to route logs to this server
- Verify logs are being routed to the configured `ClusterOutput`
- Verify k8s pod logs are received
- Verify host kernel and systemd logs are received
No user intervention is required during the upgrade. After upgrade, logging should be installed and enabled.