Replace Grafana with Plutono, Loki with Vali (gardener#7318)
* Add scripts to replace Grafana with Plutono, Loki with Vali, Promtail with Valitail

These scripts are used to generate the subsequent commits.

    .scripts/run.sh

The scripts are removed later in this PR.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>

* 1. Replace the Grafana GitHub page with that of Plutono

    git grep -z -l github\\.com/grafana/grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i -E 's|github.com/grafana/grafana|github.com/credativ/plutono|g'

* 2. Replace the Loki GitHub page with that of Vali

    git grep -z -l github\\.com/grafana/loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i -E 's|github.com/grafana/loki|github.com/credativ/vali|g'

* 3. Replace the Grafana container image with that of Plutono

    git grep -z -l "repository: eu.gcr.io/gardener-project/3rd/grafana/grafana" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i -E 's|repository: eu.gcr.io/gardener-project/3rd/grafana/grafana|repository: ghcr.io/credativ/plutono|
    s/tag: "7.5.17"/tag: "v7.5.21"/'

* 4. Replace the Loki and Promtail container images with those of Vali and Valitail

    git grep -z -l "repository: eu.gcr.io/gardener-project/3rd/grafana/loki" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i -E 's|repository: eu.gcr.io/gardener-project/3rd/grafana/loki|repository: ghcr.io/credativ/vali|
    s|repository: eu.gcr.io/gardener-project/3rd/grafana/promtail|repository: ghcr.io/credativ/valitail|
    s/tag: "2.2.1"/tag: "v2.2.5"/'

* 5. Use the Plutono GitHub page as a generic web link

The generic web link 'https://github.com/credativ/plutono' is not as specific as some of the previous links, but we do not have documentation for the Plutono and Vali projects, so it is up to the reader to find the matching documentation pages by following the links from the Plutono landing page.

    git grep -z -l grafana\\.com -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i -E 's|grafana.com([^) ]*)|github.com/credativ/plutono|g'

* 6. Replace grafana with plutono in folder names

    find ./* -type d -name grafana -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' folder
    do
      mv "$folder" "${folder/grafana/plutono}"
    done

    find ./* -type l -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' link
    do
      target=$(readlink "$link")
      rm "$link"
      ln -s "${target/grafana/plutono}" "$link"
    done

* 7. Replace loki with vali in folder names

    find ./* -type d -name loki -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' folder
    do
      mv "$folder" "${folder/loki/vali}"
    done

* 8. Replace promtail with valitail in folder names

    find ./* -type d -name promtail -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' folder
    do
      mv "$folder" "${folder/promtail/valitail}"
    done

* 9. Replace grafana with plutono in file names

    find ./* -type f -name '*grafana*' -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' file
    do
      mv "$file" "${file/grafana/plutono}"
    done

* 10. Replace loki with vali in file names

    find ./* -type f -name '*loki*' -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' file
    do
      mv "$file" "${file/loki/vali}"
    done

* 11. Replace promtail with valitail in file names

    find ./* -type f -name '*promtail*' -not -path './vendor/*' -print0 \
    | while IFS= read -r -d '' file
    do
      mv "$file" "${file/promtail/valitail}"
    done

* 12. Replace grafana with plutono in file contents

    git grep -z -l grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i 's/grafana/plutono/g'

* 13. Replace loki with vali in file contents

    git grep -z -l loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' ':!*crd-fluentbit.fluent.io_*outputs.yaml' \
    | xargs -0 sed -i 's/loki/vali/g'

* 14. Replace promtail with valitail in file contents

    git grep -z -l promtail -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i 's/promtail/valitail/g'

* 15. Replace Grafana with Plutono in file contents

    git grep -z -l Grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i 's/Grafana/Plutono/g'

* 16. Replace GF_ with PL_ in file contents

GF_ is the Grafana prefix that is used in environment variables.

    git grep -z -l " GF_" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i 's/GF_/PL_/g'

* 17. Replace Loki with Vali in file contents

    git grep -z -l Loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' ':!*crd-fluentbit.fluent.io_*outputs.yaml' \
    | xargs -0 sed -i 's/Loki/Vali/g'

* 18. Replace Promtail with Valitail in file contents

    git grep -z -l Promtail -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
    | xargs -0 sed -i 's/Promtail/Valitail/g'

* Address PR comments: revert the changes in docs/proposals

RF> We don't maintain GEPs, so revert changes in docs/proposals

    git checkout <pr-base> -- docs/proposals

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Remove the Grafana->Plutono, Loki->Vali, Promtail->Valitail replacement scripts

These scripts were used to generate the preceding commits.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>

* make generate

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Update the gardener/logging images to a version that supports vali

The PR gardener/logging#186 in the gardener/logging repository replaces Loki with Vali. This commit updates the 4 images to a release that contains that change.
- gardener/fluent-bit-to-vali
- gardener/vali-curator
- gardener/telegraf-iptables
- gardener/event-logger

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Fix the integration test: test/integration/gardenlet/shoot/care

Grafana was replaced with Plutono, and the missing deployments are sorted alphabetically, so we need to change the order to let the integration tests pass.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Adjust the Vali prefix (and keep the Plutono prefix)

The Vali prefix is used to ship logs from the shoot nodes to the control plane and hence it can be changed consistently during the shoot reconciliation.

The Plutono prefix is deliberately not changed here and will be adjusted in a future PR. Currently the Grafana/Plutono prefix (gu) is also hard coded in the Dashboard. Changing it now would require a coordinated release with the Dashboard, and the Dashboard would show an incorrect link until the shoot cluster is reconciled. We therefore postpone this cleanup to a future PR, where the Dashboard may be able to pick the Plutono URL from the monitoring secret and hence support a smooth transition to a new prefix.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Improve logging cleanup

Previously the DeleteVali function did not clean up all the logging artifacts properly. This likely went unnoticed because the logging stack is rarely deleted. We noticed it during the migration from loki to vali.

The ConfigMap vali-config actually has a hash suffix and cannot be deleted by name.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Fix data source names for the extension dashboards

If an extension contributes dashboards to Grafana/Plutono, their annotation data source may need to be adjusted to the Grafana/Plutono transition, because it may contain the hardcoded string `Grafana`, which needs to be replaced with `Plutono`.

Dashboards could also be using a hardcoded string `loki` as a data source for showing logs, which needs to be renamed to `vali`.

We replace these strings in memory. This is to ensure that when we transition to Plutono/Vali, extension dashboards can be loaded successfully in Plutono. Later, and without time pressure, we can adapt the dashboards in the individual extension repositories.

We cloned all the extensions in the Gardener GitHub organization and checked the occurrences of the Grafana or Loki strings with this command:

    $ for i in gardener-extension-*; do echo $i; cd $i; git grep -i -r -E 'grafana|loki' -- ':!/vendor'; cd - >/dev/null; done
    gardener-extension-networking-calico
    charts/internal/calico-monitoring/calico-felix-dashboard.json: "datasource": "-- Grafana --",
    charts/internal/calico-monitoring/calico-typha-dashboard.json: "datasource": "-- Grafana --",
    gardener-extension-networking-cilium
    charts/internal/cilium-monitoring/cilium-agent-metrics-dashboard.json: "datasource": "-- Grafana --",
    charts/internal/cilium-monitoring/cilium-operator-metrics-dashboard.json: "datasource": "-- Grafana --",
    charts/internal/cilium-monitoring/hubble-metrics-dashboard.json: "datasource": "-- Grafana --",
    gardener-extension-os-coreos
    gardener-extension-os-gardenlinux
    gardener-extension-os-ubuntu
    gardener-extension-provider-alicloud
    gardener-extension-provider-aws
    gardener-extension-provider-azure
    gardener-extension-provider-equinix-metal
    gardener-extension-provider-gcp
    gardener-extension-provider-openstack
    gardener-extension-provider-vsphere
    .ci/terraform/local-values.yaml: loki:
    gardener-extension-registry-cache
    gardener-extension-runtime-gvisor
    gardener-extension-shoot-cert-service
    charts/internal/shoot-cert-management-seed/cert-dashboard.json: "datasource": "loki",
    gardener-extension-shoot-dns-service
    gardener-extension-shoot-networking-filter
    gardener-extension-shoot-networking-problemdetector
    charts/internal/shoot-network-problem-detector-controller-seed/nwpd-dashboard.json: "datasource": "-- Grafana --",
    gardener-extension-shoot-networking-traffic-gauger
    gardener-extension-shoot-oidc-service

This suggests that this commit covers all the 7 cases where a replacement is needed.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Add the hash of the labels as a suffix to the fluentbit custom resource

The fluentbit custom resource contains a single set of labels that is used for
- labels of the daemon set
- label selector of the daemon set
- labels of the pod template

Generally it is possible to change the labels of the pod template, but the label selector field of a daemonset is immutable. The fluent operator uses the same set of labels for both, which means that we cannot change the pod template labels.

A mitigation is to add a hash of the labels to the name of the fluentbit custom resource. When we change the labels, we'll get a new hash. The gardener resource manager will delete the old fluentbit resource and create a new one. The fluentbit operator will delete the old daemon set and create a new one. Hence we won't face the issue of trying to change an immutable field in the daemonset.
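
The mitigation above can be illustrated with a small shell sketch. It is not the code added by this commit (which lives in Gardener's Go sources); the helper name `name_with_label_hash`, the use of SHA-256 and the 5-character suffix length are assumptions made only for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of Gardener): derive a resource name that
# changes whenever the set of labels changes.
# Usage: name_with_label_hash <base-name> <key=value>...
name_with_label_hash() {
  local base="$1"
  shift
  local hash
  # Sort the labels so that their order does not influence the hash,
  # then keep the first 5 hex characters of the SHA-256 digest (assumed length).
  hash=$(printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -c1-5)
  printf '%s-%s\n' "$base" "$hash"
}

# Same labels in a different order yield the same name ...
name_with_label_hash fluent-bit app=fluent-bit role=logging
name_with_label_hash fluent-bit role=logging app=fluent-bit
# ... while a changed label yields a new name, i.e. a new resource.
name_with_label_hash fluent-bit app=fluent-bit role=logs
```

Because the name changes together with the labels, the old object is deleted and a new one is created instead of an immutable selector being updated in place.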

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* --- Empty separator commit

The following commits shall be reverted in a future release because they are only needed for the transition phase.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Delete Grafana

As part of the migration from Grafana to Plutono, we delete the Grafana deployment and related resources. They are replaced by the Plutono artefacts.

The newly added functions are copies of their "Plutono" counterparts.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Delete Loki (except for its PVC+PV)

During the migration from Loki to Vali, we delete the Loki StatefulSet and related resources. They are replaced by the Vali artefacts. The exception here is the PersistentVolumeClaim, which can be reused with Vali, thereby preserving the logs stored in Loki.

The gardenlet needs to be able to delete the old loki artefacts, hence the cluster role is adjusted.

The newly added functions are copies of their "Vali" counterparts.

The deletion is guarded by the lokiPvcExists condition: the existence of a loki-loki-0 persistent volume claim indicates that there is a loki based logging stack and it shall be deleted.

The existing unit tests are adjusted so that the mock client returns false for the lokiPvcExists condition.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename the Loki PVC to Vali (!)

The name of a PVC is derived from
- the name of the pod which is created by the StatefulSet, e.g. vali-0, and
- the name of the volume claim template in the StatefulSet, e.g. vali.

So the `loki` StatefulSet uses a `loki-loki-0` PVC, whereas the new `vali` StatefulSet uses a `vali-vali-0` PVC.

The StatefulSet creates a new PVC based on the PVC template if one does not exist, or it uses the existing one with the matching name.

The intention of this PR is to reuse the disk of Loki for Vali so that the 2 weeks' worth of logs that were ingested into Loki are not lost due to the migration to Vali.

Hence, to reuse the disk of the Loki StatefulSet for the Vali StatefulSet, we need to rename the `loki-loki-0` PVC to `vali-vali-0`.

It is unfortunately not possible to "simply" rename a PVC: the name is part of the identity of a Kubernetes resource and it cannot be changed.

However, with the approach described in gardener#7318 (comment) it is possible to create a new PVC with the new name `vali-vali-0` that is bound to the persistent volume of the `loki-loki-0` PVC.

The trick is to change the PV's ReclaimPolicy temporarily to Retain and delete the Loki PVC. The PV will not be deleted due to the Retain reclaim policy. After setting the claimRef of the PV to nil, a new Vali PVC can be created which will be bound by the PV/PVC controller in the kube controller manager to the existing PV. Ultimately this leads to a state that is similar to "simply" renaming the PVC.
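
The naming rule stated at the beginning of this description can be sketched as a tiny shell function. The function name `pvc_name` is hypothetical and only illustrates how the two PVC names mentioned above come about.

```shell
#!/bin/bash
# Hypothetical helper: a StatefulSet's PVC is named
#   <volume claim template name>-<pod name>
# and the pod name itself is <StatefulSet name>-<ordinal>.
pvc_name() {
  local template="$1" statefulset="$2" ordinal="$3"
  printf '%s-%s-%s\n' "$template" "$statefulset" "$ordinal"
}

pvc_name loki loki 0   # prints loki-loki-0
pvc_name vali vali 0   # prints vali-vali-0
```

Since the StatefulSet name is part of the PVC name, renaming the StatefulSet from `loki` to `vali` necessarily changes the name of the PVC it looks for.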

An executable example flow to rename loki to vali in the control plane looks like this:

    #!/bin/bash -exv
    # Mark the volume of loki
    kubectl exec loki-0 -c loki -- sh -c "echo loki > /data/hello"
    # Save the manifests of the StatefulSet and the PVC (kubectl neat is a kubectl plugin)
    kubectl get sts loki -o yaml | kubectl neat > sts.yml
    kubectl get pvc loki-loki-0 -o yaml | kubectl neat > pvc.yml
    kubectl delete sts loki
    # Keep the PV when its PVC is deleted
    pv_id=$(kubectl get pvc loki-loki-0 -o jsonpath='{.spec.volumeName}')
    kubectl patch pv "$pv_id" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
    kubectl delete pvc loki-loki-0
    # Release the PV so that it can be bound to a new PVC
    kubectl patch pv "$pv_id" -p '{"spec":{"claimRef": null }}'
    # Recreate the PVC and the StatefulSet under the new names
    sed 's/loki-loki-0/vali-vali-0/g' pvc.yml | kubectl apply -f -
    sed 's/name: loki$/name: vali/g' sts.yml | kubectl apply -f -
    # Restore the original reclaim policy
    kubectl patch pv "$pv_id" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
    # Wait for the vali StatefulSet to be ready
    kubectl rollout status sts/vali
    # Verify that the volume of vali is the former loki volume
    kubectl exec vali-0 -c vali -- cat /data/hello

Hopefully the explanation above helps to follow the code that is added in this commit.

In case of unexpected errors, we accept that we cannot reuse the Loki PV and attempt to clean up the PV/PVC so that the next reconciliation can proceed with ultimately creating a new PV for Vali. Note that after we have successfully deleted the Loki PVC, we can't easily find the corresponding (unnamespaced) Loki PV in the next iteration, hence we try to do the cleanup in the first iteration.

One expected error case is when the 30s deadline of the shoot reconciliation is reached, so we use a new context with a 1 minute deadline for the more critical steps in the second half of the flow and also for the recovery steps after an unexpected error.

Note that the loki-0 pod currently cannot terminate gracefully in a timely manner: it is eventually killed by the kubelet after 30s.
We reduce the graceful termination timeout to 5 seconds when deleting Loki because, based on its logs, Loki itself waits for 30s after completing almost all of its graceful shutdown within 1s. By reducing the graceful termination timeout we can use the remaining time to proceed with the flow above.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename the loki folder on loki's disk to vali

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Shafeeque E S <shafeeque.e.s@sap.com>

* Delete the seed's garden Grafana artifacts

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename Loki's PVC to Vali and delete Loki in the garden namespace

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Allow the fluentbits to talk to loki during the transition phase

When the gardenlet is updated to this PR, but the shoots are not yet reconciled, the fluentbits in the garden namespace are sending the control plane logs to the logging service in the control plane which still points to the loki-0 pod.

By adding this label, the generated network policies allow the network communication between the fluentbits and loki-0.

When the shoot is reconciled, loki-0 is going to be replaced by vali-0, and that case is already covered by the network policy label in the line above.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Improve health check during the transition period

In the time period between a seed being upgraded to the release that contains this PR and the shoot being reconciled, the seed's gardenlet should not expect the Plutono/Vali artifacts to be healthy, but instead expect the old Grafana/Loki deployment or statefulset to be there.

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Use wait.PollUntilWithContext instead of a for loop

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Stop the promtail service on the shoot nodes

The logs from the pods in the shoot cluster's kube-system namespace are sent to Loki/Vali in the control plane (in the seed) by the promtail/valitail systemd service that runs on each shoot node.

The promtail.service starts promtail; the auxiliary promtail-fetch-token.service periodically fetches a bearer token that is used to authenticate promtail with Loki.

So far, this PR has consistently renamed promtail to valitail, and hence when applied, a new valitail.service and valitail-fetch-token.service will be added to systemd on each shoot node.

However, this generic rename approach does not stop or clean up the old promtail systemd services, which would continue to run on the "updated" shoot nodes. The old promtail services would not "work" because promtail is not compatible with Vali, but they would still run and unnecessarily consume some memory (~50MB) on the shoot nodes.

This commit takes care of stopping the promtail services to avoid unnecessarily using memory on the shoot nodes.
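
As a rough illustration of what stopping the two legacy units amounts to, consider the following shell sketch. It is not the code added by this commit; the function name `stop_legacy_promtail` and the `SYSTEMCTL` override (used only to allow a dry run) are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch, not the actual Gardener change.
# SYSTEMCTL can be overridden (e.g. with "echo") to try this out without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

stop_legacy_promtail() {
  local unit
  for unit in promtail.service promtail-fetch-token.service; do
    # A failure to stop (e.g. the unit does not exist on a fresh node)
    # is reported but must not fail the caller.
    "$SYSTEMCTL" stop "$unit" || echo "could not stop $unit (it may not exist)" >&2
  done
}

# Dry run: print the commands instead of executing them.
SYSTEMCTL=echo stop_legacy_promtail
```

The dry run prints `stop promtail.service` and `stop promtail-fetch-token.service`; on a real node the default `systemctl` would be used instead.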

Note that the promtail services are not explicitly "removed" by this commit, they are just stopped: it is sufficient to do so to prevent using resources unnecessarily. When the node is recreated due to an OS/Kubernetes version upgrade or due to cluster autoscaling events, only the new valitail services are going to be provisioned. In that pristine state, upon startup, a single log line in journalctl will show that the valitail service attempted to stop the no longer existing promtail services.

This commit can be reverted in a future release. Note that we need to stop the promtail services only once on updated nodes, so it is fine to revert this commit in the next Gardener release even if the shoot nodes are not recreated by then.

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Address PR comments: add release v1.72 to removal TODOs

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Push back the migration code removal to 6 releases in the future

That gives about 3 months' time so that all the shoots on all the Gardener landscapes can be migrated from Loki to Vali.

Note that the migration would happen in a healthy shoot in the very first reconciliation window. If the shoot is broken (e.g. invalid shoot infrastructure credentials) then the reconciliation will break on each attempt and that can delay the migration from Loki to Vali indefinitely. We assume that 3 months are sufficient so that eventually all the shoots are migrated and we can safely remove the migration logic by that time.

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

---------

Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Shafeeque E S <shafeeque.e.s@sap.com>