Replace Grafana with Plutono, Loki with Vali (gardener#7318)
* Add scripts to replace Grafana with Plutono, Loki with Vali, Promtail with Valitail

These scripts are used to generate the subsequent commits.

  .scripts/run.sh

The scripts are removed later in this PR.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>

* 1. Replace the Grafana GitHub page with that of Plutono

git grep -z -l github\\.com/grafana/grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i -E 's|github.com/grafana/grafana|github.com/credativ/plutono|g'

* 2. Replace the Loki GitHub page with that of Vali

git grep -z -l github\\.com/grafana/loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i -E 's|github.com/grafana/loki|github.com/credativ/vali|g'

* 3. Replace the Grafana container image with that of Plutono

git grep -z -l "repository: eu.gcr.io/gardener-project/3rd/grafana/grafana" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i -E 's|repository: eu.gcr.io/gardener-project/3rd/grafana/grafana|repository: ghcr.io/credativ/plutono|
                      s/tag: "7.5.17"/tag: "v7.5.21"/'

* 4. Replace the Loki and Promtail container images with those of Vali and Valitail

git grep -z -l "repository: eu.gcr.io/gardener-project/3rd/grafana/loki" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i -E 's|repository: eu.gcr.io/gardener-project/3rd/grafana/loki|repository: ghcr.io/credativ/vali|
                      s|repository: eu.gcr.io/gardener-project/3rd/grafana/promtail|repository: ghcr.io/credativ/valitail|
                      s/tag: "2.2.1"/tag: "v2.2.5"/'

* 5. Use the Plutono GitHub page as a generic web link

The generic web link 'https://github.com/credativ/plutono'
is not as specific as some of the previous links, but we do not
have documentation for the Plutono and Vali projects, so it
is up to the reader to find the matching documentation pages
by following the links from the Plutono landing page.

git grep -z -l grafana\\.com -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i -E 's|grafana.com([^) ]*)|github.com/credativ/plutono|g'
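
To make the loss of specificity described in step 5 concrete, the following sketch applies the same substitution to two sample lines. The sample inputs are made up for illustration; only the sed expression is taken from the step above.

```shell
#!/usr/bin/env bash
# Illustration only: the substitution from step 5, applied to stdin.
to_generic_link() {
  sed -E 's|grafana.com([^) ]*)|github.com/credativ/plutono|g'
}

# Made-up sample lines: a Markdown link and a bare URL.
printf '%s\n' 'See [the docs](https://grafana.com/docs/grafana/latest/) for details.' | to_generic_link
printf '%s\n' 'Visit https://grafana.com/oss/loki and read on.' | to_generic_link
```

The character class `[^) ]*` consumes the rest of the URL up to the next closing parenthesis or space, so every path below `grafana.com` collapses into the same landing page.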

* 6. Replace grafana with plutono in folder names

find ./* -type d -name grafana -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' folder
  do
    mv "$folder" "${folder/grafana/plutono}"
  done

find ./* -type l -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' link
  do
    target=$(readlink "$link")
    rm "$link"
    ln -s "${target/grafana/plutono}" "$link"
  done
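
The loops in step 6 rely on the Bash expansion `${var/pattern/replacement}`, which replaces only the first match. A small sketch, using a made-up path, shows the difference to the `//` form that replaces every match:

```shell
#!/usr/bin/env bash
# Made-up path, for illustration only.
path='./charts/grafana/grafana-dashboards'

first_only="${path/grafana/plutono}"    # single slash: first match only
all_matches="${path//grafana/plutono}"  # double slash: every match

echo "$first_only"
echo "$all_matches"
```

For a path that contains the old name only once, both forms give the same result.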

* 7. Replace loki with vali in folder names

find ./* -type d -name loki -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' folder
  do
    mv "$folder" "${folder/loki/vali}"
  done

* 8. Replace promtail with valitail in folder names

find ./* -type d -name promtail -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' folder
  do
    mv "$folder" "${folder/promtail/valitail}"
  done

* 9. Replace grafana with plutono in file names

find ./* -type f -name '*grafana*' -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' file
  do
    mv "$file" "${file/grafana/plutono}"
  done

* 10. Replace loki with vali in file names

find ./* -type f -name '*loki*' -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' file
  do
    mv "$file" "${file/loki/vali}"
  done

* 11. Replace promtail with valitail in file names

find ./* -type f -name '*promtail*' -not -path './vendor/*' -print0 \
| while IFS= read -r -d '' file
  do
    mv "$file" "${file/promtail/valitail}"
  done

* 12. Replace grafana with plutono in file contents

git grep -z -l grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i 's/grafana/plutono/g'

* 13. Replace loki with vali in file contents

git grep -z -l loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' ':!*crd-fluentbit.fluent.io_*outputs.yaml' \
| xargs -0 sed -i 's/loki/vali/g'

* 14. Replace promtail with valitail in file contents

git grep -z -l promtail -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i 's/promtail/valitail/g'

* 15. Replace Grafana with Plutono in file contents

git grep -z -l Grafana -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i 's/Grafana/Plutono/g'

* 16. Replace GF_ with PL_ in file contents

GF_ is the Grafana prefix that is used in environment variables.

git grep -z -l " GF_" -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i 's/GF_/PL_/g'

* 17. Replace Loki with Vali in file contents

git grep -z -l Loki -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' ':!*crd-fluentbit.fluent.io_*outputs.yaml' \
| xargs -0 sed -i 's/Loki/Vali/g'

* 18. Replace Promtail with Valitail in file contents

git grep -z -l Promtail -- ':!/vendor' ':!/.scripts' ':!NOTICE.md' \
| xargs -0 sed -i 's/Promtail/Valitail/g'

* Address PR comments: revert the changes in docs/proposals

RF> We don't maintain GEPs, so revert changes in docs/proposals

git checkout <pr-base> -- docs/proposals

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Remove the Grafana->Plutono, Loki->Vali, Promtail->Valitail replacement scripts

These scripts were used to generate the preceding commits.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>

* make generate

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Update the gardener/logging images to a version that supports vali

The PR gardener/logging#186 in the
gardener/logging repository replaces Loki with Vali.

This commit updates the 4 images to a release that contains that change.

  - gardener/fluent-bit-to-vali
  - gardener/vali-curator
  - gardener/telegraf-iptables
  - gardener/event-logger

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Fix the integration test: test/integration/gardenlet/shoot/care

Grafana was replaced with Plutono, and the missing deployments are sorted
alphabetically, so we need to change the order to let the integration
tests pass.
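
A minimal sketch of why the expected order changes, assuming a bytewise alphabetical sort. The deployment names other than grafana and plutono are hypothetical and not taken from the test:

```shell
#!/usr/bin/env bash
# Sort the given names bytewise, one per line.
sorted_names() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

# Hypothetical sets of missing deployments, before and after the rename.
sorted_names kube-state-metrics grafana kube-apiserver
sorted_names kube-state-metrics plutono kube-apiserver
```

With these sample names, `grafana` sorts first while `plutono` sorts last, so any expectation that lists the names in sorted order has to move the renamed entry.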

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Adjust the Vali prefix (and keep the Plutono prefix)

The Vali prefix is used to ship logs from the shoot nodes to the control plane
and hence it can be changed consistently during the shoot reconciliation.

The Plutono prefix is deliberately not changed here and will be adjusted in a
future PR. Currently the Grafana/Plutono prefix (gu) is also hard-coded in the
Dashboard. Changing it now would require a coordinated release with the
Dashboard and would show an incorrect link in the Dashboard until the shoot
cluster is reconciled. We therefore postpone this cleanup to a future PR, where
the Dashboard may be able to pick the Plutono URL from the monitoring secret and
hence support a smooth transition to a new prefix.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Improve logging cleanup

Previously the DeleteVali function did not clean up all the logging
artifacts properly. This likely went unnoticed because the logging stack is
rarely deleted. We noticed it during the migration from loki to vali.
The ConfigMap vali-config actually has a hash suffix and can not be
deleted by name.
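
Because the full name is not known in advance, the ConfigMap has to be found by pattern rather than by a fixed name. The sketch below filters a list of names for a given prefix followed by a hash suffix. The helper name and the 8-character hex suffix are assumptions of this sketch, not the Go code of this commit.

```shell
#!/usr/bin/env bash
# Print the names from stdin that look like "<prefix>-<8 hex chars>".
# The suffix format is an assumption made for this illustration.
select_hashed_names() {
  local prefix="$1"
  grep -E "^${prefix}-[0-9a-f]{8}\$" || true
}

# Made-up names: only the first one carries a hash suffix on the right prefix.
printf '%s\n' vali-config-1a2b3c4d vali-config telegraf-config-0f0f0f0f \
  | select_hashed_names vali-config
```

The selected names could then be passed to `kubectl delete configmap`, whereas deleting the literal name `vali-config` would not match the real object.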

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Fix data source names for the extension dashboards

If an extension contributes dashboards to Grafana/Plutono, their annotation
data source may need to be adjusted to the Grafana/Plutono transition,
because it may contain the hardcoded string `Grafana`, which needs to be
replaced with `Plutono`. Dashboards could also be using a hardcoded string
`loki` as a data source for showing logs, which needs to be renamed to
`vali`. We replace these strings in memory. This is to ensure that when we
transition to Plutono/Vali, extension dashboards can be loaded successfully
in Plutono. Later, and without time pressure, we can adapt the dashboards
in the individual extension repositories.

We cloned all the extensions in the Gardener GitHub organization and checked the
occurrences of the Grafana or Loki strings with this command:

  $ for i in gardener-extension-*; do echo $i; cd $i; git grep -i -r -E 'grafana|loki' -- ':!/vendor'; cd - >/dev/null; done
  gardener-extension-networking-calico
  charts/internal/calico-monitoring/calico-felix-dashboard.json:        "datasource": "-- Grafana --",
  charts/internal/calico-monitoring/calico-typha-dashboard.json:        "datasource": "-- Grafana --",
  gardener-extension-networking-cilium
  charts/internal/cilium-monitoring/cilium-agent-metrics-dashboard.json:        "datasource": "-- Grafana --",
  charts/internal/cilium-monitoring/cilium-operator-metrics-dashboard.json:        "datasource": "-- Grafana --",
  charts/internal/cilium-monitoring/hubble-metrics-dashboard.json:        "datasource": "-- Grafana --",
  gardener-extension-os-coreos
  gardener-extension-os-gardenlinux
  gardener-extension-os-ubuntu
  gardener-extension-provider-alicloud
  gardener-extension-provider-aws
  gardener-extension-provider-azure
  gardener-extension-provider-equinix-metal
  gardener-extension-provider-gcp
  gardener-extension-provider-openstack
  gardener-extension-provider-vsphere
  .ci/terraform/local-values.yaml:  loki:
  gardener-extension-registry-cache
  gardener-extension-runtime-gvisor
  gardener-extension-shoot-cert-service
  charts/internal/shoot-cert-management-seed/cert-dashboard.json:      "datasource": "loki",
  gardener-extension-shoot-dns-service
  gardener-extension-shoot-networking-filter
  gardener-extension-shoot-networking-problemdetector
  charts/internal/shoot-network-problem-detector-controller-seed/nwpd-dashboard.json:        "datasource": "-- Grafana --",
  gardener-extension-shoot-networking-traffic-gauger
  gardener-extension-shoot-oidc-service

This suggests that this commit covers all the 7 cases where a replacement is
needed.
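
The replacement itself can be sketched with sed on the two hardcoded strings that appear in the listing above. This is only an illustration of the string rewrite; the commit performs it in memory in Gardener's Go code, and the exact matching rules there may differ.

```shell
#!/usr/bin/env bash
# Rewrite the two hardcoded data source names found in the listing above.
rename_datasources() {
  sed -e 's/"datasource": "-- Grafana --"/"datasource": "-- Plutono --"/g' \
      -e 's/"datasource": "loki"/"datasource": "vali"/g'
}

printf '%s\n' \
  '        "datasource": "-- Grafana --",' \
  '      "datasource": "loki",' \
  | rename_datasources
```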

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Add the hash of the labels as a suffix to the fluentbit custom resource

The fluentbit custom resource contains a single set of labels that is used for:

- labels of the daemon set
- label selector of the daemon set
- labels of the pod template

Generally it is possible to change the labels of the pod template, but the label
selector field of a daemonset is immutable. The fluent operator uses the same
set of labels for both, which means that we can not change the pod template
labels.

A mitigation is to add a hash of the labels to the name of the fluentbit custom
resource. When we change the labels, we'll get a new hash. The gardener resource
manager will delete the old fluentbit resource and create a new one. The
fluentbit operator will delete the old daemon set and create a new one. Hence we
won't face the issue of trying to change an immutable field in the daemonset.
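
The naming idea can be sketched as follows. The helper name, the hash function (SHA-256), and the 8-character suffix are assumptions chosen for illustration; the commit's Go implementation may compute the suffix differently.

```shell
#!/usr/bin/env bash
# Build "<base>-<hash>", where <hash> is derived from the sorted labels.
# Hash function and suffix length are assumptions of this sketch.
name_with_label_hash() {
  local base="$1"; shift
  local hash
  hash=$(printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -c1-8)
  echo "${base}-${hash}"
}

# Hypothetical labels: changing one value yields a different resource name.
name_with_label_hash fluent-bit app=fluent-bit role=logging
name_with_label_hash fluent-bit app=fluent-bit role=logging-v2
```

Sorting the labels first makes the name independent of their order, so only a real change of the label set produces a new name and thus a new resource.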

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* --- Empty separator commit

The following commits shall be reverted in a future release because they are
only needed for the transition phase.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Delete Grafana

As part of the migration from Grafana to Plutono, we delete the Grafana
deployment and related resources. They are replaced by the Plutono artefacts.

The newly added functions are copies of their "Plutono" counterparts.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Delete Loki (except for its PVC+PV)

During the migration from Loki to Vali, we delete the Loki StatefulSet and
related resources. They are replaced by the Vali artefacts.

The exception here is the PersistentVolumeClaim, which can be reused with Vali,
thereby preserving the logs stored in Loki.

The gardenlet needs to be able to delete the old loki artefacts, hence the
cluster role is adjusted.

The newly added functions are copies of their "Vali" counterparts.

The deletion is guarded by the lokiPvcExists condition: the existence of a
loki-loki-0 persistent volume claim indicates that there is a loki based logging
stack and it shall be deleted. The existing unit tests are adjusted so that the
mock client returns false for the lokiPvcExists condition.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename the Loki PVC to Vali (!)

The name of a PVC is derived from

- the name of the pod which is created by the StatefulSet, e.g. vali-0, and
- the name of the volume claim template in the StatefulSet, e.g. vali.

So the `loki` StatefulSet uses a `loki-loki-0` PVC, whereas the new `vali`
StatefulSet uses a `vali-vali-0` PVC.
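
As a small sketch of this naming rule (the helper `pvc_name` is hypothetical and only concatenates the parts):

```shell
#!/usr/bin/env bash
# PVC name = <volume claim template name>-<pod name>,
# where the pod name is <StatefulSet name>-<ordinal>.
pvc_name() {
  local template="$1" statefulset="$2" ordinal="$3"
  echo "${template}-${statefulset}-${ordinal}"
}

pvc_name loki loki 0
pvc_name vali vali 0
```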

The StatefulSet creates a new PVC based on the PVC template if one does not
exist, or it uses the existing one with the matching name.

The intention of this PR is to reuse the disk of Loki for Vali so that the 2
weeks worth of logs that were ingested into Loki are not lost due to the
migration to Vali.

Hence, to reuse the disk of the Loki StatefulSet for the Vali StatefulSet, we
need to rename the `loki-loki-0` PVC to `vali-vali-0`.

It is unfortunately not possible to "simply" rename a PVC: the name is part of
the identity of a Kubernetes resource and it can not be changed. However, with
the approach described in
gardener#7318 (comment) it is
possible to create a new PVC with the new name `vali-vali-0` that is bound to
the persistent volume of the `loki-loki-0` PVC.

The trick is to change the PV's ReclaimPolicy temporarily to Retain and delete
the Loki PVC. The PV will not be deleted due to the Retain reclaim policy. After
setting the claimRef of the PV to nil, a new Vali PVC can be created which will
be bound by the PV/PVC controller in the kube controller manager to the existing
PV. Ultimately this leads to a state that is similar to "simply" renaming the
PVC.

An executable example flow to rename loki to vali in the control plane looks
like this:

    #!/bin/bash -exv

    # Mark the volume of loki
    kubectl exec loki-0 -c loki -- sh -c "echo loki > /data/hello"

    kubectl get sts loki -o yaml        | kubectl neat > sts.yml
    kubectl get pvc loki-loki-0 -o yaml | kubectl neat > pvc.yml

    kubectl delete sts loki

    pv_id=$(kubectl get pvc loki-loki-0 -o jsonpath='{.spec.volumeName}')
    kubectl patch pv "$pv_id" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

    kubectl delete pvc loki-loki-0
    kubectl patch pv "$pv_id" -p '{"spec":{"claimRef": null }}'

    sed 's/loki-loki-0/vali-vali-0/g' pvc.yml | kubectl apply -f -
    sed 's/name: loki$/name: vali/g'  sts.yml | kubectl apply -f -

    kubectl patch pv "$pv_id" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'

    # Wait for the vali StatefulSet to be ready
    kubectl rollout status sts/vali

    # Verify that the volume of vali is the former loki volume
    kubectl exec vali-0 -c vali -- cat /data/hello

Hopefully the explanation above helps to follow the code that is added in this
commit.

In case of unexpected errors, we accept that we can not reuse the Loki PV and
attempt to clean up the PV/PVC so that the next reconciliation can proceed with
ultimately creating a new PV for Vali.

Note that after we successfully deleted the Loki PVC, we can't easily find in
the next iteration the corresponding (unnamespaced) Loki PV, hence we try to
do the clean up in the first iteration.

One expected error case is when the 30s deadline of the shoot reconciliation is
reached, so we use a new context with a 1 minute deadline for the more critical
steps in the second half of the flow and also for the recovery steps after an
unexpected error.

Note that the loki-0 pod currently can not terminate gracefully in a timely
manner: it is eventually killed by the kubelet after 30s. We reduce the graceful
termination timeout to 5 seconds when deleting Loki because based on its logs,
loki is itself waiting for 30s after an almost complete graceful shutdown in 1s.
By reducing the graceful termination timeout we can use the remaining time to
proceed with the flow above.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename the loki folder on loki's disk to vali

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Shafeeque E S <shafeeque.e.s@sap.com>

* Delete the seed's garden Grafana artifacts

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Rename Loki's PVC to Vali and delete Loki in the garden namespace

This commit can be reverted in a future release of Gardener.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Allow the fluentbits to talk to loki during the transition phase

When the gardenlet is updated to this PR, but the shoots are not yet
reconciled, the fluentbits in the garden namespace are sending
the control plane logs to the logging service in the control plane
which still points to the loki-0 pod.

By adding this label, the generated network policies allow the network
communication between the fluentbits and loki-0.

When the shoot is reconciled, loki-0 is going to be replaced by vali-0,
and that case is already covered by the network policy label in the line
above.

This commit can be reverted in a future release of Gardener.

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Improve health check during the transition period

In the time period between a seed being upgraded to the release that contains
this PR and the shoot being reconciled, the seed's gardenlet should not expect
the Plutono/Vali artifacts to be healthy, but instead expect the old
Grafana/Loki deployment or statefulset to be there.

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Use wait.PollUntilWithContext instead of a for loop

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Stop the promtail service on the shoot nodes

The logs from the pods in the shoot cluster's kube-system namespace are sent to
Loki/Vali in the control plane (in the seed) by the promtail/valitail systemd
service that runs on each shoot node.

The promtail.service starts promtail, the auxiliary promtail-fetch-token.service
periodically fetches a bearer token that is used to authenticate promtail with
Loki.

This PR so far consistently renamed promtail to valitail, and hence when
applied, a new valitail.service and valitail-fetch-token.service will be added
to systemd on each shoot node. However, this generic rename approach does not
stop/clean up the old promtail systemd services which would continue to run on
the "updated" shoot nodes. The old promtail services would not "work" because
promtail is not compatible with Vali, but they would still run and unnecessarily
consume some memory (~50MB) on the shoot nodes.

This commit takes care of stopping the promtail services to avoid unnecessarily
using memory on the shoot nodes. Note that the promtail services are not
explicitly "removed" by this commit, they are just stopped: it is sufficient to
do so to prevent using resources unnecessarily. When the node is recreated due
to an OS/Kubernetes version upgrade or due to cluster autoscaling events, only
the new valitail services are going to be provisioned. In that pristine state,
upon startup, a single log line in journalctl will show that the valitail
service attempted to stop the no longer existing promtail services.

This commit can be reverted in a future release. Note that we need to stop the
promtail services only once on updated nodes, so it is fine to revert this
commit in the next Gardener release even if the shoot nodes are not recreated by
then.
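
A minimal sketch of the stop step, assuming it runs as a shell snippet on the node. How the commit actually wires this into the valitail unit is not shown here. The unit names come from the description above, while the helper name and the `SYSTEMCTL` override are hypothetical.

```shell
#!/usr/bin/env bash
# Stop the old promtail units. SYSTEMCTL can be overridden (for example
# with "echo") to get a dry run; that override is an assumption of this sketch.
stop_legacy_units() {
  local ctl="${SYSTEMCTL:-systemctl}"
  local unit
  for unit in promtail.service promtail-fetch-token.service; do
    # Tolerate failures, e.g. on a fresh node where the unit never existed.
    "$ctl" stop "$unit" || echo "could not stop $unit (ignored)" >&2
  done
}

# Dry run: print the commands instead of executing them.
SYSTEMCTL=echo stop_legacy_units
```

Ignoring the failure keeps the step harmless on recreated nodes, where only the valitail services exist.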

Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Address PR comments: add release v1.72 to removal TODOs

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

* Push back the migration code removal to 6 releases in the future

That gives about 3 months time so that all the shoots on all the Gardener
landscapes can be migrated from Loki to Vali.

Note that the migration would happen in a healthy shoot in the very first
reconciliation window. If the shoot is broken (e.g. invalid shoot infrastructure
credentials) then the reconciliation will break on each attempt and that can
delay the migration from Loki to Vali indefinitely. We assume that 3 months are
sufficient so that eventually all the shoots are migrated and we can safely
remove the migration logic by that time.
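
The arithmetic behind this estimate can be written down as a short sketch. The starting release v1.71 is inferred from the removal TODOs (v1.72 at first, v1.77 after this change), and the two-week release cadence is an assumption that matches "6 releases, about 3 months".

```shell
#!/usr/bin/env bash
current_minor=71      # inferred: v1.77 minus the 6 releases mentioned above
releases_to_wait=6
weeks_per_release=2   # assumed cadence

removal_minor=$((current_minor + releases_to_wait))
weeks=$((releases_to_wait * weeks_per_release))

echo "remove migration code in v1.${removal_minor} (about ${weeks} weeks later)"
```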

This commit can be reverted in a future release.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

---------

Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Kristian-ZH <k.zhelyazkov@sap.com>
Co-authored-by: Niki Dokovski <nickytd@gmail.com>
Co-authored-by: Shafeeque E S <shafeeque.e.s@sap.com>
5 people authored May 16, 2023
1 parent a57427e commit 48bd154
Showing 189 changed files with 2,264 additions and 1,434 deletions.
15 changes: 15 additions & 0 deletions charts/gardener/gardenlet/templates/clusterrole-gardenlet.yaml
@@ -20,6 +20,19 @@ rules:
- get
- list
- watch
- apiGroups: # TODO (istvanballok,rickardsjp): remove in release v1.77. Patching persistent volumes is needed for the Loki -> Vali migration.
- ""
resources:
- persistentvolumes
verbs:
- delete
- patch
- apiGroups:
- ""
resources:
- persistentvolumeclaims
verbs:
- create
- apiGroups:
- ""
resources:
@@ -90,6 +103,7 @@ rules:
- persistentvolumeclaims
resourceNames:
- alertmanager-db-alertmanager-0
- vali-vali-0
- loki-loki-0
- prometheus-db-prometheus-0
verbs:
@@ -214,6 +228,7 @@ rules:
- kube-controller-manager
- aggregate-prometheus
- prometheus
- vali
- loki
verbs:
- delete
40 changes: 20 additions & 20 deletions charts/images.yaml
@@ -229,10 +229,10 @@ images:
integrity_requirement: high
availability_requirement: low
comment: the node-exporter is also deployed to the shoot cluster
- name: grafana
sourceRepository: github.com/grafana/grafana
repository: eu.gcr.io/gardener-project/3rd/grafana/grafana
tag: "7.5.17"
- name: plutono
sourceRepository: github.com/credativ/plutono
repository: ghcr.io/credativ/plutono
tag: "v7.5.21"
labels:
- name: gardener.cloud/cve-categorisation
value:
@@ -391,10 +391,10 @@ images:
availability_requirement: 'low'
- name: fluent-bit-plugin-installer
resourceId:
name: fluent-bit-to-loki
name: fluent-bit-to-vali
sourceRepository: github.com/gardener/logging
repository: eu.gcr.io/gardener-project/gardener/fluent-bit-to-loki
tag: "v0.52.0"
repository: eu.gcr.io/gardener-project/gardener/fluent-bit-to-vali
tag: "v0.53.0"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -405,10 +405,10 @@ images:
integrity_requirement: 'none'
availability_requirement: 'none'
comment: no data is stored or processed by the installer
- name: loki
sourceRepository: github.com/grafana/loki
repository: eu.gcr.io/gardener-project/3rd/grafana/loki
tag: "2.2.1"
- name: vali
sourceRepository: github.com/credativ/vali
repository: ghcr.io/credativ/vali
tag: "v2.2.5"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -418,10 +418,10 @@ images:
confidentiality_requirement: 'high'
integrity_requirement: 'high'
availability_requirement: 'low'
- name: loki-curator
- name: vali-curator
sourceRepository: github.com/gardener/logging
repository: eu.gcr.io/gardener-project/gardener/loki-curator
tag: "v0.52.0"
repository: eu.gcr.io/gardener-project/gardener/vali-curator
tag: "v0.53.0"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -445,10 +445,10 @@ images:
integrity_requirement: 'high'
availability_requirement: 'low'
comment: kube-rbac-proxy is an authentication proxy working with credentials
- name: promtail
sourceRepository: github.com/grafana/loki
repository: eu.gcr.io/gardener-project/3rd/grafana/promtail
tag: "2.2.1"
- name: valitail
sourceRepository: github.com/credativ/vali
repository: ghcr.io/credativ/valitail
tag: "v2.2.5"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -464,7 +464,7 @@ images:
name: telegraf-iptables
sourceRepository: github.com/gardener/logging
repository: eu.gcr.io/gardener-project/gardener/telegraf-iptables
tag: "v0.52.0"
tag: "v0.53.0"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -480,7 +480,7 @@ images:
- name: event-logger
sourceRepository: github.com/gardener/logging
repository: eu.gcr.io/gardener-project/gardener/event-logger
tag: "v0.52.0"
tag: "v0.53.0"
labels:
- name: 'gardener.cloud/cve-categorisation'
value:
@@ -54,7 +54,7 @@ tests:
seed: aws
input_series:
# FluentBitReceivesLogsWithoutMetadata
- series: 'fluentbit_loki_gardener_logs_without_metadata_total{pod="fluent-bit-test"}'
- series: 'fluentbit_vali_gardener_logs_without_metadata_total{pod="fluent-bit-test"}'
values: '0+0x3 0+1x30'
alert_rule_test:
- eval_time: 22m
@@ -92,19 +92,19 @@ tests:
exp_annotations:
description: >
fluent-bit-test on seed: aws sends OutOfOrder logs
to the Loki. These logs will be dropped.
to the Vali. These logs will be dropped.
summary: Fluent-bit sends OoO logs

- interval: 1m
external_labels:
seed: aws
input_series:
# FluentBitGardenerLokiPluginErrors
- series: 'fluentbit_loki_gardener_errors_total{pod="fluent-bit-test"}'
# FluentBitGardenerValiPluginErrors
- series: 'fluentbit_vali_gardener_errors_total{pod="fluent-bit-test"}'
values: '0+0x3 0+1x30'
alert_rule_test:
- eval_time: 22m
alertname: FluentBitGardenerLokiPluginErrors
alertname: FluentBitGardenerValiPluginErrors
exp_alerts:
- exp_labels:
pod: fluent-bit-test
@@ -114,7 +114,7 @@ tests:
visibility: operator
exp_annotations:
description: >
There are errors in the fluent-bit-test GardenerLoki plugin on seed:
There are errors in the fluent-bit-test GardenerVali plugin on seed:
aws.
summary: Errors in Fluent-bit GardenerLoki plugin
summary: Errors in Fluent-bit GardenerVali plugin

@@ -1,5 +1,5 @@
rule_files:
- ../aggregate-prometheus-rules/loki.rules.yaml
- ../aggregate-prometheus-rules/vali.rules.yaml

evaluation_interval: 30s

@@ -8,19 +8,19 @@ tests:
external_labels:
seed: aws
input_series:
# LokiDown
- series: 'up{app="loki"}'
# ValiDown
- series: 'up{app="vali"}'
values: '0+0x30'
alert_rule_test:
- eval_time: 30m
alertname: LokiDown
alertname: ValiDown
exp_alerts:
- exp_labels:
service: logging
severity: warning
type: seed
visibility: operator
exp_annotations:
description: "There are no loki pods running on seed: aws. No logs will be collected."
summary: Loki is down
description: "There are no vali pods running on seed: aws. No logs will be collected."
summary: Vali is down

@@ -42,7 +42,7 @@ groups:
expr: |
sum by (pod) (
increase(
fluentbit_loki_gardener_logs_without_metadata_total[4m]
fluentbit_vali_gardener_logs_without_metadata_total[4m]
)
) > 0
labels:
@@ -72,13 +72,13 @@ groups:
summary: Fluent-bit sends OoO logs
description: >
{{$labels.pod}} on seed: {{$externalLabels.seed}} sends OutOfOrder logs
to the Loki. These logs will be dropped.
to the Vali. These logs will be dropped.
- alert: FluentBitGardenerLokiPluginErrors
- alert: FluentBitGardenerValiPluginErrors
expr: |
sum by (pod) (
increase(
fluentbit_loki_gardener_errors_total[4m]
fluentbit_vali_gardener_errors_total[4m]
)
) > 0
labels:
@@ -87,8 +87,8 @@ groups:
type: seed
visibility: operator
annotations:
summary: Errors in Fluent-bit GardenerLoki plugin
summary: Errors in Fluent-bit GardenerVali plugin
description: >
There are errors in the {{$labels.pod}} GardenerLoki plugin on seed:
There are errors in the {{$labels.pod}} GardenerVali plugin on seed:
{{$externalLabels.seed}}.
@@ -1,14 +1,14 @@
groups:
- name: loki.rules
- name: vali.rules
rules:
- alert: LokiDown
expr: absent(up{app="loki"} == 1)
- alert: ValiDown
expr: absent(up{app="vali"} == 1)
for: 30m
labels:
service: logging
severity: warning
type: seed
visibility: operator
annotations:
description: "There are no loki pods running on seed: {{ .ExternalLabels.seed }}. No logs will be collected."
summary: Loki is down
description: "There are no vali pods running on seed: {{ .ExternalLabels.seed }}. No logs will be collected."
summary: Vali is down
4 changes: 0 additions & 4 deletions charts/seed-bootstrap/charts/loki/Chart.yaml

This file was deleted.

4 changes: 4 additions & 0 deletions charts/seed-bootstrap/charts/vali/Chart.yaml
@@ -0,0 +1,4 @@
apiVersion: "v1"
name: vali
version: 0.28.1
description: "Vali: like Prometheus, but for logs."
@@ -1,5 +1,5 @@
{{- define "loki.config.data" -}}
loki.yaml: |-
{{- define "vali.config.data" -}}
vali.yaml: |-
auth_enabled: {{ .Values.authEnabled }}
ingester:
chunk_target_size: 1536000
@@ -29,17 +29,17 @@ loki.yaml: |-
http_listen_port: 3100
storage_config:
boltdb:
directory: /data/loki/index
directory: /data/vali/index
filesystem:
directory: /data/loki/chunks
directory: /data/vali/chunks
chunk_store_config:
max_look_back_period: 360h
table_manager:
retention_deletes_enabled: true
retention_period: 360h
curator.yaml: |-
LogLevel: info
DiskPath: /data/loki/chunks
DiskPath: /data/vali/chunks
TriggerInterval: 1h
InodeConfig:
MinFreePercentages: 10
@@ -49,7 +49,7 @@ curator.yaml: |-
MinFreePercentages: 10
TargetFreePercentages: 15
PageSizeForDeletionPercentages: 1
loki-init.sh: |-
vali-init.sh: |-
#!/bin/bash
set -o errexit

@@ -63,8 +63,8 @@ loki-init.sh: |-
tune2fs -O large_dir $(mount | gawk '{if($3=="/data") {print $1}}')
{{- end -}}

{{- define "loki.config.name" -}}
loki-config-{{ include "loki.config.data" . | sha256sum | trunc 8 }}
{{- define "vali.config.name" -}}
vali-config-{{ include "vali.config.data" . | sha256sum | trunc 8 }}
{{- end }}

{{- define "telegraf.config.data" -}}
@@ -90,7 +90,7 @@ telegraf.conf: |+
start.sh: |+
#/bin/bash

iptables -A INPUT -p tcp --dport {{ .Values.kubeRBACProxy.port }} -j ACCEPT -m comment --comment "promtail"
iptables -A INPUT -p tcp --dport {{ .Values.kubeRBACProxy.port }} -j ACCEPT -m comment --comment "valitail"
/usr/bin/telegraf --config /etc/telegraf/telegraf.conf
{{- end -}}

@@ -4,7 +4,7 @@
apiVersion: autoscaling.k8s.io/v1alpha1
kind: Hvpa
metadata:
name: loki
name: vali
namespace: {{ .Release.Namespace }}
labels:
{{ toYaml .Values.labels | indent 4 }}
@@ -17,12 +17,12 @@ spec:
hpa:
selector:
matchLabels:
role: loki-hpa
role: vali-hpa
deploy: false
template:
metadata:
labels:
role: loki-hpa
role: vali-hpa
spec:
maxReplicas: {{ .Values.replicas }}
minReplicas: {{ .Values.replicas }}
@@ -38,7 +38,7 @@ spec:
vpa:
selector:
matchLabels:
role: loki-vpa
role: vali-vpa
deploy: true
scaleUp:
updatePolicy:
@@ -63,11 +63,11 @@ spec:
template:
metadata:
labels:
role: loki-vpa
role: vali-vpa
spec:
resourcePolicy:
containerPolicies:
- containerName: loki
- containerName: vali
controlledValues: RequestsOnly
maxAllowed:
memory: {{ .Values.hvpa.maxAllowed.memory }}
@@ -92,6 +92,6 @@ spec:
targetRef:
apiVersion: {{ include "statefulsetversion" . }}
kind: StatefulSet
name: loki
name: vali
{{ end }}
{{ end }}
@@ -7,7 +7,7 @@ metadata:
kubernetes.io/ingress.class: {{ .Values.ingress.class }}
{{- end }}
nginx.ingress.kubernetes.io/configuration-snippet: "proxy_set_header X-Scope-OrgID operator;"
name: loki
name: vali
namespace: {{ .Release.Namespace }}
labels:
{{ toYaml .Values.labels | indent 4 }}
@@ -1,4 +1,4 @@
# This service duplicates the soon to be removed loki service.
# This service duplicates the soon to be removed vali service.
# That serves the migration plan described at https://github.com/gardener/gardener/issues/7585
apiVersion: v1
kind: Service