Scheduler helm charts: always create a PVC #7977
Conversation
volumeClaimTemplates:
  - metadata:
      name: dapr-scheduler-data-dir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      {{- if .Values.cluster.storageClassName }}
This could be a problem.
On various cloud providers, persistent volumes are add-ons that customers are charged for.
Is there a way to not have a persistent volume? Could this be made optional, like for the placement service?
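For illustration, one way the claim could be made optional is to gate the whole `volumeClaimTemplates` block on a chart value. This is only a sketch: the `inMemoryStorage` flag is hypothetical and not part of the chart, while `storageClassName` and `storageSize` follow values already referenced in this PR.

```yaml
{{- /* Sketch only: `inMemoryStorage` is a hypothetical flag, not a real chart value. */}}
{{- if not .Values.cluster.inMemoryStorage }}
volumeClaimTemplates:
  - metadata:
      name: dapr-scheduler-data-dir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      {{- if .Values.cluster.storageClassName }}
      storageClassName: {{ .Values.cluster.storageClassName }}
      {{- end }}
      resources:
        requests:
          storage: {{ .Values.cluster.storageSize }}
{{- end }}
```

With the flag set, no PVC would be created and the scheduler would fall back to the pod's ephemeral storage, with the data-loss caveats for non-HA mode raised elsewhere in this thread.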
I agree, we should definitely have the ability to not require a persistent volume claim, and I would personally expect that to be the default, as it was previously. This also changes the prerequisites for installing Dapr: storage class provisioners are not available on all Kubernetes platforms.
This was changed because etcd requires durable storage when running in HA mode, and unfortunately, this is a strong requirement. If a cluster was created in HA mode without a custom storageClass and one of the pods was restarted for any reason, the etcd server couldn't rejoin the cluster because it couldn't find its data directory (which had been created on the pod's ephemeral drive by default).
This problem doesn't occur in single-node (non-HA) clusters: when the data dir is lost, all information about previous state is lost with it, so after a restart the node simply believes it is booting up in a new cluster.
So when running in HA mode, we must have durable storage if we want to preserve the nodes' identities; etcd was designed that way. An etcd node cannot rejoin a cluster under a previously known identity without the matching data-dir.
So while it is technically possible for us to remove the durable storage requirement, it can only work for non-HA, and it opens up a significant risk of data loss that would need to be made abundantly clear in the docs. If a single-node cluster is restarted, all the scheduled jobs and reminders data will be gone along with the pod's ephemeral drive. That is a final and irreversible event, compared to helm chart errors on install or upgrade, so I guess we have to find the right balance between ease of use and reliability.
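For reference, a hedged sketch of the chart values that would satisfy the HA durability requirement. The value paths follow `dapr_scheduler.cluster.*` as used elsewhere in this PR; the class name is an assumption and varies per cloud provider.

```yaml
# Illustrative values.yaml fragment -- not the authoritative chart defaults.
dapr_scheduler:
  cluster:
    # assumption: a cloud-specific durable StorageClass name
    storageClassName: managed-premium
    # this PR experiments with small defaults (1Gi, later 30Mi for CI)
    storageSize: 1Gi
```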
This was changed because etcd requires durable storage when running in HA mode, and unfortunately, this is a strong requirement. If a cluster was created in HA mode without a custom storageClass and one of the pods was restarted for any reason, the etcd server couldn't rejoin the cluster because it couldn't find its data directory
This is one of the (various) reasons why I was strongly advising not using etcd the entire time :)
Aside from being costly on many cloud providers, this seems very fragile too. Ideally, the etcd cluster should have a way to allow nodes to join and leave. Otherwise, what happens if the volume gets lost, even transiently (e.g. the physical node that has the volume goes offline temporarily)?
From an ops standpoint, Dapr cluster admins now need to be concerned with a stateful control plane service too, which must be run within the cluster itself (unlike databases/Redis/etc which many (most?) Dapr users leverage as PaaS services).
The counter argument is that state has become a critical point for Dapr, not only for actors and actor reminders but now also for workflows, jobs, and upcoming planned features like delayed pub/sub. At this point, state needs to become a first-class citizen in Dapr, where the project is in full control of how it's being operated, maintained, and observed, in order to guarantee the most consistent experience in terms of performance, security, and behavior. Guaranteeing consistency across multiple variants of PaaS services, in different clouds, distributions, and technologies, is very difficult; while it can work well for generic APIs like the state store (with more or less success), it's unlikely to work at such a level for the state that underpins Dapr's own APIs. Remaining in full control of managing state is, in my opinion, worth the trade-off of a StatefulSet with a PVC, which is in itself a concept most Kubernetes operators are already familiar with when running stateful workloads.
* Always try to create a PVC
* Better empty check
* Reduce disk size for scheduler; change default storage size of scheduler to 1Gi
* Reduce scheduler storageSize again
* dapr_scheduler.cluster.storageSize=30Mi
* Changing storage size for redis and scheduler
* Reduce volume size for kafka, postgres and rabbitmq
* Try to free up more disk space

Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: Jake Engelberg <jake@diagrid.io>
The scheduler needs durable storage for all the jobs data, so we must provide a PV for that purpose.
If a storageClass is not provided, the volume is created using the cluster's default storage class.
If the cluster does not have a default storage class, the scheduler pods will not start and the helm install will fail.
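As a sanity check before installing: a cluster's default storage class is the one carrying the `storageclass.kubernetes.io/is-default-class` annotation. A minimal illustrative example (the class name and provisioner here are placeholders, not chart requirements):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard            # hypothetical class name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner  # assumption: placeholder provisioner
```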