
☂️ Gardener ETCD Operator a.k.a. ETCD Druid #2


Description

Feature (What you would like to be added):
Summarise the roadmap for etcd-druid with links to the corresponding issues.

Motivation (Why is this needed?):
A central place to collect the roadmap as well as the progress.

Approach/Hint to implement the solution (optional):

  • Basic Controller
    • Define CRD types (a minimal sketch of possible types is included after this list)
    • Implement a basic controller that deploys a StatefulSet (with replicas: 1) with the etcd and etcd-backup-restore containers, the same way it is done now.
    • Unit tests
    • Integration tests
  • Propagate the etcd defragmentation schedule from the CRD to the etcd-backup-restore sidecar container.
  • Trigger a full snapshot before hibernation/scale-down (a sketch of such a trigger is included after this list).
  • Backup compaction
    • Incremental/continuous backups are used for finer-grained backups (in the order of minutes), with full snapshots taken at much larger intervals (in the order of hours). This makes backups efficient in terms of disk, network bandwidth and backup storage space utilization, as well as compute resource utilization during backup.
    • If the proportion of changes captured in the incremental backups is large, this impacts restoration times, because incremental backups can only be restored in sequence.
    • #61@etcd-backup-restore.
  • Multi-node etcd cluster
    • All etcd nodes within the same Kubernetes cluster.
      • I.e., one CRD instance would provision multiple etcd nodes in the same Kubernetes cluster/namespace as the CRD instance.
      • Enhance CRD types to address the use-case
      • Scale sub-resource implementation for the current CRD
      • Add/promote etcd learners/members during scale up, including quorum adjustment (a learner add/promote sketch is included after this list).
      • Remove etcd members during scale down, including quorum adjustment.
      • Handle backup/restore in the different states of the etcd cluster
      • Multi-AZ support
        • I.e. etcd nodes distributed across availability zones in the hosting Kubernetes cluster
    • Each etcd node in a different Kubernetes cluster.
      • I.e. each etcd node will be provisioned via a separate CRD instance in a different Kubernetes cluster but these nodes will be configured to find each other to form an etcd cluster.
      • There will be as many CRD instances as the number of nodes in the etcd cluster.
      • #233@gardener.
      • Enhance CRD types to address the use-case
      • Add/promote etcd learners/members during scale up, including quorum adjustment.
      • Remove etcd members during scale down, including quorum adjustment.
      • Handle backup/restore in the different states of the etcd cluster
  • Non-disruptive Autoscaling
    • The VerticalPodAutoscaler supports multiple update policies including recreate, initial and off.
    • The recreate policy is clearly not suitable for single-node etcd instances because it implies frequent, unpredictable and unmanaged downtime.
    • The initial policy does not make sense for etcd considering the long database verification time after a non-graceful shutdown.
    • For a single-node etcd instance, vertical scaling via the VerticalPodAutoscaler would always be disruptive because of the way scaling is done by VPA. It gives no opportunity to take action before the etcd pod(s) are disrupted for scaling.
    • A controller can co-ordinate the etcd-specific steps to mitigate the disruption during (vertical) scaling if the scaling is applied to the CRD instance rather than to the individual pods directly (a coordinated-resize sketch is included after this list).
  • Non-disruptive Updates
    • For a single-node etcd instance, updates would be disruptive.
    • A controller can co-ordinate the etcd-specific steps to mitigate the disruption during updates.
  • Database Restoration
    • Database restoration is currently also done on startup (or restart), if database verification fails, within the backup-restore sidecar's main process.
    • Introducing a controller enables the option to perform database restoration as a separate job (a restoration Job sketch is included after this list).
    • The main advantage of this approach is to decouple the memory requirement of a database restoration from the regular backup (full and delta) tasks.
    • This is especially interesting because delta snapshot restoration requires an embedded etcd instance, which makes the memory requirement for database restoration almost certainly proportionate to the database size. The memory requirement for backups (full and delta), however, need not be proportionate to the database size at all; in fact, it is very realistic to expect it to be more or less independent of the database size.
  • Migration for major updates
    • Data and/or backup migration during major updates which change the data and/or backup format or location.
  • Backup Health Verification
    • Currently, we rely on the database backups in the storage provider to remain healthy. There are no additional checks to verify if the backups are still healthy after upload.
    • A controller can be used to perform such backup health verification asynchronously (a verification sketch is included after this list).
  • [Feature] Enhance scope of druid to create/manage additional resources done elsewhere #505
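
For illustration, below is a minimal sketch of what the CRD types for the basic controller could look like. The Go package, type and field names are assumptions for the sake of the example, not a proposal for the final API; the scale sub-resource marker only hints at how scaling of the CRD could be wired up later.

```go
// Sketch of possible CRD types for the basic controller (names are assumptions).
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EtcdSpec is the desired state; the basic controller starts with replicas: 1.
type EtcdSpec struct {
	Replicas int32 `json:"replicas"`
	// DefragmentationSchedule is a cron expression propagated to the
	// etcd-backup-restore sidecar container.
	DefragmentationSchedule *string `json:"defragmentationSchedule,omitempty"`
	// Etcd configures the etcd container.
	Etcd EtcdConfig `json:"etcd"`
	// Backup configures the etcd-backup-restore sidecar container.
	Backup BackupSpec `json:"backup"`
}

type EtcdConfig struct {
	Image     string                      `json:"image,omitempty"`
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}

type BackupSpec struct {
	Image                string  `json:"image,omitempty"`
	FullSnapshotSchedule *string `json:"fullSnapshotSchedule,omitempty"`
	DeltaSnapshotPeriod  *string `json:"deltaSnapshotPeriod,omitempty"`
}

// EtcdStatus is maintained by the controller after reconciling the StatefulSet.
type EtcdStatus struct {
	Ready    bool  `json:"ready"`
	Replicas int32 `json:"replicas"`
}

// Etcd is reconciled into a StatefulSet running the etcd and
// etcd-backup-restore containers the same way it is done today.
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas
type Etcd struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EtcdSpec   `json:"spec,omitempty"`
	Status EtcdStatus `json:"status,omitempty"`
}
```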
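
A sketch of how the controller could trigger a full snapshot before hibernation/scale-down, assuming the etcd-backup-restore sidecar exposes an on-demand full-snapshot HTTP endpoint; the endpoint path and port used here are assumptions.

```go
// Sketch: ask the etcd-backup-restore sidecar for an out-of-schedule full
// snapshot before the etcd pod is hibernated or scaled down. The endpoint
// path and port are assumptions about the sidecar's HTTP API.
package druid

import (
	"context"
	"fmt"
	"net/http"
)

func triggerFullSnapshot(ctx context.Context, podIP string) error {
	url := fmt.Sprintf("http://%s:8080/snapshot/full", podIP) // port/path assumed
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("full snapshot request failed: %s", resp.Status)
	}
	return nil
}
```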
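
A sketch of the learner-based scale-up flow using the etcd client (go.etcd.io/etcd/client/v3): the new node is added as a non-voting learner and only promoted, and thereby only affects quorum, once it has caught up. The retry/readiness handling is deliberately simplified.

```go
// Sketch: scale up by adding the new node as a learner and promoting it only
// once it has caught up, so quorum is adjusted when the member can actually vote.
package druid

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func addAndPromoteMember(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	// Add the new node as a non-voting learner first.
	resp, err := cli.MemberAddAsLearner(ctx, []string{peerURL})
	if err != nil {
		return err
	}
	learnerID := resp.Member.ID

	// MemberPromote fails while the learner still lags behind the leader,
	// so retry until promotion succeeds (simplified; no backoff cap here).
	for {
		if _, err := cli.MemberPromote(ctx, learnerID); err == nil {
			return nil // now a voting member; quorum size changes at this point
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}
```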
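
A sketch of how a controller could co-ordinate a less disruptive vertical resize than a direct VPA pod eviction: take a full snapshot first, then apply the new resources to the StatefulSet and let the (single) pod restart proceed. The helper signature and the assumption that etcd is the first container are illustrative only.

```go
// Sketch: co-ordinated vertical resize of a single-node etcd driven by the
// controller rather than by direct VPA pod eviction. takeFullSnapshot is
// expected to behave like the trigger sketched above; etcd is assumed to be
// the first container in the StatefulSet pod template.
package druid

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func coordinatedResize(
	ctx context.Context,
	c client.Client,
	sts *appsv1.StatefulSet,
	takeFullSnapshot func(context.Context) error,
	resources corev1.ResourceRequirements,
) error {
	// 1. Take an up-to-date full snapshot so a failed restart can be recovered quickly.
	if err := takeFullSnapshot(ctx); err != nil {
		return err
	}
	// 2. Apply the new requests/limits to the etcd container; the StatefulSet
	//    controller then performs the pod restart.
	sts.Spec.Template.Spec.Containers[0].Resources = resources
	if err := c.Update(ctx, sts); err != nil {
		return err
	}
	// 3. The operator would then wait for the pod to become ready (database
	//    verification) before reporting the Etcd resource as ready again.
	return nil
}
```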
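
A sketch of running database restoration as a dedicated Job so that its memory sizing is decoupled from the regular backup tasks; the image, command and resource values below are assumptions.

```go
// Sketch: run database restoration as a dedicated Job so its memory request
// (roughly proportional to the database size) is decoupled from the sidecar's
// regular backup tasks. Image, command and resource values are assumptions.
package druid

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func restorationJob(etcdName, namespace string) *batchv1.Job {
	backoffLimit := int32(0)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      etcdName + "-restore",
			Namespace: namespace,
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "restore",
						Image:   "example.com/etcd-backup-restore:latest", // assumption
						Command: []string{"etcdbrctl", "restore"},         // assumption
						Resources: corev1.ResourceRequirements{
							// Sized for restoration (embedded etcd replaying delta
							// snapshots), independent of the sidecar's backup sizing.
							Requests: corev1.ResourceList{
								corev1.ResourceMemory: resource.MustParse("4Gi"),
							},
						},
					}},
					// A real Job would also mount the etcd data volume and the
					// backup store credentials; omitted here for brevity.
				},
			},
		},
	}
}
```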
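
A sketch of asynchronous backup health verification; the SnapshotStore/Snapshot types and the checksum-based integrity check are assumptions rather than the actual etcd-backup-restore snapstore API.

```go
// Sketch: asynchronous backup health verification. SnapshotStore/Snapshot are
// assumed abstractions (not the actual snapstore API); integrity is checked by
// re-hashing the object and comparing with a checksum recorded at upload time.
package druid

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"time"
)

type Snapshot struct {
	Name           string
	UploadChecksum string // recorded when the snapshot was uploaded (assumption)
}

type SnapshotStore interface {
	List(ctx context.Context) ([]Snapshot, error)
	Open(ctx context.Context, name string) (io.ReadCloser, error)
}

func verifyBackups(ctx context.Context, store SnapshotStore, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		snaps, err := store.List(ctx)
		if err != nil {
			log.Printf("listing backups failed: %v", err)
			continue
		}
		for _, s := range snaps {
			if err := verifySnapshot(ctx, store, s); err != nil {
				// A real controller would surface this via the CRD status/conditions.
				log.Printf("backup %s is unhealthy: %v", s.Name, err)
			}
		}
	}
}

func verifySnapshot(ctx context.Context, store SnapshotStore, s Snapshot) error {
	r, err := store.Open(ctx, s.Name)
	if err != nil {
		return err
	}
	defer r.Close()
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != s.UploadChecksum {
		return fmt.Errorf("checksum mismatch")
	}
	return nil
}
```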
