ETCD database space / quota exceeded, goes into maintenance mode #4005

KashifSaadat · 2017-12-04T11:46:15Z

Kops Version: kops v1.8.0-beta.2
Kubernetes Version: kubernetes v1.8.2
ETCD Version: v3.0.17 (TLS enabled)
Cloud Provider: AWS

Steps to recreate (will take time):

Create a Kubernetes cluster on the versions specified above, using ETCD v3 with a config similar to the one below (I had 5 members configured, but trimmed this spec to keep it short).
Give the cluster some operation time (creating lots of deployments, events, etc.).

  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: main
    version: 3.0.17
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: events
    version: 3.0.17

After some operation time, you may begin to see warnings such as the following in the logs:

kubelet[1495]: W1204 11:17:02.533588    1495 status_manager.go:446] Failed to update status for pod "custom-pod-A": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.542113    1495 status_manager.go:446] Failed to update status for pod "custom-pod-B": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.551753    1495 status_manager.go:446] Failed to update status for pod "canal-hcldk_kube-system(C)": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.557246    1495 status_manager.go:446] Failed to update status for pod "custom-pod-D": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.565505    1495 status_manager.go:446] Failed to update status for pod "custom-pod-E": etcdserver: mvcc: database space exceeded
kubelet[1495]: \"sizeBytes\":746888}]}}" for node "ip-1-2-3-4.aws-region.compute.internal": etcdserver: mvcc: database space exceeded

Check ETCD Status:

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} alarm list
memberID:A alarm:NOSPACE
memberID:B alarm:NOSPACE
memberID:C alarm:NOSPACE

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://localhost:4001 | 670630e06d36fd3c |  3.0.17 |  140 MB |      true |       358 |  120256658 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

~ # df -h | grep "master-vol"
/dev/xvdu         20G  442M   19G   3% /mnt/master-vol-A
/dev/xvdv         20G  419M   19G   3% /mnt/master-vol-B
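
The df output above shows only 3% usage on this member, but other members may differ, so it can be worth flagging any master volume above a usage threshold. A minimal sketch, assuming `df -P` style columns and the `master-vol` mount naming shown above; the helper name `volumes_over` and the 90% threshold are illustrative and not part of kops or etcd:

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads df -P style lines on stdin; "volumes_over" is a hypothetical helper name.
volumes_over() {
  threshold="$1"
  awk -v t="$threshold" '
    /master-vol/ {
      use = $5           # fifth column is the usage percentage, e.g. 3%
      sub(/%/, "", use)  # strip the percent sign
      if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage on a master (assumes the mount naming shown above):
# df -P | volumes_over 90
```

Running this on every master would show quickly whether one member's volume is full while the others are nearly empty.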

According to the ETCD Maintenance Docs, the cluster has gone into a limited-operation maintenance mode, meaning that it will only accept key reads and deletes.

Recovery: history compaction needs to occur (possibly followed by defragmentation to release the freed storage space) before the cluster is operational again; the steps for this are in the docs linked above.

There are options we could supply to etcd via kops that should mitigate this issue and reduce the manual maintenance required (although I don't know enough about etcd to be sure):

EtcdClusterSpec: Allow ETCD_QUOTA_BACKEND_BYTES to be configurable, so a higher value can be set rather than the default of 0 (0 defaults to low space quota)
EtcdClusterSpec: Allow ETCD_AUTO_COMPACTION_RETENTION to be configurable, so it can trigger automatically without user intervention.
- Could have some performance implications?
- If we were to support this, should we default it to be enabled for new clusters?
- Does periodic defragmentation still need to occur?

EDIT: 1 of the 5 nodes had etcd volume maxed out at 100%, due to a dodgy deployment. The other 4 were only 3% utilised as shown in the above log snippets.

Ping @gambol99 @justinsb @chrislovecnm

justinsb · 2017-12-04T14:30:55Z

So the apiserver issues a compaction every 5 minutes (IIRC). I don't understand exactly the cause, but it looks like an etcd bug. Related:

kubernetes/kubernetes#45037
etcd-io/etcd#8009
etcd-io/etcd#7116

@lavalamp asked for a backport and was told "no", but the fix will be in etcd 3.3.

KashifSaadat · 2017-12-04T17:32:39Z

Cheers, that's probably the cause of it then!

With regard to the apiserver issuing a compaction every 5 minutes, shouldn't the other 4 nodes, which still had disk space remaining, have stayed operational? Or maybe we still needed to do the defrag to reclaim the free space on the members and clear the alarms that had triggered?

KashifSaadat · 2017-12-04T17:59:06Z

If anyone runs into the above issue, you can try the very rough recovery steps below that I took (tested on CoreOS).

Run this on each affected member that still has available space on the etcd volume:

export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE}"
# Read the current revision from the endpoint status JSON
rev=$(${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out="json" | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+')
# Compact history up to that revision, defragment to release the space, then clear the NOSPACE alarm
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} compact $rev
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} defrag
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm disarm
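
The revision lookup above relies on pattern-matching the JSON output, so it fails silently if the field is missing. A minimal sketch of that step as a reusable function, assuming the status JSON contains a `"revision":<number>` field as the grep pattern above implies; `extract_revision` is a hypothetical name:

```shell
# Extract the numeric revision from `endpoint status --write-out=json` output on stdin.
# Prints only the first match; prints nothing if no revision field is present.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Usage (with ETCD_CMD and ETCD_ENDPOINT_MAIN set as above):
# rev=$(${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out=json | extract_revision)
# [ -n "$rev" ] || { echo "could not read revision" >&2; exit 1; }
```

Checking that `rev` is non-empty before calling compact avoids running the compaction with a blank argument.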

For any members that have experienced the above mentioned bug, where volume is at 100% (not entirely sure whether steps 2-5 are necessary in all cases):

1. Find the affected member in AWS and terminate the associated ASG and the 2 attached EBS volumes (etcd, etcd-events).
2. On one of the healthy-ish members, get the etcd member list: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member list
3. Remove the dead member (it should have the same tag name as the ASG / instance you deleted): ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member remove <member-id-from-above-command>
4. Add the member back in; it will be in an unstarted state: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member add <etcd-member-name> --peer-urls="https://<etcd-member-name>.internal.${KOPS_CLUSTER_NAME}:2380"
5. Repeat steps 2-4 for ${ETCD_ENDPOINT_EVENTS}. <etcd-member-name> will differ and the port will be 2381 rather than 2380.
6. Run kops update cluster ${KOPS_CLUSTER_NAME} --yes (this will re-create the ASG and volumes).
7. Once the new master has started, ssh into the instance.
8. Run as root: systemctl stop kubelet && systemctl stop protokube
9. Edit both /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest, changing the ETCD_INITIAL_CLUSTER_STATE value to existing.
10. Kill the etcd docker containers: docker kill $(docker ps | grep "etcd" | awk '{print $1}')
11. For both etcd volumes, remove the member dirs:
   - rm -rf /mnt/master-vol-<vol-id-main>/var/etcd/data/member
   - rm -rf /mnt/master-vol-<vol-id-events>/var/etcd/data-events/member
12. Start kubelet: systemctl start kubelet. Wait for the cluster to report healthy again (check etcd member list, kops validate cluster, etc.).
13. Start protokube again: systemctl start protokube
14. Once the cluster is fully healthy, slowly terminate the masters one by one (giving the cluster time to recover in between), to ensure they are all in a clean state.

The above steps were adapted from this guide: https://github.com/kubernetes/kops/blob/master/docs/single-to-multi-master.md#4---add-the-third-master
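
The member add commands for the main and events clusters differ only in the member name and the peer port (2380 vs 2381), so the peer URL can be built by a small helper. A minimal sketch; `peer_url` is a hypothetical name, and the member name in the usage line is a placeholder rather than a real kops member name:

```shell
# Build the peer URL passed to `member add --peer-urls`.
# Arguments: member name, kops cluster name, peer port
# (2380 for main, 2381 for events, per the steps above).
peer_url() {
  echo "https://${1}.internal.${2}:${3}"
}

# Usage with a placeholder member name (substitute your own):
# ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member add <etcd-member-name> \
#   --peer-urls="$(peer_url <etcd-member-name> "${KOPS_CLUSTER_NAME}" 2380)"
```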

KashifSaadat · 2018-02-02T18:01:25Z

v3.3.0 has officially been released. The following PR should correct issues with logging and pick up version changes for a rolling update: #4371

I'll be testing this out and will see how it goes!

KashifSaadat · 2018-03-02T16:26:44Z

Tempted to close this issue now. ETCD v3.3.0 appears to resolve it: I'm running a cluster on the newer version (including the PR referenced above) and haven't noticed any problems so far.

Just a note, with kops you'll need to define the new version as follows in your kops spec:

  etcdClusters:
  - etcdMembers:
     ...
    enableEtcdTLS: true
    image: gcr.io/etcd-development/etcd:v3.3.0
    name: main
    version: 3.3.0

The version field doesn't need to be identical to the image, so long as it's 3.x.x.

@justinsb anything more you think we need to do here, or happy to close this?

fejta-bot · 2018-05-31T17:22:19Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2018-06-30T18:07:44Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot · 2018-07-30T18:55:51Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

voigt · 2019-11-14T16:24:55Z

Run this on each of the members affected, which still have available space on the etcd volume:

export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE}"
rev=`${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'`
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} compact $rev
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} defrag
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm disarm

Sorry to resurrect this old issue. I just fell into it.

When I ran etcdctl [...] defrag, I always got this error:
Failed to defragment etcd member[https://127.0.0.1:4001] (context deadline exceeded)

Setting the flag --command-timeout=120s solved this issue for me.

Hope that I could save someone some time.
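
The timeout fix above can be folded into the recovery commands. A minimal sketch, assuming the ETCD_CMD and ETCD_ENDPOINT_MAIN variables from the recovery snippet; `defrag_cmd` is a hypothetical helper that only builds the command line so it can be inspected before running, and 120s is simply the value reported to work above:

```shell
# Build the defrag command line with a longer client timeout.
# --command-timeout is the etcdctl flag mentioned above; the default here is 120s.
defrag_cmd() {
  etcd_cmd="$1"; endpoint="$2"; timeout="${3:-120s}"
  echo "${etcd_cmd} --command-timeout=${timeout} --endpoints ${endpoint} defrag"
}

# Usage: print the command first, then run it once it looks right.
# defrag_cmd "${ETCD_CMD}" "${ETCD_ENDPOINT_MAIN}"
# eval "$(defrag_cmd "${ETCD_CMD}" "${ETCD_ENDPOINT_MAIN}")"
```

A larger database may need a longer timeout, which can be passed as the third argument.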

jsonmp-k8 · 2020-10-14T14:50:14Z

Does kops support the --quota-backend-bytes param for etcd?

olemarkus · 2020-10-15T08:13:41Z

You can specify which ENV vars to pass on to etcd: https://kops.sigs.k8s.io/cluster_spec/#etcdclusters
So you just have to set ETCD_QUOTA_BACKEND_BYTES there.
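
ETCD_QUOTA_BACKEND_BYTES takes a value in bytes, so it can help to compute it rather than type it out. A minimal sketch; `gib_to_bytes` is a hypothetical helper, and the 4 GiB figure is only an example, not a recommendation from this thread:

```shell
# Convert a size in GiB to bytes for use as ETCD_QUOTA_BACKEND_BYTES.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Example: a 4 GiB quota
ETCD_QUOTA_BACKEND_BYTES="$(gib_to_bytes 4)"
echo "ETCD_QUOTA_BACKEND_BYTES=${ETCD_QUOTA_BACKEND_BYTES}"
```

The resulting number is what would go into the env var entry described in the cluster spec docs linked above.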

jsonmp-k8 · 2020-10-15T12:38:06Z

Thanks @olemarkus

k8s-ci-robot added the lifecycle/stale label May 31, 2018

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 30, 2018

k8s-ci-robot closed this as completed Jul 30, 2018

zapman449 mentioned this issue Oct 13, 2018: etcd compaction stopped. #5936 (closed)

kvaps mentioned this issue Oct 25, 2021: etcdserver: mvcc: database space exceeded etcd-io/etcd#11947 (closed)
