
etcd memory usage spikes to unsustainable levels and OOMs #14362

Open
Uzay-G opened this issue Aug 19, 2022 · 10 comments

Uzay-G commented Aug 19, 2022

What happened?

Whenever I run my etcd container, its memory usage slowly climbs to the Docker limit on my machine (~7.6 GB) and then it OOMs. I've investigated and believe this is because it has created a large number of WAL files, well above the configured limit (5), and it breaks down when it has to process them.

I've attached etcd docker logs.

etcdlogs.txt

What did you expect to happen?

etcd would run normally.

How can we reproduce it (as minimally and precisely as possible)?

I had inserted a lot of data, and now when I launch etcd it apparently cannot handle the saved .wal files.

Anything else we need to know?

ls -has on the etcd wal directory:

total 4.1G
4.0K .					     62M 000000000000007b-0000000000050618.wal	 62M 0000000000000092-0000000000056f14.wal
4.0K ..					     62M 000000000000007c-0000000000050b41.wal	 62M 0000000000000093-00000000000573a7.wal
 62M 0.tmp				     62M 000000000000007d-0000000000051024.wal	 62M 0000000000000094-0000000000057988.wal
 62M 0000000000000067-0000000000049181.wal   62M 000000000000007e-000000000005160d.wal	 62M 0000000000000095-0000000000057d93.wal
 62M 0000000000000068-00000000000497f2.wal   62M 000000000000007f-0000000000051baf.wal	 62M 0000000000000096-0000000000058198.wal
 62M 0000000000000069-0000000000049deb.wal   62M 0000000000000080-00000000000520fa.wal	 62M 0000000000000097-00000000000585b8.wal
 62M 000000000000006a-000000000004a37d.wal   62M 0000000000000081-0000000000052873.wal	 62M 0000000000000098-00000000000589c5.wal
 62M 000000000000006b-000000000004aa3e.wal   62M 0000000000000082-0000000000052e7d.wal	 62M 0000000000000099-0000000000058d89.wal
 62M 000000000000006c-000000000004af7c.wal   62M 0000000000000083-000000000005333f.wal	 62M 000000000000009a-000000000005913b.wal
 62M 000000000000006d-000000000004b58c.wal   62M 0000000000000084-000000000005380b.wal	 62M 000000000000009b-000000000005948f.wal
 62M 000000000000006e-000000000004bae9.wal   62M 0000000000000085-0000000000053c20.wal	 62M 000000000000009c-0000000000059815.wal
 62M 000000000000006f-000000000004c0d3.wal   62M 0000000000000086-0000000000054004.wal	 62M 000000000000009d-0000000000059c02.wal
 62M 0000000000000070-000000000004c6af.wal   62M 0000000000000087-00000000000543ef.wal	 62M 000000000000009e-000000000005a010.wal
 62M 0000000000000071-000000000004ccb6.wal   62M 0000000000000088-00000000000547fe.wal	 62M 000000000000009f-000000000005a3c1.wal
 62M 0000000000000072-000000000004d2e7.wal   62M 0000000000000089-0000000000054b58.wal	 62M 00000000000000a0-000000000005a7a5.wal
 62M 0000000000000073-000000000004d82a.wal   62M 000000000000008a-0000000000054f52.wal	 62M 00000000000000a1-000000000005ab92.wal
 62M 0000000000000074-000000000004de82.wal   62M 000000000000008b-00000000000552a1.wal	 62M 00000000000000a2-000000000005b000.wal
 62M 0000000000000075-000000000004e429.wal   62M 000000000000008c-00000000000555ed.wal	 62M 00000000000000a3-000000000005b37c.wal
 62M 0000000000000076-000000000004e9d9.wal   62M 000000000000008d-00000000000559a6.wal	 62M 00000000000000a4-000000000005b744.wal
 62M 0000000000000077-000000000004ef8e.wal   62M 000000000000008e-0000000000055d8f.wal	 62M 00000000000000a5-000000000005bb88.wal
 62M 0000000000000078-000000000004f5db.wal   62M 000000000000008f-0000000000056252.wal	 62M 00000000000000a6-000000000005bf4b.wal
 62M 0000000000000079-000000000004fb26.wal   62M 0000000000000090-000000000005666c.wal	 62M 00000000000000a7-000000000005c32d.wal
 62M 000000000000007a-00000000000500bc.wal   62M 0000000000000091-0000000000056a5b.wal	 62M 00000000000000a8-000000000005c6d5.wal

Etcd version (please run commands below)

etcd v3.5.0 from docker

Etcd configuration (command line flags or environment variables)

      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_ENABLE_PPROF=true
      - ETCD_SNAPSHOT_COUNT=50000
      - ETCD_MAX_SNAPSHOTS=2
      - ETCD_MAX_WALS=5

    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

I cannot run these commands because etcd hangs and crashes.

Relevant log output

No response

ahrtr commented Aug 22, 2022

The OOM is a known issue; the workarounds AFAIK are:

  1. perform compaction & defragmentation to reclaim the db file space (see the example commands below);
  2. avoid big range operations on the client side.
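For workaround 1, a minimal sketch of the compaction + defragmentation steps, roughly following the space-quota section of the etcd maintenance guide. It reuses the milvus-etcd container name and client URL shown later in this thread, and it only works while etcd is still responsive:

  # Get the current revision from the endpoint status JSON
  rev=$(sudo docker exec milvus-etcd etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=json \
        | grep -o '"revision":[0-9]*' | grep -o '[0-9].*')

  # Compact away all history older than the current revision
  sudo docker exec milvus-etcd etcdctl --endpoints=http://127.0.0.1:2379 compaction "$rev"

  # Defragment so the freed space is returned to the filesystem
  sudo docker exec milvus-etcd etcdctl --endpoints=http://127.0.0.1:2379 defrag

  # If a space-quota alarm was raised, disarm it afterwards
  sudo docker exec milvus-etcd etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm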

With respect to the WAL files, it does look like an issue: the max WAL count is 5, but there are lots of WAL files, which means the old WAL files failed to be purged. I checked the log file you attached but did not see anything useful; most likely what you attached isn't the complete log. Please try to reproduce the issue and attach the complete log if possible.

Uzay-G commented Aug 23, 2022

Can I just delete the WAL files?

ahrtr commented Aug 23, 2022

Can I just delete the WAL files?

It isn't recommended to manually delete the WAL files; otherwise the WAL files may no longer match the snap files. Please try to reproduce the issue and provide the complete log. If you are interested, please try to figure out why etcd failed to purge the WAL files automatically.

Uzay-G commented Aug 27, 2022

When I try getting debug info:

halcyon@espial:~/milvus$ sudo docker exec milvus-etcd etcdctl member list -w table
{"level":"warn","ts":"2022-08-27T05:34:10.842Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002f2000/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

When I try to get the logs:

{"level":"info","ts":1661578407.7117872,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"level":"info","ts":1661578407.722684,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"1000"}
{"level":"info","ts":1661578407.7227707,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_PPROF","variable-value":"true"}
{"level":"info","ts":1661578407.7228498,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"2"}
{"level":"info","ts":1661578407.7228956,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"5"}
{"level":"info","ts":1661578407.7229593,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"4294967296"}
{"level":"info","ts":1661578407.7230117,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"50000"}
{"level":"info","ts":"2022-08-27T05:33:27.723Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","-advertise-client-urls=http://127.0.0.1:2379","-listen-client-urls","http://0.0.0.0:2379","--data-dir","/etcd"]}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/etcd","dir-type":"member"}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:598","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2022-08-27T05:33:27.740Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"default","data-dir":"/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":50000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":4294967296,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"warn","ts":1661578407.740217,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2022-08-27T05:33:30.958Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/etcd/member/snap/db","took":"3.207592575s"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":300003,"snapshot-size":"7.1 kB"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:518","msg":"recovered v3 backend from snapshot","backend-size-bytes":123691008,"backend-size":"124 MB","backend-size-in-use-bytes":68882432,"backend-size-in-use":"69 MB"}

I'm trying to get more info, but etcd hogs memory and I can't even use my system properly.
Does this help at all? Is there any way I can just reset it so that it keeps the existing data and ignores the problematic state?

ahrtr commented Oct 12, 2022

The huge memory usage might be caused by the db file size.

What's the size of the db file, which is located in ${DATA_DIR}/member/snap/db? Please try to perform compaction + defragmentation per the guide: https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota
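For reference, a minimal sketch for checking the db file size, reusing the /etcd data dir and milvus-etcd container name from this issue (du may not be present in a minimal image, in which case check the host-mounted directory instead):

  # Backend db file size, checked inside the container
  sudo docker exec milvus-etcd du -h /etcd/member/snap/db

  # Or ask etcd itself (only if it responds); the DB SIZE column reports the backend size
  sudo docker exec milvus-etcd etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=table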

ahrtr commented Oct 12, 2022

I just added more logging to help debug why etcd fails to purge WAL files.

#14576

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Mar 18, 2023
stale bot removed the stale label May 10, 2023

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

sxllwx commented Aug 5, 2024

/sub

@No5stranger

/mark
