OOM leading to stuck servers #2254

Closed
Levatius opened this issue Mar 22, 2018 · 19 comments

Labels
kind/bug Something is broken.

Comments

@Levatius

Levatius commented Mar 22, 2018

Cluster info:

  • Kubernetes: 8 nodes total, each 32GB memory.
  • Dgraph: 1 zero and 3 servers, each runs on a separate node. Running master image. Memory flag for servers set to 24GB.

Observed issue:
Memory usage increases linearly on each server. When a server hits the request limit (30GB) it restarts. At first this recovers fine, but around the 3rd spike it fails to restart and gets stuck.
There are two ways the pipeline can be stuck:

  1. Incoming mutations time out
  2. No connection to the servers can be made

These scenarios seem mutually exclusive (if no connection can be made, no mutation can be attempted).

Questions:

  • Why does memory usage continue past the 24GB flag?
  • Is 24GB not enough for each server or is it failing to write items to disk?

Example:
Server 0 gets stuck at the 3rd memory spike.
logs_server-0.txt
server-0
logs_server-1.txt
server-1
logs_server-2.txt
server-2
logs_zero.txt

Unfortunately, I had some trouble getting the heap profile; it was returning a malformed HTTP response. It is probably something at our end; I will try again.
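
For reference, this is roughly how we are trying to fetch the profiles (a sketch: the hostname below is a placeholder, and we are assuming the server exposes /debug/pprof on its default HTTP port, 8080):

# save the raw heap profile for sharing (hostname is a placeholder)
curl -o server-0-heap.gz http://dgraph-server-0.dgraph-server:8080/debug/pprof/heap
# or inspect it interactively
go tool pprof http://dgraph-server-0.dgraph-server:8080/debug/pprof/heap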

@Levatius
Author

@jimanvlad @pawanrawal

@Levatius
Author

I could get heap profiles for server-1 and server-2:
server-1.gz
server-2.gz
However, server-0 (the one that is stuck) is in a state such that pprof does not function:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

Important:
It seems that Kubernetes gives up on trying to restore the pod after the 3rd spike, claiming that the system is OOM. If we could understand why the servers are going over their allotted resources and resolve that, I think the pipeline would remain unstuck.
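
As a sanity check (a sketch; the pod name is a placeholder), the OOM kill should show up in the pod's last state:

kubectl describe pod dgraph-server-0 | grep -A 3 "Last State"
# expecting Reason: OOMKilled if the container exceeded its memory limit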

@Levatius
Author

This is the exact command we use to spawn our servers:
dgraph server --my=$(hostname -f):7080 --memory_mb 24576 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 --posting_tables memorymap

@pawanrawal pawanrawal self-assigned this Mar 22, 2018
@pawanrawal pawanrawal added the investigate Requires further investigation label Mar 22, 2018
@pawanrawal
Contributor

pawanrawal commented Mar 23, 2018

One thing to note is that memory_mb doesn't limit the total amount of memory the server can use. In fact, we are going to rename it to lru_mb (see #2245). It is the size of the LRU cache maintained by Dgraph, so if your node has 32GB of memory, I'd suggest setting memory_mb to something in the 10-15GB range.
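
A sketch of the adjusted command, based on the one you posted above (12288 is just an example value in that range):

dgraph server --my=$(hostname -f):7080 --memory_mb 12288 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 --posting_tables memorymap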

I can see from the heap profile that there are a lot of entries in the LRU cache which are causing it to go OOM.

@pawanrawal
Contributor

Another interesting thing is that snapshotting still gets blocked at some point. I could see these logs in logs_server-2.txt.

2018/03/22 14:14:57 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [28973]
2018/03/22 14:15:27 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [42551]
2018/03/22 14:15:57 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [56633]
2018/03/22 14:16:27 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [69911]
2018/03/22 14:16:57 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [83361]
2018/03/22 14:17:27 draft.go:682: Couldn't take snapshot, txn watermark: [7039], applied watermark: [97449]

Interestingly, this gets fixed after a restart. I am checking why this would happen in the first place.

@pawanrawal
Contributor

I have not been able to reproduce the Couldn't take snapshot issue with the current master. I do see it in the logs sometimes while running locally, which is OK because a node is able to recover from it, and I don't see it stuck at the same txn watermark for long.

Can you try with the master image and see if you still get it? If you do, then some steps to reproduce would be helpful.

@manishrjain manishrjain added kind/question Something requiring a response. kind/bug Something is broken. and removed investigate Requires further investigation labels Mar 29, 2018
@pawanrawal
Contributor

I have a docker image at dgraph/dgraph:dev which prints more logs. It would be great if you could run your workload with this image and share some logs from when this issue happens.
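
If you are switching the image on Kubernetes, something like this rough sketch should do (the statefulset and container names are placeholders for whatever your manifests use):

kubectl set image statefulset/dgraph-server server=dgraph/dgraph:dev
# the pods restart one by one and come back up on the :dev image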

@manishrjain
Contributor

@Levatius: Can you please try this out, before we close this issue for inactivity?

@Levatius
Author

Levatius commented Apr 3, 2018

Apologies for the inactivity; we will run a test today with the :dev image.

@Levatius
Author

Levatius commented Apr 3, 2018

Hi again,

We found that the :dev image was creating too many logs to send effectively (so we are using :master instead).

We ran into an old issue related to a restarted server (server-2) being unable to become the leader of its group:

Logs:
logs_zero.txt
logs_server-0.txt
logs_server-1.txt
logs_server-2.txt

Heaps:
server-0.gz
server-1.gz
server-2.gz

@pawanrawal
Contributor

pawanrawal commented Apr 4, 2018

I can see the Couldn't take snapshot errors, which are what the debug logs in the dev image are intended to help diagnose. This error ultimately causes the node to crash and restart (which takes time, since all pending entries that couldn't be snapshotted have to be replayed).

2018/04/03 14:41:52 draft.go:696: Couldn't take snapshot, txn watermark: [299315], applied watermark: [731475]
2018/04/03 14:42:22 draft.go:696: Couldn't take snapshot, txn watermark: [299315], applied watermark: [737387]
2018/04/03 14:42:52 draft.go:696: Couldn't take snapshot, txn watermark: [299315], applied watermark: [748053]
2018/04/03 14:43:22 draft.go:696: Couldn't take snapshot, txn watermark: [299315], applied watermark: [754792]
2018/04/03 14:43:52 draft.go:696: Couldn't take snapshot, txn watermark: [299315], applied watermark: [766448]

Can you confirm the dgraph version you ran this with? Did you have to change memory_mb to lru_mb? This was a recent change in master. I just want to make sure that you are not using a cached docker image.

I am trying to load our 21M RDF dataset on a Kubernetes cluster deployed on EC2 to try and reproduce this. In the meantime, if you could share a way to replicate this, or logs from the dev image for even 1 hour of the run (until the error starts appearing and stays for 5-10 minutes), that would be helpful.

@Levatius
Author

Levatius commented Apr 4, 2018

We did have to change memory_mb to lru_mb in that last test (implying it was indeed :master).

We are running the :dev image at the moment and have yet to experience any dgraph-server restarts (~4h runtime). There does, however, seem to be a noticeable slowdown in throughput when using this image; is this due to the additional logging?

I will leave the test running until it encounters an issue. Once it does, I will post back here with a snapshot of the logs.

@pawanrawal
Contributor

There does, however, seem to be a noticeable slowdown in throughput when using this image; is this due to the additional logging?

Yup, it's due to the logging for sure; nothing else changed. It could be that the additional logging is masking the issue. I can probably move the logging to a file, which should be faster, and share an updated image.

@pawanrawal
Contributor

I have updated the dgraph/dgraph:dev docker image. It logs the debug logs to a separate file for both zero and server. It would be great if you could run it and share the debug logs.

@pawanrawal
Contributor

Ran the nightly binary for 75 mins on a cluster with 3 nodes serving 3 different groups, loading 40M RDFs using dgraph live, and didn't notice anything suspicious.
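
The invocation was roughly the sketch below (addresses and the file name are placeholders, and the flag names may differ between releases; check dgraph live --help):

dgraph live -r <dataset>.rdf.gz -d <server-host>:9080 -z <zero-host>:5080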

@Levatius
Author

Levatius commented Apr 9, 2018

We ran the previous :dev image over the weekend; the last 1000 log entries for each server look like this:

2018/04/09 15:23:56 draft.go:808: Couldn't take snapshot, txn watermark: [673443], applied watermark: [2425895]

None of the server pods or zero pod restarted but it is unable to insert anymore data (mutates are timing out).

We will redeploy with the latest :dev image.

@pawanrawal
Contributor

pawanrawal commented Apr 10, 2018

2018/04/09 15:23:56 draft.go:808: Couldn't take snapshot, txn watermark: [673443], applied watermark: [2425895]

See if it happens with the latest dev image. If it does, please share the debug logs created by Server and Zero, along with the logs from stdout.
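
If it helps, something along these lines should collect them from Kubernetes (pod names are placeholders, and <debug-log-path> stands for wherever the :dev image writes its debug file):

# stdout logs
kubectl logs dgraph-server-0 > logs_server-0.txt
# debug log file written inside the container
kubectl cp dgraph-server-0:<debug-log-path> ./debug_server-0.log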

@manishrjain
Contributor

Looks like there's been no activity on this for a while. If you run with master, do you still see this issue?

@manishrjain manishrjain added verify-fix and removed kind/question Something requiring a response. labels Jun 14, 2018
@manishrjain
Contributor

Just pushed #2448 as well, which should allow decreasing RAM usage further. You can set --badger.vlog=disk to serve the value log from disk, which decreases RAM consumption. Closing this now.
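
For reference, a sketch of the server command from earlier in this thread with that option applied (assuming the memory flag is now lru_mb per the rename discussed above; 12288 is just an example value):

dgraph server --my=$(hostname -f):7080 --lru_mb 12288 --badger.vlog=disk --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080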
