OOM leading to stuck servers #2254
I could get heap profiles for server-1 and server-2:
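For context, heap profiles like these can usually be pulled straight from a running server's Go pprof endpoint. A minimal sketch, assuming the default HTTP port (8080) and a reachable hostname of server-1 (both are assumptions, not the exact setup from this thread):

```sh
# Summarize the in-use heap directly from the pprof endpoint.
go tool pprof -top -inuse_space http://server-1:8080/debug/pprof/heap

# Or save the raw profile so it can be attached to the issue and inspected later.
curl -sS -o server-1-heap.pb.gz http://server-1:8080/debug/pprof/heap
```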
Important:
This is the exact command we use to spawn our servers:
One thing I can see from the heap profile is that there are a lot of entries in the LRU cache, which is what is causing it to go OOM.
Another interesting thing is that at some point snapshotting still gets blocked. I could see these logs on
Interestingly, this gets fixed after a restart. I am checking why this would happen in the first place.
I have not been able to reproduce the issue. Can you try with the
I have a docker image at
@Levatius: Can you please try this out before we close this issue for inactivity?
Apologies for the inactivity; I will run a test today with the :dev image.
Hi again,
We found that the :dev image was creating too many logs to send effectively (we are using :master instead). We ran into an old issue related to a restarted server (server-2) being unable to become the leader of its group.
Logs:
Heaps:
I can see the
Can you confirm the
I am trying to load our 21M RDF dataset on a Kubernetes cluster deployed on EC2 to try and reproduce this. In the meantime, if you could share a way to replicate this, or share logs from the dev image for even 1 hour of the run (until the error starts appearing and stays for 5-10 minutes), that would be helpful.
We did indeed have to change
We are running the
I will leave the test running until it encounters an issue. Once it does, I will post back here with a snapshot of the logs.
Yup, it's due to logging for sure; nothing else changed. It could be that the additional logging is masking the issue. I can probably move the logging to a file, which should be faster, and share an updated image.
I have updated the
Ran the nightly binary for 75 mins on a cluster with 3 nodes serving 3 different groups, loading 40M RDFs using dgraph live, and didn't notice anything suspicious.
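For reference, a bulk load like the one described above would look roughly like this with a v1.0-era dgraph live; the file name, addresses, and concurrency are placeholders, and newer releases use different flag names:

```sh
# Load an RDF dataset through a server (gRPC :9080) with a zero at :5080.
dgraph live -r 21million.rdf.gz -d server-1:9080 -z zero-0:5080 -c 100
```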
We ran the previous
None of the server pods or the zero pod restarted, but it is unable to insert any more data (mutates are timing out). We will redeploy with the latest
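When mutations start timing out like this, one quick check that is independent of the loader is to send a single small mutation over the HTTP API and see whether it commits; a sketch assuming the v1.0-style /mutate endpoint on port 8080 (host name and predicate are placeholders):

```sh
# Probe mutation with a 30s client-side timeout; a hang here points at the
# cluster rather than the load pipeline.
curl -sS -m 30 -X POST -H 'X-Dgraph-CommitNow: true' http://server-1:8080/mutate -d '
{
  set {
    _:probe <name> "mutate-probe" .
  }
}'
```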
See if it happens with the latest
Looks like there's been no activity on this for a while. If you run with master, do you still see this issue? |
Just pushed #2448 as well, which should allow decreasing the RAM usage further. You can set
Cluster info:
Observed issue:
Linear increase in memory usage for each server. When a server hits its memory limit (30GB) it restarts. This at first recovers fine, but around the 3rd spike it fails to restart and gets stuck.
There are two ways the pipeline can be stuck:
These scenarios seem mutually exclusive (if no connection can be made no mutate can be attempted).
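One way to narrow this down is to confirm from Kubernetes whether the stuck server was actually OOM-killed at the limit or terminated for some other reason; a sketch with placeholder pod names:

```sh
# Shows the last termination reason (e.g. OOMKilled) and exit code for the pod.
kubectl describe pod dgraph-server-0 | grep -A5 'Last State'

# Same information as a single field.
kubectl get pod dgraph-server-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```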
Questions:
Example:
Server 0 gets stuck at the 3rd memory spike.
logs_server-0.txt
logs_server-1.txt
logs_server-2.txt
logs_zero.txt
Unfortunately, I had some trouble getting the heap profile; it was returning a malformed HTTP response. It is probably something on our end at fault; I will try again.
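If it helps, one thing worth ruling out is requesting the profile from the gRPC port (9080) instead of the HTTP port (8080), since a non-HTTP listener tends to produce exactly this kind of error; the endpoint also returns a binary (gzipped protobuf) profile, so saving it to a file is safer than viewing it inline. A sketch with assumed host and port:

```sh
# Save the raw profile, then summarize it offline.
curl -sS -o server-0-heap.pb.gz http://server-0:8080/debug/pprof/heap
go tool pprof -top server-0-heap.pb.gz   # pass the dgraph binary first if symbols are missing
```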