
Hanging server issue via gRPC #2231

Closed
emhagman opened this issue Mar 14, 2018 · 13 comments
Labels
kind/question Something requiring a response.

@emhagman

emhagman commented Mar 14, 2018

Sometimes my node fails without any error and becomes unresponsive via gRPC. Restarting seems to fix it. I have 3 replica nodes with one "zero" instance. When querying via Ratel over HTTP, the same query succeeds.

  • What version of Dgraph are you using? 1.0.4
  • Have you tried reproducing the issue with latest release? Yes
  • What is the hardware spec (RAM, OS)? Ubuntu 16.04, Docker, 32GB Ram
  • Steps to reproduce the issue (command/config used to run Dgraph). Not sure
  • Expected behaviour and actual result. Not to stall

Logs here:
https://drive.google.com/file/d/1lYDTgMyATApiSVRV8rRsDwRLYNNg9tCW/view?usp=sharing

@pawanrawal pawanrawal self-assigned this Mar 14, 2018
@pawanrawal pawanrawal added the investigate Requires further investigation label Mar 14, 2018
@pawanrawal
Contributor

Looking at the logs, I see a large number of pending RAFT entries when a server restarts. Can you please upgrade to v1.0.4? What client do you use to load your data? I assume all three Dgraph servers are running on the same physical machine?

server2_1  | 2018-03-14T21:47:47.878081577Z 2018/03/14 21:47:47 node.go:258: Group 1 found 272374 entries

Are all the restarts manual? I have a hunch that since a snapshot isn't happening, memory grows until the container gets killed by Docker and restarted.

@emhagman
Author

emhagman commented Mar 15, 2018

@pawanrawal Will do. I mistakenly used dgraph:latest and Docker was caching the 1.0.3 image. I am using the dgraph-js@1.1.1 client. I am manually restarting the server when it stops responding, yes. Is there a reason why the snapshot is not occurring?

We are using the following command to load the RDF file:
dgraph bulk -r drafted.rdf -s drafted.schema --map_shards=2 --reduce_shards=1 --zero=zero:3082
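
For context, the gRPC path that stops responding is just an ordinary dgraph-js query. A minimal sketch of that call (the server address and query are placeholders, not our real ones):

// Minimal dgraph-js query over gRPC (sketch only; address and query are placeholders).
import * as dgraph from "dgraph-js";
import * as grpc from "grpc";

async function runQuery(): Promise<void> {
  const stub = new dgraph.DgraphClientStub(
    "server:9080",
    grpc.credentials.createInsecure()
  );
  const client = new dgraph.DgraphClient(stub);

  const res = await client
    .newTxn()
    .query(`{ q(func: has(name), first: 5) { uid name } }`);
  console.log(res.getJson()); // when the node hangs, this call never resolves

  stub.close();
}

runQuery().catch(console.error);

The same query through Ratel over HTTP keeps working while this gRPC call hangs.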

@emhagman
Author

emhagman commented Mar 15, 2018

AHHH!!!

server2_1  | 2018/03/15 02:25:43 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:1229 Term:2 XXX_unrecognized:[]}, len(data): 25
server_1   | 2018/03/15 02:25:45 draft.go:669: Writing snapshot at index: 31456, applied mark: 31389
server_1   | 2018/03/15 02:25:45 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:31456 Term:2 XXX_unrecognized:[]}, len(data): 24
server3_1  | 2018/03/15 02:25:46 draft.go:669: Writing snapshot at index: 31456, applied mark: 31389
server3_1  | 2018/03/15 02:25:46 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:31456 Term:2 XXX_unrecognized:[]}, len(data): 25
zero_1     | 2018/03/15 02:25:49 oracle.go:85: purging below ts:44255, len(o.commits):129652, len(o.aborts):3969, len(o.rowCommit):0

I assume this is good news?

@pawanrawal
Contributor

A snapshot not occurring can happen when there are pending transactions that were neither committed nor aborted by the client. We added a mechanism on the server in v1.0.4 to abort such old pending transactions.
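
On the client side, the way to avoid leaving transactions pending is to make sure every transaction ends in a commit or a discard. A sketch of that pattern with dgraph-js (illustrative only, not taken from your code; the data is a placeholder):

import * as dgraph from "dgraph-js";

// A transaction that is neither committed nor discarded stays pending on the
// server and can hold back snapshots, so always discard in a finally block.
async function addPerson(client: dgraph.DgraphClient): Promise<void> {
  const txn = client.newTxn();
  try {
    const mu = new dgraph.Mutation();
    mu.setSetJson({ name: "Alice" }); // placeholder data
    await txn.mutate(mu);
    await txn.commit();               // success path
  } finally {
    await txn.discard();              // no-op after a successful commit, aborts otherwise
  }
}

Calling discard() in the finally block is safe because it is a no-op once the transaction has been committed.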

@emhagman
Author

emhagman commented Mar 15, 2018

Gotcha, makes sense.

server_1   | 2018/03/15 02:26:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:26:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:26:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:26:45 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:26:46 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:27:13 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:27:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:27:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:27:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:27:45 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:27:46 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:28:13 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:28:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:28:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:28:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]

After upgrading to 1.0.4, I am seeing that a lot.

@emhagman
Author

@pawanrawal I rebuilt my graph and finally got the snapshotting to work. I waited a while and made several queries to the "leader" before starting up the second and third replica servers, and now it is copying snapshots between servers.

My RDF file is close to 800MB. Is there a way to tell when a server is ready to be copied from? It seems that if I start the other replicas too early, the snapshot doesn't get copied over at all and a new p directory is created instead.

@pawanrawal
Contributor

That sounds like a bug; I would expect it to copy the snapshot over to the follower if it's already the leader. Feel free to file another issue mentioning this and we will get it checked and resolved.

@emhagman
Author

emhagman commented Mar 16, 2018

Filed another issue, #2236. Any progress on this one? For now I upgraded our server to 64GB to try to prevent an OOM from the snapshots not copying, in case this happens again.

@pawanrawal
Contributor

pawanrawal commented Mar 19, 2018

The snapshotting issue shouldn't occur on master. Try the dgraph/dgraph:master image and see if the server still becomes unresponsive.

The slowness could be due to the fact that Docker is trying to use swap. Do you have any limits on the memory used by the container in your docker-compose config?


server2_1 | 2018/03/15 02:28:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]

What is weird here is that the applied watermark is less than the txn watermark, which shouldn't happen. I couldn't reproduce it with the directory you shared. If it still happens for you, can you share the steps to reproduce, or the directories from the other servers as well?

@pawanrawal
Contributor

@emhagman /ping

@manishrjain manishrjain added investigate Requires further investigation kind/question Something requiring a response. labels Mar 22, 2018
@emhagman
Author

emhagman commented Mar 29, 2018

@pawanrawal I am getting this issue again, even after the graphs synced properly via snapshots in the beginning. I have more data from a new instance if you want to look at it.

Logs example:

server3_1  | 2018/03/29 17:17:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:18:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:18:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:19:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:19:29 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:19:32 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:19:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:19:59 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:20:02 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:20:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:20:29 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:20:32 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:20:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:20:59 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:21:02 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]

I upped our memory to 64GB and the server has reached 46GB of usage now. At some point it's going to be unusable... I am now starting to get write timeouts, or writes that just take forever.

@manishrjain manishrjain added this to the Sprint-001 milestone Apr 3, 2018
@pawanrawal
Contributor

pawanrawal commented Apr 3, 2018

Hey @emhagman, can you try upgrading to the nightly? Another issue (#2266) regarding snapshots with replicas was fixed, and I don't think it's part of the binary that you have. I can confirm if you share the output of dgraph version.

@manishrjain manishrjain removed the investigate Requires further investigation label Apr 9, 2018
@manishrjain
Contributor

Closing due to no activity. Feel free to create a new issue if you find a related bug.
