
Hanging server issue via gRPC #2231

Closed
emhagman opened this issue Mar 14, 2018 · 13 comments
Labels
kind/question Something requiring a response.

@emhagman

emhagman commented Mar 14, 2018

Sometimes my node fails without any error and becomes unresponsive via gRPC. Restarting seems to fix it. I have 3 replica nodes with one "zero" instance. When querying via Ratel over HTTP, the same query succeeds.

  • What version of Dgraph are you using? 1.0.4
  • Have you tried reproducing the issue with latest release? Yes
  • What is the hardware spec (RAM, OS)? Ubuntu 16.04, Docker, 32GB Ram
  • Steps to reproduce the issue (command/config used to run Dgraph). Not sure
  • Expected behaviour and actual result. Not to stall

Logs here:
https://drive.google.com/file/d/1lYDTgMyATApiSVRV8rRsDwRLYNNg9tCW/view?usp=sharing

@pawanrawal pawanrawal self-assigned this Mar 14, 2018
@pawanrawal pawanrawal added the investigate Requires further investigation label Mar 14, 2018
@pawanrawal
Contributor

Looking at the logs, I see a large number of pending RAFT entries when a server restarts. Can you please upgrade to v1.0.4? What client do you use to load your data? I assume all three Dgraph servers are running on the same physical machine?

server2_1  | 2018-03-14T21:47:47.878081577Z 2018/03/14 21:47:47 node.go:258: Group 1 found 272374 entries

Are all the restarts manual? I have a hunch that since a snapshot isn't happening, memory grows until the container gets killed by Docker and restarted.

@emhagman
Author

emhagman commented Mar 15, 2018

@pawanrawal Will do. I mistakenly used dgraph:latest and Docker was caching the 1.0.3 image. I am using the dgraph-js@1.1.1 client. I am manually restarting the server when it stops responding, yes. Is there a reason why the snapshot is not occurring?

We are using the following command to load the RDF file:
dgraph bulk -r drafted.rdf -s drafted.schema --map_shards=2 --reduce_shards=1 --zero=zero:3082
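
For context, the gRPC path that stops responding is just an ordinary dgraph-js query. A minimal sketch of that call (the server address and query are placeholders, not our real ones):

// Minimal dgraph-js query over gRPC (sketch only; address and query are placeholders).
import * as dgraph from "dgraph-js";
import * as grpc from "grpc";

async function runQuery(): Promise<void> {
  const stub = new dgraph.DgraphClientStub(
    "server:9080",
    grpc.credentials.createInsecure()
  );
  const client = new dgraph.DgraphClient(stub);

  const res = await client
    .newTxn()
    .query(`{ q(func: has(name), first: 5) { uid name } }`);
  console.log(res.getJson()); // when the node hangs, this call never resolves

  stub.close();
}

runQuery().catch(console.error);

The same query through Ratel over HTTP keeps working while this gRPC call hangs.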

@emhagman
Author

emhagman commented Mar 15, 2018

AHHH!!!

server2_1  | 2018/03/15 02:25:43 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:1229 Term:2 XXX_unrecognized:[]}, len(data): 25
server_1   | 2018/03/15 02:25:45 draft.go:669: Writing snapshot at index: 31456, applied mark: 31389
server_1   | 2018/03/15 02:25:45 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:31456 Term:2 XXX_unrecognized:[]}, len(data): 24
server3_1  | 2018/03/15 02:25:46 draft.go:669: Writing snapshot at index: 31456, applied mark: 31389
server3_1  | 2018/03/15 02:25:46 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:31456 Term:2 XXX_unrecognized:[]}, len(data): 25
zero_1     | 2018/03/15 02:25:49 oracle.go:85: purging below ts:44255, len(o.commits):129652, len(o.aborts):3969, len(o.rowCommit):0

I assume this is good news?

@pawanrawal
Contributor

A snapshot not occurring can happen when there are pending transactions that were neither committed nor aborted by the client. We added a mechanism on the server in v1.0.4 to abort such old pending transactions.
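
On the client side, the way to avoid leaving transactions pending is to make sure every transaction ends in a commit or a discard. A sketch of that pattern with dgraph-js (illustrative only, not taken from your code; the data is a placeholder):

import * as dgraph from "dgraph-js";

// A transaction that is neither committed nor discarded stays pending on the
// server and can hold back snapshots, so always discard in a finally block.
async function addPerson(client: dgraph.DgraphClient): Promise<void> {
  const txn = client.newTxn();
  try {
    const mu = new dgraph.Mutation();
    mu.setSetJson({ name: "Alice" }); // placeholder data
    await txn.mutate(mu);
    await txn.commit();               // success path
  } finally {
    await txn.discard();              // no-op after a successful commit, aborts otherwise
  }
}

Calling discard() in the finally block is safe because it is a no-op once the transaction has been committed.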

@emhagman
Author

emhagman commented Mar 15, 2018

Gotcha, makes sense.

server_1   | 2018/03/15 02:26:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:26:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:26:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:26:45 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:26:46 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:27:13 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:27:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:27:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:27:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:27:45 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:27:46 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:28:13 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server_1   | 2018/03/15 02:28:15 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server3_1  | 2018/03/15 02:28:16 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]
server2_1  | 2018/03/15 02:28:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]

After upgrading to 1.0.4, I am seeing that a lot.

@emhagman
Author

@pawanrawal I rebuilt my graph and finally got the snapshotting to work. I waited a while and made several queries to the "leader" before starting up the second and third replica servers, and now it is copying snapshots between servers.

My RDF file is close to 800MB. Is there a way to tell when a server is ready to be copied from? It seems that if I start the other replicas too early, the snapshot doesn't get copied over at all and a new p directory is created instead.

@pawanrawal
Contributor

That sounds like a bug; I would expect it to copy the snapshot over to the follower if it's already the leader. Feel free to file another issue mentioning this and we will get it checked and resolved.

@emhagman
Author

emhagman commented Mar 16, 2018

Filed another issue, #2236. Any progress on this one? For now I upgraded our server to 64GB to try to prevent an OOM from the snapshots not copying, in case this happens again.

@pawanrawal
Contributor

pawanrawal commented Mar 19, 2018

The snapshotting issue shouldn't occur on master. Try the dgraph/dgraph:master image and see if the server still becomes unresponsive.

The slowness could be due to the fact that Docker is trying to use swap. Do you have any limits on the memory used by the container in your docker-compose config?


server2_1 | 2018/03/15 02:28:43 draft.go:653: Couldn't take snapshot, txn watermark: [31466], applied watermark: [31389]

What is weird here is that the applied watermark is less than the txn watermark, which shouldn't happen. I couldn't reproduce it with the directory you shared. If it still happens for you, can you share the steps to reproduce, or the directories from the other servers as well?

@pawanrawal
Contributor

@emhagman /ping

@manishrjain manishrjain added investigate Requires further investigation kind/question Something requiring a response. labels Mar 22, 2018
@emhagman
Author

emhagman commented Mar 29, 2018

@pawanrawal I am getting this issue again, even after the graphs synced properly via snapshots in the beginning. I have more data from a new instance if you want to look at it.

Logs example:

server3_1  | 2018/03/29 17:17:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:18:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:18:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server3_1  | 2018/03/29 17:19:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:19:29 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:19:32 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:19:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:19:59 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:20:02 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:20:23 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:20:29 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:20:32 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]
server3_1  | 2018/03/29 17:20:53 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server2_1  | 2018/03/29 17:20:59 draft.go:653: Couldn't take snapshot, txn watermark: [1621775], applied watermark: [1630869]
server_1   | 2018/03/29 17:21:02 draft.go:653: Couldn't take snapshot, txn watermark: [208217], applied watermark: [208209]

I upped our memory to 64GB and the server has reached 46GB of usage now. At some point it's going to be unusable... I am now starting to get write timeouts, or writes that just take forever.

@manishrjain manishrjain added this to the Sprint-001 milestone Apr 3, 2018
@pawanrawal
Contributor

pawanrawal commented Apr 3, 2018

Hey @emhagman, can you try upgrading to the nightly? Another issue (#2266) regarding snapshots with replicas was fixed, and I don't think it's part of the binary that you have. I can confirm if you share the output of dgraph version.

@manishrjain manishrjain removed the investigate Requires further investigation label Apr 9, 2018
@manishrjain
Contributor

Closing due to no activity. Feel free to create a new issue if you find a related bug.
