
Bug Fix: Avoid Dgraph cluster getting stuck in infinite leader election #3391

Merged
merged 6 commits into master from mrjn/raft-comms on May 9, 2019

Conversation

@manishrjain (Contributor) commented May 9, 2019

Dgraph Alphas were calculating snapshots and checkpoints in the main Raft loop. Depending on disk speed, this could delay Raft Ticks by several seconds, causing followers to assume the leader was unavailable and trigger an election. Because the checkpoint and snapshot calculation happens every 30s, an election was being triggered every 30s as well.

This PR moves both calculations out of the main loop into their own goroutine (colocated with the code that shuts down the Raft node). Tested successfully against a live cluster that was exhibiting these symptoms.

This PR also tracks how many heartbeats have come in and gone out of each node and prints them under V(3), which is useful for debugging.

Finally, the PR improves x.Timer and uses it to track the latencies of the Raft.Ready components, reporting them in both Alphas and Zeros. This fixes the incorrect claim we were making about disk latency being the primary cause of Raft.Ready being slow.
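For readers who want the shape of the change, here is a minimal, hedged sketch of the pattern described above (illustrative only, not the actual Dgraph code; the names node, closer, checkpointAndClose and the tick intervals are stand-ins): the expensive periodic work runs in its own goroutine driven by a slow ticker, while the main Raft loop only does cheap per-tick work, so heartbeats keep flowing.

package main

import (
	"log"
	"time"
)

type node struct {
	closer chan struct{} // stand-in for the closer used to shut down the Raft node
}

// calculateSnapshotAndCheckpoint stands in for the disk-bound work that used
// to run inside the main Raft loop and could stall Ticks for seconds.
func (n *node) calculateSnapshotAndCheckpoint() {
	time.Sleep(2 * time.Second) // pretend we scanned the store for a while
	log.Println("snapshot/checkpoint calculated")
}

// checkpointAndClose runs the periodic work in its own goroutine, colocated
// with the shutdown path, as the PR describes.
func (n *node) checkpointAndClose() {
	slowTicker := time.NewTicker(30 * time.Second)
	defer slowTicker.Stop()
	for {
		select {
		case <-slowTicker.C:
			n.calculateSnapshotAndCheckpoint()
		case <-n.closer:
			return
		}
	}
}

// Run is the main Raft loop. With the slow work moved out, every tick is
// cheap, so followers keep seeing heartbeats and no election is triggered.
func (n *node) Run() {
	go n.checkpointAndClose()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// raftNode.Tick() would be called here.
		case <-n.closer:
			return
		}
	}
}

func main() {
	n := &node{closer: make(chan struct{})}
	go n.Run()
	time.Sleep(time.Second)
	close(n.closer)
}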



@manishrjain requested a review from a team as a code owner on May 9, 2019 02:21
@danielmai (Contributor) commented:


conn/node.go, line 253 at r1 (raw file):

		switch msg.Type {
		case raftpb.MsgHeartbeat, raftpb.MsgHeartbeatResp:
			atomic.AddInt64(&n.heartbeatsOut, 1)

This happens in glog.V(2) but the ticker in ReportRaftComms only zeroes out heartbeats in glog.V(3). Is that OK?

@danielmai (Contributor) commented:


conn/raft_server.go, line 232 at r1 (raw file):

				switch msg.Type {
				case raftpb.MsgHeartbeat, raftpb.MsgHeartbeatResp:
					atomic.AddInt64(&n.heartbeatsIn, 1)

Same question here too about incrementing in V(2) and glogging/zero-ing in V(3).

@manishrjain (Contributor, Author) left a comment

Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @danielmai and @manishrjain)


conn/node.go, line 253 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

This happens in glog.V(2) but the ticker in ReportRaftComms only zeroes out heartbeats in glog.V(3). Is that OK?

Yeah, that's alright. Even if uint64 overflows (??) and goes back to zero, it's not a big deal.

@danielmai (Contributor) left a comment

Reviewed 5 of 5 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @danielmai and @manishrjain)

@danielmai (Contributor) left a comment

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @manishrjain)


conn/node.go, line 253 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Yeah, that's alright. Even if uint64 overflows (??) and goes back to zero, it's not a big deal.

Sounds good.

@danielmai (Contributor) left a comment

:lgtm:

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @manishrjain)

@manishrjain merged commit bfcf784 into master on May 9, 2019
@manishrjain deleted the mrjn/raft-comms branch on May 9, 2019 02:44
manishrjain added a commit that referenced this pull request on May 9, 2019
Bug Fix: Avoid Dgraph cluster getting stuck in infinite leader election (#3391)


Changes:
* Report Heartbeat comms
* Add logs around heartbeats.
* Move snapshot and checkpoint calculation outside of the main Raft loop. Capture the latency of individual components in Raft.Ready better.
* Add timer to Zero as well. Fix a bug: Use a for loop when going over slow ticker.
* Move num pending txns to V(2).
* Move the checkpointing code outside of the Run func.
@@ -72,6 +72,9 @@ type Node struct {
	// The stages are proposed -> committed (accepted by cluster) ->
	// applied (to PL) -> synced (to BadgerDB).
	Applied y.WaterMark

	heartbeatsOut int64

Contributor commented:

atomic package issues here? This is 64 bit. https://golang.org/pkg/sync/atomic/#pkg-note-BUG
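The note referenced above is the sync/atomic alignment caveat: on ARM, x86-32, and 32-bit MIPS, 64-bit atomic operations require 64-bit-aligned words, and only the first word in an allocated struct is guaranteed to be aligned. A minimal sketch of the usual mitigation follows (field names other than the heartbeat counters are made up for illustration):

package main

import (
	"fmt"
	"sync/atomic"
)

type Node struct {
	// Keep the 64-bit counters first so atomic.AddInt64 stays safe on
	// 32-bit platforms (see https://golang.org/pkg/sync/atomic/#pkg-note-BUG).
	heartbeatsOut int64
	heartbeatsIn  int64

	id   uint32 // hypothetical smaller fields come after the 64-bit ones
	addr string
}

func main() {
	n := &Node{}
	atomic.AddInt64(&n.heartbeatsOut, 1)
	fmt.Println(atomic.LoadInt64(&n.heartbeatsOut))
}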

@@ -155,6 +158,20 @@ func NewNode(rc *pb.RaftContext, store *raftwal.DiskStorage) *Node {
	return n
}

func (n *Node) ReportRaftComms() {
	if !glog.V(3) {

Contributor commented:

Shouldn't this work for level > 3 too?
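For context, a hedged sketch of what a reporter of this shape could look like (illustrative, not the PR's exact code; the log format and one-second interval are assumptions). Note that glog.V(3) returns true for any verbosity level of 3 or higher, so higher levels are covered, and atomic.SwapInt64 is one way to read and zero a counter in a single step:

package main

import (
	"flag"
	"sync/atomic"
	"time"

	"github.com/golang/glog"
)

type Node struct {
	heartbeatsOut int64
	heartbeatsIn  int64
}

// ReportRaftComms periodically logs and resets the heartbeat counters.
func (n *Node) ReportRaftComms() {
	if !glog.V(3) {
		return
	}
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// Swap returns the current count and resets it to zero atomically.
		out := atomic.SwapInt64(&n.heartbeatsOut, 0)
		in := atomic.SwapInt64(&n.heartbeatsIn, 0)
		glog.V(3).Infof("RaftComm: [heartbeats out: %d, in: %d]", out, in)
	}
}

func main() {
	flag.Parse() // glog registers -v; run with -v=3 to see the report
	n := &Node{}
	go n.ReportRaftComms()
	atomic.AddInt64(&n.heartbeatsOut, 1) // e.g. on sending MsgHeartbeat
	time.Sleep(2 * time.Second)
}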

@@ -964,14 +983,13 @@ func (n *node) Run() {
			span.End()
			ostats.RecordWithTags(context.Background(),
				[]tag.Mutator{tag.Upsert(x.KeyMethod, "alpha.RunLoop")},
-				x.LatencyMs.M(x.SinceMs(start)))
+				x.LatencyMs.M(float64(timer.Total())/1e6))

Contributor commented:

Could divide by time.Millisecond instead here.
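A small sketch of the two equivalent conversions being discussed, assuming timer.Total() returns a time.Duration (which the /1e6 nanosecond-to-millisecond conversion suggests):

package main

import (
	"fmt"
	"time"
)

func main() {
	total := 1500 * time.Millisecond // stand-in for timer.Total()

	// A time.Duration counts nanoseconds, so both forms yield milliseconds.
	ms1 := float64(total) / 1e6
	ms2 := float64(total) / float64(time.Millisecond)

	fmt.Println(ms1, ms2) // 1500 1500
}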

func (t *Timer) All() []time.Duration {
	return t.records
}

func (t *Timer) String() string {
	sort.Slice(t.records, func(i, j int) bool {

Contributor commented:

Is sorting helpful in viewing the results?
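For reference, a minimal timer of the shape this diff suggests (a sketch only, not the actual x.Timer implementation; the Start/Record methods, the descending sort order, and the output format are assumptions). One possible answer to the question above is that sorting longest-first puts the slowest Raft.Ready phase at the front of the log line:

package main

import (
	"fmt"
	"sort"
	"time"
)

// Timer records the durations of successive phases of a loop iteration.
type Timer struct {
	last    time.Time
	records []time.Duration
}

func (t *Timer) Start() { t.last = time.Now() }

// Record closes the current phase and starts timing the next one.
func (t *Timer) Record() {
	now := time.Now()
	t.records = append(t.records, now.Sub(t.last))
	t.last = now
}

func (t *Timer) All() []time.Duration { return t.records }

func (t *Timer) Total() time.Duration {
	var total time.Duration
	for _, r := range t.records {
		total += r
	}
	return total
}

// String sorts the recorded durations longest-first so the slowest phase is
// the first thing visible in a log line.
func (t *Timer) String() string {
	sort.Slice(t.records, func(i, j int) bool { return t.records[i] > t.records[j] })
	return fmt.Sprintf("%v", t.records)
}

func main() {
	var t Timer
	t.Start()
	time.Sleep(10 * time.Millisecond)
	t.Record() // e.g. a "disk" phase
	time.Sleep(30 * time.Millisecond)
	t.Record() // e.g. a "proposals" phase
	fmt.Println(t.String(), "total:", t.Total())
}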

dna2github pushed a commit to dna2fork/dgraph that referenced this pull request on Jul 19, 2019
Bug Fix: Avoid Dgraph cluster getting stuck in infinite leader election (dgraph-io#3391)
