Hardened thread pool #1913

abi87 · 2023-08-24T06:34:40Z

Why this should be merged

@darioush introduced goleak to ava-labs/coreth#273.
I applied it to some avalanchego packages and realized we currently can't shut down the thread pool we use in avalanchego.
This PR fixes the issue and unlocks fixing the leaking goroutines (to be done in a subsequent PR).

How this works

Hardened the thread pool so that:

it can be shut down cleanly
it behaves sanely (no panic) when Send/Shutdown are called out of order (e.g. before Start is called)
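
For readers skimming the thread, here is a minimal sketch of the shape described above. It is not the code in this PR: the field names are borrowed from the diff snippets quoted in the review, while `demo` and the exact method bodies are assumptions made for illustration. Note that in this sketch a Send issued before Start simply blocks until Start or Shutdown is called.

```go
package main

import (
	"fmt"
	"sync"
)

type Request func()

// pool is a simplified, hypothetical worker pool (not the avalanchego code).
type pool struct {
	workersCount int
	requests     chan Request // unbuffered: Send hands off directly to a worker
	quit         chan struct{}
	startOnce    sync.Once
	shutdownOnce sync.Once
	shutdownWG   sync.WaitGroup
}

func NewPool(workersCount int) *pool {
	return &pool{
		workersCount: workersCount,
		requests:     make(chan Request),
		quit:         make(chan struct{}),
	}
}

// Start launches the workers; repeated calls have no further effect.
func (p *pool) Start() {
	p.startOnce.Do(func() {
		for i := 0; i < p.workersCount; i++ {
			p.shutdownWG.Add(1)
			go p.runWorker()
		}
	})
}

func (p *pool) runWorker() {
	defer p.shutdownWG.Done()
	for {
		select {
		case <-p.quit:
			return
		case request := <-p.requests:
			request()
		}
	}
}

// Send blocks until a worker takes the request or the pool is shut down.
func (p *pool) Send(request Request) {
	select {
	case p.requests <- request:
	case <-p.quit:
	}
}

// Shutdown is idempotent and returns once every started worker has exited.
func (p *pool) Shutdown() {
	p.shutdownOnce.Do(func() { close(p.quit) })
	p.shutdownWG.Wait()
}

// demo returns how many requests ran and whether a post-Shutdown request ran.
func demo() (int, bool) {
	p := NewPool(2)
	p.Start()

	results := make(chan struct{}, 3)
	for i := 0; i < 3; i++ {
		p.Send(func() { results <- struct{}{} })
	}
	p.Shutdown()
	p.Shutdown() // second call is a no-op

	lateRan := false
	p.Send(func() { lateRan = true }) // no worker is left to receive it
	return len(results), lateRan
}

func main() {
	executed, lateRan := demo()
	fmt.Println(executed, lateRan) // 3 false
}
```

Because the request channel is unbuffered and is never closed, a late Send can only fall through to the quit case, so it neither panics nor runs the request.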

How this was tested

New UTs (100% package coverage) + CI

joshua-kim · 2023-08-24T14:53:23Z

snow/networking/worker/pool.go

+	return p, nil
+}
+
+func (p *pool) Start() {


Do we expect Start to be called multiple times? I see we're using sync.Once to guard against it.

We can also get rid of this edge case by leaving the goroutine initialization in NewPool. Instead of having Shutdown code, we can pass a channel/context into NewPool so the caller can shut down the pool by cancelling the context or closing the channel. I think this will simplify a lot of the Start/Shutdown code.

If we go with (2) I would recommend renaming NewPool to StartWorkers or something to make it clear that some goroutines are spinning up.

Ignore this comment in favor of using errgroup

abi87 · 2023-08-24T16:44:04Z

snow/networking/worker/pool.go

+	workersCount int
+	requests     chan Request

+	shutdown     bool


used to make Send a no-op after Shutdown is called

StephenButtolph · 2023-08-24T18:37:09Z

snow/networking/worker/pool.go

+	if p.noMoreSends.Get() {
+		return
+	}
+


I don't follow what this is needed for. The request queue isn't buffered... so I don't really see what we are gaining out of having this flag at all.

The idea is to avoid requests being executed after Shutdown is called. Indeed, this is already guaranteed: once the quit channel is closed and selected, no worker is listening anymore. Dropped the flag.

StephenButtolph · 2023-08-24T18:39:19Z

snow/networking/worker/pool_test.go

+	"github.com/stretchr/testify/require"
+)
+
+func TestPoolHandlesRequests(_ *testing.T) {


What is this testing?

Inserted booleans to record that each request has been executed; the test now asserts on them.

StephenButtolph · 2023-08-24T18:41:01Z

snow/networking/worker/pool_test.go

+
+	// late requests, after Shutdown, are no-ops that won't panic
+	lateRequest := func() {
+		time.Sleep(time.Minute)


Why is this sleeping for a minute?

The idea is to show that the request is not executed at all. If it were, a one-minute sleep would be long enough for the test to be terminated. In hindsight, it's better to signal that the job is done with a boolean and check that.

joshua-kim · 2023-08-24T19:25:20Z

I chatted w/ Alberto offline but I think we can replace most of the code in the existing worker.Pool with an errgroup.Group, since SetLimit caps the amount of goroutines (ref). The semantics of our Send and errgroup.Go are the same, where both block until a worker goroutine is free to accept the task so it should work as a drop-in replacement to the worker loops we manage.

abi87 · 2023-08-25T07:22:40Z

I chatted w/ Alberto offline but I think we can replace most of the code in the existing worker.Pool with an errgroup.Group, since SetLimit caps the amount of goroutines (ref). The semantics of our Send and errgroup.Go are the same, where both block until a worker goroutine is free to accept the task so it should work as a drop-in replacement to the worker loops we manage.

@joshua-kim, thanks for the input! I tried it out and I think there is a difference in the way worker.Pool and errgroup.Group terminate that makes them non-equivalent for our use.
We'd like worker.Pool to stop accepting requests after shutdown. IIUC, errgroup.Group has no such feature built in.
What I think I could do is replace worker.Pool's internal goroutines with an errgroup.Group and use the quit channel to ensure Shutdown.
But maybe it's just simpler to keep the goroutines?

joshua-kim · 2023-08-25T12:15:18Z

We'd like worker.Pool to stop accepting requests after shutdown. IIUC, errgroup.Group has not such a feature built in.

Yeah we can use an atomic bool to indicate the worker pool is closed
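
A rough sketch of the atomic-bool idea, under some loud assumptions: to stay stdlib-only it does not use errgroup at all; `limitedGroup` is a hypothetical stand-in that caps concurrency with a buffered channel, and every name in it is invented for illustration. It needs Go 1.19+ for `atomic.Bool`. The check in Send is not atomic with Shutdown, so a Send racing with Shutdown may still be accepted; the sketch only shows that sends issued after Shutdown has returned are dropped.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limitedGroup is a hypothetical stand-in for a goroutine group with a limit:
// Send blocks while `limit` tasks are already running.
type limitedGroup struct {
	closed atomic.Bool
	slots  chan struct{}
	wg     sync.WaitGroup
}

func newLimitedGroup(limit int) *limitedGroup {
	return &limitedGroup{slots: make(chan struct{}, limit)}
}

// Send runs task on its own goroutine unless the group has been closed.
// It reports whether the task was accepted.
func (g *limitedGroup) Send(task func()) bool {
	if g.closed.Load() {
		return false // no-op after Shutdown
	}
	g.slots <- struct{}{} // blocks while the limit is reached
	g.wg.Add(1)
	go func() {
		defer func() {
			<-g.slots
			g.wg.Done()
		}()
		task()
	}()
	return true
}

// Shutdown rejects future sends and waits for the running tasks to finish.
func (g *limitedGroup) Shutdown() {
	g.closed.Store(true)
	g.wg.Wait()
}

// demo reports whether a send was accepted before and after Shutdown.
func demo() (bool, bool) {
	g := newLimitedGroup(2)
	before := g.Send(func() {})
	g.Shutdown()
	after := g.Send(func() {})
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // true false
}
```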

snow/networking/worker/pool_test.go

Co-authored-by: Stephen Buttolph <stephen@avalabs.org> Signed-off-by: Alberto Benegiamo <alberto.benegiamo@gmail.com>

joshua-kim · 2023-08-25T15:33:50Z

snow/networking/worker/pool.go

 	// Send the request to the worker pool.
 	//
-	// Send should never be called after [Shutdown] is called.
+	// Send can be safely called after [Shutdown] and it'll be no-op.


Suggested change

// Send can be safely called after [Shutdown] and it'll be no-op.

// Send is a no-op if [Shutdown] has been called.

joshua-kim · 2023-08-25T15:36:51Z

snow/networking/worker/pool.go

+	// [shutdownOnce] ensures Shutdown idempotency
 	shutdownOnce sync.Once
-	shutdownWG   sync.WaitGroup
+
+	// [shutdownWG] makes sure all workers have stopped before Shutdown returns
+	shutdownWG sync.WaitGroup
+
+	// closing [quit] tells the workers to stop working


nit: leaving this to your preference but I personally don't write comments on variables that are self-documenting. I think these are all being used in pretty idiomatic ways so I don't think these are necessary to have. Feel free to ignore this comment.

joshua-kim · 2023-08-25T15:37:06Z

snow/networking/worker/pool.go

+		p.shutdownWG.Add(1)
 		go p.runWorker()
 	}
+


nit: undo diff

joshua-kim · 2023-08-25T15:37:33Z

snow/networking/worker/pool.go

+	for {
+		select {
+		case <-p.quit:
+			return // stop worker


nit: this comment isn't necessary

indeed, dropped

joshua-kim · 2023-08-25T15:38:23Z

snow/networking/worker/pool.go

+		case <-p.quit:
+			return // stop worker
+		case request := <-p.requests:
+			if request != nil {


If someone hands us a nil request I feel like it's okay to just fail obviously and die instead of silently dropping a nil request somewhere. I think we can remove this check?

removed the test. It wasn't there indeed, so it should be fine

joshua-kim · 2023-08-25T15:41:47Z

snow/networking/worker/pool.go

+		// We don't close requests channel to avoid panics
+		// upon sending request over a closed channel.


nit: move this comment above the close line. I personally would probably not comment this because I feel like this line is self-documenting, but I'm leaving the inclusion of this to your preference.

We also no longer close p.requests here, is that okay?

If we close p.requests, we risk a panic if another goroutine invokes Send. I tried to express this in the comment.
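
To make the trade-off concrete, here is a small standalone sketch (the helpers `sendPanics` and `sendOrQuit` are hypothetical names, not part of the PR). It shows that sending on a closed channel panics, whereas leaving the data channel open and closing a separate quit channel lets the sender back out safely.

```go
package main

import "fmt"

// sendPanics reports whether sending on ch panics.
func sendPanics(ch chan int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch <- 1
	return false
}

// sendOrQuit hands v to ch unless quit is closed; it reports delivery.
func sendOrQuit(ch chan int, quit chan struct{}, v int) bool {
	select {
	case ch <- v:
		return true
	case <-quit:
		return false
	}
}

func main() {
	closed := make(chan int, 1)
	close(closed)
	fmt.Println(sendPanics(closed)) // true: send on a closed channel panics

	open := make(chan int) // unbuffered and never closed, no receiver
	quit := make(chan struct{})
	close(quit)
	fmt.Println(sendOrQuit(open, quit, 1)) // false: sender backs out, no panic
}
```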

StephenButtolph

Should have caught this on the first pass... sorry

StephenButtolph · 2023-08-28T17:08:19Z

snow/networking/handler/handler.go

 		// we check the value of [h.closing] after the call to [Signal].
 		h.syncMessageQueue.Shutdown()
 		h.asyncMessageQueue.Shutdown()
+		h.asyncMessagePool.Shutdown()


This can cause the chain router's shutdown to hang indefinitely on a chain that isn't shutting down correctly... (Stop should never block. Can we add that to the comment on Stop?)

Actually, looking at this with a fresh set of eyes... I'm not sure the existing code is incorrect at all. It seems like the goroutines are shut down correctly, and the handler exposes the ability to block on the goroutine shutdown by calling AwaitStopped.

fwiw - I still think the pool changes are good... But I don't think we should be calling shutdown here.

abi87 · 2023-08-29T10:28:02Z

snow/networking/handler/handler.go

 	defer h.ctx.Lock.Unlock()

+	// h.asyncMessagePool must be started before async dispatch
+	h.asyncMessagePool.Start()


@joshua-kim, @StephenButtolph, I ended up re-introducing the Start method. I believe I need it to avoid goroutine leaks in some UTs.
In production code we always call pool.Start after its creation in snow/networking/handler/handler.go. However, there are some UTs where we don't want to call handler.Start, since we want to inspect the handler queues with Len() (see TestRouterCrossChainMessages).
TestRouterCrossChainMessages would leak the pool goroutines if they were started in NewPool().
Launching the goroutines in handler.Start solves the issue: since handler.Start is not called, the goroutines are never started, hence no leak. You can see the fully fixed TestRouterCrossChainMessages in my next PR.

abi87 · 2023-08-29T17:07:34Z

Closing this PR in favor of #1940

hardened thread pool

cbba9bc

abi87 requested review from danlaine and StephenButtolph as code owners August 24, 2023 06:34

abi87 self-assigned this Aug 24, 2023

abi87 added 4 commits August 24, 2023 08:43

nits

eab2171

fixed data race

64ee39d

improved test

03c63f4

nits

35c23ad

abi87 changed the title from "hardened thread pool" to "Hardened thread pool" Aug 24, 2023

abi87 requested review from joshua-kim and ceyonur August 24, 2023 09:24

abi87 added the cleanup Code quality improvement label Aug 24, 2023

joshua-kim reviewed Aug 24, 2023

abi87 added 2 commits August 24, 2023 18:42

thread pool simplification

c97d15a

Merge branch 'dev' into thread_pool_rework

b0bd31b

abi87 commented Aug 24, 2023

abi87 added 5 commits August 24, 2023 19:08

some more pool cleanup and test improvements

04e79e7

appease linter

fc488e3

Merge branch 'dev' into thread_pool_rework

4ea8fc5

fix data race

61cb4f7

Merge branch 'thread_pool_rework' of github.com:ava-labs/avalanchego into thread_pool_rework

c3af91a

abi87 requested a review from joshua-kim August 24, 2023 17:44

StephenButtolph reviewed Aug 24, 2023

Merge branch 'dev' into thread_pool_rework

3e8319b

test cleanup + shutdown flag removed

e403391

Merge branch 'dev' into thread_pool_rework

c835a59

abi87 requested a review from StephenButtolph August 25, 2023 07:35

StephenButtolph approved these changes Aug 25, 2023

Update snow/networking/worker/pool_test.go

e14ebf6

Co-authored-by: Stephen Buttolph <stephen@avalabs.org> Signed-off-by: Alberto Benegiamo <alberto.benegiamo@gmail.com>

StephenButtolph added this to the v1.10.10 milestone Aug 25, 2023

appease linter

4026cbe

joshua-kim reviewed Aug 25, 2023

nits

f3c9ef8

abi87 requested a review from joshua-kim August 25, 2023 16:22

joshua-kim approved these changes Aug 28, 2023

Merge branch 'dev' into thread_pool_rework

936021d

StephenButtolph requested changes Aug 28, 2023

abi87 added 3 commits August 29, 2023 09:19

Merge branch 'dev' into thread_pool_rework

d44d164

fixed wrongly placed thread pool shutdown

eb91dff

reintroduced pool workers start

bcd6382

abi87 commented Aug 29, 2023

abi87 requested review from joshua-kim and StephenButtolph August 29, 2023 10:41

abi87 closed this Aug 29, 2023

StephenButtolph deleted the thread_pool_rework branch July 24, 2024 20:47
