p2p, network, bzzeth: p2p protocol handlers to async by pradovic · Pull Request #2018 · ethersphere/swarm

pradovic · 2019-12-10T14:04:33Z

This is an idea for how to run handlers asynchronously in the p2p.Peer run event loop. It is still in progress, so I have not yet commented all the code or refactored it to look better. The main idea was to:

1. Make all handler execution async.
2. Exit the run loop with an appropriate error.
3. Return that error instead of relying on the side peer.Drop. If I am right, this disconnects the peer with the actual produced error instead of the subprotocol error with the error string from the handler. I am not sure if this is important or desired. If not, we could simplify the code in the run loop a bit.
4. Separate the sync execution needed for the handshake from the run loop, as I feel it makes the code simpler. If there is a need for a sync run loop, we can easily add it as a separate function as well (reusing the existing private function in the package).

There are a couple of todo comments that I need to investigate more; I would appreciate any input about those :)

todo:

- Write more isolated tests to test and document the behavior of the Run event loop.
- Rethink stream graceful shutdown a bit.
- This PR does not cover logging errors or making the actual errors nicer. If we agree on this proposal, we can continue by making the errors nicer and refactoring the code to look a bit better (e.g. use context cancellation instead of the running bool + mutex in protocol.Run, etc.). Also, if we don't think there is any benefit from 3., I can reduce the code so that it does not support this.
- Investigate handleMsgPauser in stream, which is used for tests, and see how it fits this kind of loop.
- It appears that the serverCollectBatch function in stream produces some leaking goroutines; I should investigate it a bit. It looks like it is happening on the master branch as well, but it might happen more now.

LMKWYT :)
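
For readers who want a concrete picture of points 1 to 3, here is a minimal sketch, not the code in this PR. It assumes messages are plain ints and that `runLoop`, `read`, `handle`, `demo` and `errStop` are hypothetical names. Each handler runs in its own goroutine, and the loop returns the first handler error to its caller rather than dropping the peer as a side effect.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStop is a hypothetical sentinel returned by read when the connection ends.
var errStop = errors.New("connection closed")

// result carries one read outcome from the reader goroutine to the loop.
type result struct {
	msg int
	err error
}

// runLoop starts one goroutine per message and returns the first handler
// error, or the read error if reading fails first.
func runLoop(read func() (int, error), handle func(int) error) error {
	var wg sync.WaitGroup
	defer wg.Wait() // runs last: wait for in-flight handlers before returning

	done := make(chan struct{})
	defer close(done) // runs first: lets the reader goroutine exit

	handlerErr := make(chan error, 1) // holds only the first handler error
	incoming := make(chan result)

	go func() { // reading in a goroutine keeps a blocked read from hiding errors
		for {
			m, err := read()
			select {
			case incoming <- result{m, err}:
				if err != nil {
					return
				}
			case <-done:
				return
			}
		}
	}()

	for {
		select {
		case err := <-handlerErr:
			return fmt.Errorf("message handler error: %w", err)
		case r := <-incoming:
			if r.err != nil {
				return r.err
			}
			wg.Add(1)
			go func(m int) {
				defer wg.Done()
				if err := handle(m); err != nil {
					select {
					case handlerErr <- err:
					default: // an earlier error is already pending
					}
				}
			}(r.msg)
		}
	}
}

// demo feeds messages 1 and 2, then blocks the reader until the loop exits.
func demo() error {
	release := make(chan struct{})
	defer close(release)
	next := 0
	read := func() (int, error) {
		if next < 2 {
			next++
			return next, nil
		}
		<-release // simulate an idle connection
		return 0, errStop
	}
	handle := func(m int) error {
		if m == 2 {
			return fmt.Errorf("bad message %d", m)
		}
		return nil
	}
	return runLoop(read, handle)
}

func main() {
	fmt.Println(demo())
}
```

The buffered `handlerErr` channel with a non-blocking send is one way to keep only the first error; a context with cancellation, as mentioned in the todo list, would be another.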


pradovic · 2019-12-10T14:05:56Z


 	// expect peer disconnection
-	err = tester.TestDisconnected(&p2ptest.Disconnect{Peer: node.ID(), Error: errors.New("subprotocol error")})
+	err = tester.TestDisconnected(&p2ptest.Disconnect{Peer: node.ID(), Error: errors.New("Message handler error: (msg code 0): unsolicited chunk delivery from peer: cannot find ruid")})


This is an example of the difference in disconnect errors that I mentioned in 3. in the description.

@zelig It appears you were right. The disconnect will eventually happen with the subprotocol error anyway. But if we return the specific error from the run loop instead of doing a peer.Stop(), the actual error will be broadcast via the peer broadcast (which is how it is checked in tester.TestDisconnected). I am still not sure whether it is useful to us, but it is probably cleaner.

janos

I agree on the approach and would like this PR polished.

> Write more isolated test to test and document the behavior of Run event loop.

That would be great.

> Rethink stream graceful shutdown a bit.

👍

> This PR does not cover logging errors and making the actual error more nice. If we agree on this proposal, we can continue to making an error more nice and refactor a code to look a bit better (ex. use context cancellation instead of the running bool + mutex in protocol.Run, etc ...). Also, if we don't think that there is any benefit from 3., I can reduce the code not to support this.

It would be great to address errors as well, but in a follow-up PR if possible, to keep the PRs smaller and leave this one with only async protocols. I think that there is a lot of room for error improvements and that they may make the changes in this PR harder to review.

> Investigate handleMsgPauser in stream, that is used for tests and see how it feels this kind of loop.

Pauser would be useful if it were available at the protocols level for testing purposes.

> It appears that serverCollectBatch function in stream produces some leaking go routines, should investigate it a bit. It looks like it is happening on a master branch as well, but still, might happen more now.

We should investigate this, as serverCollectBatch produces long-running goroutines while they are waiting for chunk descriptions from the subscription. They can leak if the peer or registry is not closed.

LMKWYT :)

janos · 2019-12-10T17:41:00Z

 	spec      *Spec
 	encode    func(context.Context, interface{}) (interface{}, int, error)
 	decode    func(p2p.Msg) (context.Context, []byte, error)
+	eg        *errgroup.Group // error group used for executing handlers asynchronously


I think that eg can be a value, since Peer is using pointer semantics.

janos · 2019-12-10T17:48:03Z

+	p.running = false
+	p.mtx.Unlock()
+
+	p.eg.Wait()


The comment on error handling inside the errgroup with stop should justify not handling the error on Wait.

But, if it is not needed to handle the error here, could errgroup be replaced with WaitGroup?

Yup, nice catch! I left it there in case we decide to do the shutdown with the commented code. I am not sure if we are going to need that; it depends on the graceful shutdown. I will switch to a WaitGroup for now and bring back eg if needed.
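
As an illustration of the two suggestions above (a value field rather than a pointer, and a plain WaitGroup when the error from Wait is not used), here is a small sketch. The `peer` type, `handleAsync`, `stop` and `runHandlers` are hypothetical stand-ins, not the types in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peer is a hypothetical stand-in for the protocols Peer type.
type peer struct {
	wg      sync.WaitGroup // value field: fine because peer is only used via *peer
	handled int64          // number of handlers that have finished
}

// handleAsync runs one handler in its own goroutine and tracks it in wg.
func (p *peer) handleAsync(handler func()) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		handler()
		atomic.AddInt64(&p.handled, 1)
	}()
}

// stop blocks until every started handler has returned, then reports
// how many handlers completed.
func (p *peer) stop() int64 {
	p.wg.Wait()
	return atomic.LoadInt64(&p.handled)
}

// runHandlers starts n no-op handlers and waits for all of them.
func runHandlers(n int) int64 {
	p := &peer{}
	for i := 0; i < n; i++ {
		p.handleAsync(func() {})
	}
	return p.stop()
}

func main() {
	fmt.Println(runHandlers(5))
}
```

A WaitGroup must not be copied after first use, which is why the value field is only safe as long as the enclosing struct is always handled through a pointer.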

zelig

Brilliant PR thanks

- Many places log an error and then return the same error with less context. Consider keeping the richer context of the log in the error, but only return the error.
- The stream package needs serious cleanup. I put a lot of comments, but feel free to keep changes to the minimum and defer sorting it out to subsequent PRs by whoever takes it up.
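
The first point can be sketched as follows. This is an illustration under assumptions, not code from the PR: `errNotFound`, `deliver` and `handleMsg` are hypothetical names. The context that would otherwise go into a log line is wrapped into the returned error with `%w`, so the caller can both print it and test for the cause.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for a missing request id.
var errNotFound = errors.New("cannot find ruid")

// deliver always fails in this sketch so that the wrapping is visible.
func deliver(ruid uint) error {
	return fmt.Errorf("unsolicited chunk delivery, ruid %d: %w", ruid, errNotFound)
}

// handleMsg adds its own context and returns the error without logging it.
func handleMsg(code uint64, ruid uint) error {
	if err := deliver(ruid); err != nil {
		return fmt.Errorf("message handler error (msg code %d): %w", code, err)
	}
	return nil
}

func main() {
	err := handleMsg(0, 7)
	fmt.Println(err)
	fmt.Println(errors.Is(err, errNotFound)) // the cause survives both wraps
}
```

With this shape, a single log call at the point where the run loop exits is enough to record the full chain.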

zelig · 2019-12-13T04:30:53Z

 			deliveredCnt++
 			p.logger.Trace("bzzeth.handleNewBlockHeaders", "hash", ch.Address().Hex(), "delivered", deliveredCnt)
+
+			req.lock.RLock()


oops, was this not async before, or did you find a bug?

It was async before as well. I am not sure whether it is really a bug, but it is potentially dangerous. Basically, if I am right, there is a lock in req that is meant to guard updates to the hashes map. This lock is used for updates but not for reads, which might not be fatal, but is still not quite correct :)
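
A short sketch of the locking pattern being discussed, with a hypothetical `request` type standing in for the bzzeth one. In Go, a map read that overlaps a write without synchronization is a data race, and the runtime can abort with a "concurrent map read and map write" fatal error, so readers take the read lock as well.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for the bzzeth request type.
type request struct {
	lock   sync.RWMutex
	hashes map[string]bool
}

// markDelivered updates the map under the write lock.
func (r *request) markDelivered(hash string) {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.hashes[hash] = true
}

// delivered reads the map under the read lock, so it never overlaps a write.
func (r *request) delivered(hash string) bool {
	r.lock.RLock()
	defer r.lock.RUnlock()
	return r.hashes[hash]
}

// countDelivered marks and reads n distinct hashes from n goroutines
// and returns how many entries ended up in the map.
func countDelivered(n int) int {
	r := &request{hashes: make(map[string]bool)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h := fmt.Sprintf("hash-%d", i)
			r.markDelivered(h)
			_ = r.delivered(h)
		}(i)
	}
	wg.Wait()
	r.lock.RLock()
	defer r.lock.RUnlock()
	return len(r.hashes)
}

func main() {
	fmt.Println(countDelivered(10))
}
```

Running such code with `go run -race` is a quick way to confirm that reads and writes are properly ordered.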

zelig · 2019-12-13T04:32:19Z

-		p.logger.Warn("bzzeth.handleBlockHeaders: nonexisting request id", "id", msg.Rid)
-		p.Drop("nonexisting request id")
-		return
+		p.logger.Warn("", "id", msg.Rid)


log not needed, and certainly not with empty message

zelig · 2019-12-13T04:34:24Z

 	if err != nil {
 		p.logger.Warn("bzzeth.handleBlockHeaders: fatal dropping peer", "id", msg.Rid, "err", err)
-		p.Drop("error on deliverAndStoreAll")
+		return fmt.Errorf("bzzeth.handleBlockHeaders: fatal dropping peer, id: %d err: %w", msg.Rid, err)


how about just

return b.deliverAndStoreAll(ctx, req, headers)

on line 246

zelig · 2019-12-13T04:35:06Z

 	// wait for all validations to get over and close the channels
 	err := wg.Wait()

+	// finish storage is used mostly in testing


so is it needed?

zelig · 2019-12-13T04:36:45Z

 func (h *Hive) handleSubPeersMsg(ctx context.Context, d *Peer, msg *subPeersMsg) error {
 	d.setDepth(msg.Depth)
 	// only send peers after the initial subPeersMsg
+	h.lock.Lock()


so this was not async before?

Correct, it was not; it used to be sync. Basically, the check for the sentPeers boolean value should pass only once, and all other calls should then skip this part. Does that make sense?
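
The "should pass only once" requirement can be sketched like this. The `peer` type and `firstSubPeersMsg` are hypothetical names; the point is only that the check and the update of the flag happen under the same lock, so concurrent handlers cannot both see `false`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peer is a hypothetical stand-in holding the sentPeers flag.
type peer struct {
	mu        sync.Mutex
	sentPeers bool
}

// firstSubPeersMsg reports true for exactly one caller, the first one.
func (p *peer) firstSubPeersMsg() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.sentPeers {
		return false
	}
	p.sentPeers = true
	return true
}

// countFirst calls firstSubPeersMsg from n goroutines and counts how many
// of them were told they are first.
func countFirst(n int) int64 {
	p := &peer{}
	var wg sync.WaitGroup
	var first int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p.firstSubPeersMsg() {
				atomic.AddInt64(&first, 1)
			}
		}()
	}
	wg.Wait()
	return first
}

func main() {
	fmt.Println(countFirst(50))
}
```

`sync.Once` would express the same guarantee when the guarded work can run inside the `Do` callback.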

zelig · 2019-12-13T06:30:08Z

+// * handles decoding with reflection,
+// * call handlers as callbacks
+func (p *Peer) handleMsg(msg *p2p.Msg, handle func(ctx context.Context, msg interface{}) error) error {
+	// make sure that the payload has been fully consume


zelig · 2019-12-13T06:49:51Z

+			if err != nil {
+				if err != io.EOF {
+					metrics.GetOrRegisterCounter("peer.handleincoming.error", nil).Inc(1)
+					log.Error("peer.handleIncoming", "err", err)


change metrics name and log message according to rename if needed

zelig · 2019-12-13T07:40:52Z

-// * handles decoding with reflection,
-// * call handlers as callbacks
-func (p *Peer) handleIncoming(handle func(ctx context.Context, msg interface{}) error) error {
+// Receive(code) is a sync call that handles incoming message with provided message handler


Some copy-paste mistake 🙈

zelig · 2019-12-13T07:42:02Z

-// * call handlers as callbacks
-func (p *Peer) handleIncoming(handle func(ctx context.Context, msg interface{}) error) error {
+// Receive(code) is a sync call that handles incoming message with provided message handler
+func (p *Peer) Receive(handler func(ctx context.Context, msg interface{}) error) error {


sure we want this exported?

Nope, good catch. We can export it later if needed.

zelig · 2019-12-13T07:43:44Z


 	if msg.Size > p.spec.MaxMsgSize {
-		return errorf(ErrMsgTooLong, "%v > %v", msg.Size, p.spec.MaxMsgSize)
+		err := errorf(ErrMsgTooLong, "%v > %v", msg.Size, p.spec.MaxMsgSize)


why are these changes needed?

Oh, sorry, it's probably just my debug leftover 🙈

pradovic · 2019-12-13T12:03:13Z

Thanks @zelig! I agree with both points. I left them for the next PR, as @janos and I wanted to go through the errors, especially in stream, after this one. I will fix all the other suggested changes and leave the error-related ones as a reference for the next PR, to avoid bloating this one too much. I will also add a couple of unit tests in this one. Sounds good?

pradovic · 2019-12-17T12:14:29Z

@janos, @zelig I believe I fixed what was suggested in the comments. Some things related to the stream, errors and pause are left for the next error-related PRs, which should come very soon. Also, during the error pruning, I will investigate a bit more whether we need the correct error in the node events. If we don't need it (no need to propagate errors, other than readMsg errors, in the run method), then I will switch back to peer drop. Basically, it will just make our run function less complicated, but we will still log correct errors.

pradovic · 2019-12-19T19:00:50Z

Smoke tests seem to pass. @janos thanks for help!

janos

@pradovic if you could address these two minor comments https://github.com/ethersphere/swarm/pull/2018/files#r357484127 https://github.com/ethersphere/swarm/pull/2018/files#r357485691, otherwise, LGTM.

acud · 2020-01-06T03:44:42Z

+	}()
+
+	return func() error {
+		return <-errc


goroutine leak? write into the channel a nil value in the main goroutine when the function returns

Hmmm, the only place where this function is called is in the run method, which is waiting on this channel in order to return. How can it leak? I am not sure wdym 🙈 Maybe it could leak if there were multiple callers waiting, but this is not the use case for now.

well, if you'll look again you'll see that errc is only written to in case of an error, but the channel is never closed otherwise. for cleanliness' sake i'd be happy if there is a defer close(errc) in the main goroutine that is on line 249

Hmmm, it makes sense in general, but then the "lingering" goroutines cannot send to it. It can be solved in the implementation, but since we abandoned this approach in the error-related PR I will not update it now, and will keep this in mind if we switch back. Nice catch, thanks!
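
One way to reconcile the two concerns in this thread (the channel should eventually be closed, but a late sender must never hit a closed channel) is sketched below. It is an assumption-laden illustration, not the approach taken in the PR; `startWorkers` and `firstError` are hypothetical names. The channel has a buffer of one and is closed only after every sender has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// startWorkers launches n workers and returns a wait function that yields
// the first reported error, or nil once every worker finished cleanly.
func startWorkers(n int, work func(int) error) func() error {
	errc := make(chan error, 1) // buffer of one: the first error never blocks
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := work(i); err != nil {
				select {
				case errc <- err:
				default: // another worker already reported an error
				}
			}
		}(i)
	}
	go func() {
		wg.Wait()
		close(errc) // closed only after all senders are done, so no send can panic
	}()
	// Receiving from the closed, empty channel yields nil.
	return func() error { return <-errc }
}

// firstError runs n workers where only worker failAt returns an error.
func firstError(n, failAt int) error {
	wait := startWorkers(n, func(i int) error {
		if i == failAt {
			return fmt.Errorf("worker %d failed", i)
		}
		return nil
	})
	return wait()
}

func main() {
	fmt.Println(firstError(5, 3))
	fmt.Println(firstError(5, -1))
}
```

The wait function returns as soon as the first error is buffered, without waiting for the remaining workers, while the closer goroutine ensures the channel is released once they are all done.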

acud · 2020-01-06T05:04:30Z

 		p.logger.Error("retrieval.handleRetrieveRequest - peer delivery failed", "ref", msg.Addr, "err", err)
 		osp.LogFields(olog.Bool("delivered", false))
-		return
+		// continue in event loop


@zelig if a chunk is not found in this case an error will be returned, then the peer is dropped as a result (which is incorrect if the chunk cannot be found)

acud · 2020-01-06T05:06:23Z

 		p.logger.Error("netstore error putting chunk to localstore", "err", err)
 		if err == storage.ErrChunkInvalid {
-			p.Drop("invalid chunk in netstore put")
+			return fmt.Errorf("netstore error putting chunk to localstore: %s", err)


if the chunk is invalid then we should definitely drop the peer, but if netstore Put fails we should definitely consider propagating the error too. @janos? @zelig?
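
The distinction raised in the last two comments (drop the peer for an invalid chunk, but not merely because a chunk cannot be found) could be expressed with `errors.Is`, as in the sketch below. The sentinels `errChunkInvalid` and `errChunkNotFound`, and the helpers `wrap` and `shouldDrop`, are hypothetical; the real packages define their own error values, and whether the run loop should classify errors this way is an open design question in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the real storage errors.
var (
	errChunkInvalid  = errors.New("invalid chunk")
	errChunkNotFound = errors.New("chunk not found")
)

// wrap adds the name of the failing operation while keeping the cause.
func wrap(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

// shouldDrop reports whether a handler error is the peer's fault.
func shouldDrop(err error) bool {
	return errors.Is(err, errChunkInvalid)
}

func main() {
	fmt.Println(shouldDrop(wrap("netstore put", errChunkInvalid)))      // true
	fmt.Println(shouldDrop(wrap("retrieve request", errChunkNotFound))) // false
}
```

Note that `errors.Is` can only see the sentinel when the error is wrapped with `%w`; formatting it with `%s` flattens it into a string.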

acud · 2020-01-06T05:31:42Z

 		delete(p.openOffers, msg.Ruid)
 		p.mtx.Unlock()
-		p.Drop("error sending offered hashes")
+		return fmt.Errorf("error sending offered hashes: %w", err)


i chose to delete the entry for the sake of cleanliness, but indeed it is not necessary since the whole peer object should be disposed of when a send error occurs

acud · 2020-01-06T05:42:36Z

+		return r.requestSubsequentRange(ctx, p, provider, w, msg.LastIndex)
+	}
+
+	if errc == nil {


the only case where errc is nil is when the if statement in line 559 is entered. in that case, the if statement in line 576 would be entered too, and the subsequent range would be requested. as a result, it is not possible that this if block will ever be triggered so it is safe to remove it IMO

acud · 2020-01-06T05:46:15Z

-	select {
-	case <-done:
-	case <-time.After(5 * time.Second):
+


makes sense

acud

LGTM. Please just rebase to resolve conflicts


pradovic added 11 commits December 4, 2019 22:54

p2p: handleIncoming is now async

2684963

p2p, network, bzzeth, swap: split sync handshake and async run protoc…

62f4631

…ol calls.

p2p, retreival: graceful shutdown

3b98ab2

p2p, retreival: graceful shutdown

b10f406

p2p/protocol: fixed comments

5185cdb

p2p, network, swap: fixed test errors

2b1b975

p2p, network: cosmetic error change

c00c63a

p2p, network: return good err from async commands

a0aa962

p2p, network: graceful shutdown and bug/race fixes

44e92f6

p2p: remove debug logs

7835bca

bzzeth, network: fix bug and remove debug logs

45df85d

pradovic added the in progress label Dec 10, 2019

pradovic commented Dec 10, 2019


pradovic requested review from janos and zelig December 10, 2019 14:07

janos reviewed Dec 10, 2019


pradovic added 7 commits December 10, 2019 20:26

p2p, network: remove more bug logs

1f3726f

p2p, network: graceful shutdown in protocol run

c72b07c

bzzeth, network: errors.new to fmt.errof

266ada6

network: fix return error bug

a754cbe

network: fix return error bug

666c0a8

p2p: minor refactoring

f415525

p2p: unit tests for protocol.Receive

38ce6b8

zelig reviewed Dec 13, 2019


pradovic added 5 commits December 13, 2019 14:26

p2p: unit tests for protocol.Run

5e8028c

p2p: unit tests for protocol.Run minor fix

3f14018

p2p, bzzeth, network: prune error messages and comments

a6ee45f

p2p: refactor async calls in protocol run

17f62db

p2p: use single global channel in protocol run

797925c

pradovic removed the in progress label Dec 17, 2019

pradovic requested review from janos and zelig December 17, 2019 12:10

pradovic self-assigned this Dec 17, 2019

pradovic added 4 commits December 17, 2019 16:40

p2p: discard the message in stopping protocol

caccb65

fix go.mod

fb366d0

network: hive subPerMsh async

4e6f9bb

network: delete redundant assignment

fcba457

janos reviewed Dec 19, 2019


janos approved these changes Dec 19, 2019


pradovic added 4 commits December 19, 2019 20:15

network: stream errors more context

17280b3

p2p: disable pprof dump

a71f739

p2p: comment fix

5aec98a

p2p: remove unnecessary comments

68da3dd

pradovic mentioned this pull request Dec 20, 2019

Protocol msg pauser #2058

Merged

janos mentioned this pull request Dec 23, 2019

P2p validate accounting #2051

Merged

acud changed the title ~~p2p, network, bzzeth: p2p protocl handlers to async~~ p2p, network, bzzeth: p2p protocol handlers to async Jan 6, 2020

acud suggested changes Jan 6, 2020


network, p2p: fixed per Acuds comments

cf72953

pradovic requested a review from acud January 6, 2020 12:44

acud approved these changes Jan 9, 2020


pradovic added 3 commits January 9, 2020 12:32

Protocol msg pauser (#2058)

0040788

network, p2p: move msg pauser to protocols

merge with master

937d6e5

p2p: don't shadow error

913824e

pradovic merged commit 16db47b into master Jan 9, 2020

pradovic deleted the protocol-async-errors branch January 9, 2020 16:34

acud added this to the 0.5.5 milestone Jan 21, 2020
