Avoid closing connection on channel EOF #816
Conversation
While closing an individual channel (e.g., due to cancellation), there is a race where we may still receive messages before the channel has been deallocated. Handle that case gracefully to avoid closing down the entire underlying connection (which also holds all other active channels).
```nim
try:
  await channel.pushData(data)
  trace "pushed data to channel", m, channel, len = data.len
except LPStreamClosedError as exc:
```
Good catch, I'm surprised this didn't get noticed earlier.
What was the specific condition that triggered it? It should also be possible to test this scenario explicitly.
`--sync-light-client` on the Nimbus beacon node sometimes leading to spurious disconnects. There have been multiple fixes w.r.t. that behaviour; hoping that with this fix we are one step closer to being able to support that feature.

What is unique about `--sync-light-client` is that `light_client_manager` cancels useless request copies once it has obtained a satisfying response from one of the peers. This cancellation highlights a couple of those issues. The regular `request_manager`, on the other hand, always waits for all peers to reply, even when one has already provided the response.

If you know where to extend the test suite: you'd have to send a request, then cancel it (the cancellation would push EOF), and at the same time answer it from the other side (so that there is still a MsgIn). But then you also need to get lucky so that chronos schedules the MsgIn before the `channel.join` inside the cleanup logic. On Nimbus I currently get a repro about once per ~20 hours by spamming local testnets with `--sync-light-client`.
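For illustration, here is a minimal, self-contained sketch of that kind of race using only plain chronos primitives - no libp2p involved; the `AsyncQueue` merely stands in for the channel's incoming-message path:

```nim
# The "answer" and the cancellation are issued back to back; which one the
# pending read observes first depends on how chronos schedules them, which
# is exactly what makes the real-world repro so rare.
import chronos

proc demoRace() {.async.} =
  let incoming = newAsyncQueue[seq[byte]]()

  # "Request" side: waits for a response...
  let readFut = incoming.popFirst()

  # ...while the "remote" side answers at roughly the same time
  # as the request gets cancelled.
  incoming.addLastNoWait(@[byte 1, 2, 3])
  readFut.cancel()

  try:
    let data = await readFut
    echo "data won the race: ", data.len, " bytes"
  except CancelledError:
    echo "cancellation won the race; the answer stays orphaned in the queue"

waitFor demoRace()
```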
It should be possible to mock this in a similar manner to how it's done here - https://github.com/status-im/nim-libp2p/blob/unstable/tests/testmplex.nim#L379, though simulating the exact raise might be a challenge.
There are a couple of ways this could potentially be tested/reproduced:

- Use a `BufferStream` instead of a `Transport` for more fine-grained control
- Use chronos `stepsAsync`, which allows more fine-grained control of the event loop - I think this was added precisely to test cancellations and related issues
- Use both - `BufferStream` with `stepsAsync` (a rough sketch follows below)

Hope this helps!
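For what it's worth, a rough sketch of the combined approach - treat it as a starting point only; the `BufferStream.new`/`readOnce` signatures, the import path, and `stepsAsync` should all be double-checked against the nim-libp2p and chronos versions used in the test suite:

```nim
import chronos
import libp2p/stream/bufferstream

proc cancelVsDataRepro() {.async.} =
  let stream = BufferStream.new()

  # Pending read, standing in for the request waiting on its response.
  var buf = newSeq[byte](16)
  let readFut = stream.readOnce(addr buf[0], buf.len)

  # The "remote" answer arrives...
  await stream.pushData(@[byte 1, 2, 3])

  # ...and stepsAsync lets the test decide how many event-loop steps run
  # before the cancellation fires, pinning down the interleaving that the
  # real-world repro only hits by luck.
  await stepsAsync(1)
  readFut.cancel()

  try:
    echo "read completed with ", await readFut, " bytes"
  except CancelledError:
    echo "read was cancelled after data had already been pushed"

  await stream.close()

waitFor cancelVsDataRepro()
```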
> Regular request_manager on the other hand always waits for all peers to reply, even when one already provided the response.

Unrelated, but @etan-status this sounds like something to fix, no?
Yeah, the request manager overall needs fixing, primarily also in the way it is fed.

There is a loop that feeds a queue whenever a parent block is missing, but it does so at its own pace, irrespective of request manager progression. So it can happen that multiple requests for the same block are pushed into the request manager (while it is stuck processing an existing request). Once the request manager gets to the next request, it then requests all those queued-up copies of the same block multiple times in a single request, which could then time out / fail once more; and as all the retries by the feeder are already exhausted, it ultimately fails to obtain the block (until it is re-enqueued due to a MissingParent error, or it falls so far out of sync that the sync_manager kicks in once more).

The issue in this PR is related to cancellation randomly breaking the underlying connection. It looks much more stable now, so the request manager could now also be adjusted to use cancellation. The thing feeding the request manager seems to be the bigger culprit, though (but both issues need to be fixed eventually).
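For illustration only, the duplicate-request part of the feeder issue boils down to deduplicating what gets drained from the queue; `BlockRoot` and `drainUnique` below are made-up names, not actual nimbus-eth2 identifiers:

```nim
import std/[sets, hashes]
import chronos

type BlockRoot = array[32, byte]

proc drainUnique(queue: AsyncQueue[BlockRoot]): seq[BlockRoot] =
  ## Drain everything currently queued, keeping each root only once, so
  ## that a single request per missing block is issued even if the feeder
  ## enqueued the same root several times while the consumer was busy.
  var seen = initHashSet[BlockRoot]()
  while queue.len > 0:
    let root = queue.popFirstNoWait()
    if not seen.containsOrIncl(root):
      result.add root
```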
I wouldn't rely too heavily on cancellation yet; besides these small bugs in libp2p and elsewhere, we still have fundamental issues like status-im/nim-chronos#280
Ah, good to know! Yeah, I also noticed quite a few `asyncSpawn` in the libp2p code base for which cancellation (and also stopping behaviour, if that even exists) is rather difficult to analyze overall.

I guess fixing the `request_manager` feeder regarding retries could already improve gap-fill performance quite a bit though. And for the cancellations, once `--sync-light-client` is enabled in local CI testnets once more, it at least allows continuous progressive testing of cancellation behaviour.
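As a side note on the `asyncSpawn` point, here is a minimal chronos-only sketch (nothing libp2p-specific) of why it is hard to reason about under cancellation - the spawned future is detached, so cancelling the caller leaves it running:

```nim
import chronos

proc background() {.async.} =
  await sleepAsync(2.seconds)
  echo "background task ran to completion despite the caller being cancelled"

proc caller() {.async.} =
  asyncSpawn background()      # detached: no handle is kept around
  await sleepAsync(10.seconds)

when isMainModule:
  let fut = caller()
  waitFor fut.cancelAndWait()  # cancels caller(), but not background()
  waitFor sleepAsync(3.seconds)
```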
Regarding cancellation, there is also status-im/nim-chronos#334. This one is used in `eth2_network` in nimbus-eth2 and leads to a weird order of finally / except blocks, e.g.:
```nim
proc makeEth2Request(peer: Peer, protocolId: string, requestBytes: Bytes,
                     ResponseMsg: type,
                     timeout: Duration): Future[NetRes[ResponseMsg]]
                    {.async.} =
  let deadline = sleepAsync timeout
  let stream = awaitWithTimeout(peer.network.openStream(peer, protocolId),
                                deadline): return neterr StreamOpenTimeout
  try:
    # Send the request
    # Some clients don't want a length sent for empty requests
    # So don't send anything on empty requests
    if requestBytes.len > 0:
      await stream.writeChunk(none ResponseCode, requestBytes)
    # Half-close the stream to mark the end of the request - if this is not
    # done, the other peer might never send us the response.
    await stream.close()

    nbc_reqresp_messages_sent.inc(1, [shortProtocolId(protocolId)])

    # Read the response
    return await readResponse(stream, peer, ResponseMsg, timeout)
  finally:
    await stream.closeWithEOF()
```
The `readResponse` then does:
```nim
let nextFut = conn.readResponseChunk(peer, MsgType)
if not await nextFut.withTimeout(timeout):
  return neterr(ReadResponseTimeout)
return nextFut.read()
```
When the `nextFut.withTimeout` is cancelled, due to status-im/nim-chronos#334 it could mean that the `await stream.closeWithEOF()` is triggered while `nextFut` is still in the process of getting cancelled.
And `closeWithEOF` has comments suggesting that, if there is still an ongoing read, it may lead to nasty bugs:
```nim
proc closeWithEOF*(s: LPStream): Future[void] {.async, public.} =
  ## Close the stream and wait for EOF - use this with half-closed streams where
  ## an EOF is expected to arrive from the other end.
  ##
  ## Note - this should only be used when there has been an in-protocol
  ## notification that no more data will arrive and that the only thing left
  ## for the other end to do is to close the stream gracefully.
  ##
  ## In particular, it must not be used when there is another concurrent read
  ## ongoing (which may be the case during cancellations)!
  ##
```
But anyhow, the way it is currently used in practice seems alright for now.
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           unstable     #816      +/-   ##
============================================
+ Coverage     83.57%   83.73%   +0.15%
============================================
  Files            81       82       +1
  Lines         14696    14914     +218
============================================
+ Hits          12282    12488     +206
- Misses         2414     2426      +12
```
Currently running a perpetual test on nimbus-eth2 to confirm that this fix solves the reliability issues with https://github.com/status-im/nimbus-eth2/commits/dev/etan/z
There is another instance of potentially overly aggressive connection closing happening here:
I may be mistaken, but I don't think these errors can appear here. I don't see any reason not to switch to the more generic error, though.
This run shows an example of
Oh right, they are coming from the
OK. Updated to handle all four
Is this also alright for
Ideally the
Following up here, I ran about 300 testnets on nimbus-eth2 using CI based on the new logic here, and they all completed without hiccups. Would be great to get this merged, so I can integrate it into nimbus-eth2. Hoping that spurious disconnects are all gone now :-)
libp2p issues related to operation cancellations have been addressed in vacp2p/nim-libp2p#816 This means we can once more enable `--sync-light-client` in CI, without having to deal with spurious CI failures due to the cancellation issues.