
feat(peerEvents): add a peerEvent Identified #843

Merged
lchenut merged 9 commits into master from revert-autorelay-sleep on Jun 21, 2024
Conversation

@lchenut (Collaborator) commented Jan 17, 2023

This PR was originally created to remove a temporary sleepAsync, a patch against race conditions that were, in theory, fixed by #807.
As #807 didn't solve the problem (explained in detail in this comment), the goal of this PR changed to adding a peerEvent callback so that the sleepAsync can be removed safely.

As a side note, I also removed some duplicate code. Services were being stopped twice whenever a switch stopped, generating warnings each time it happened.

@Menduist Menduist marked this pull request as draft January 17, 2023 16:00
@kaiserd (Collaborator) commented May 2, 2024

Why is this still in draft state?

@lchenut (Collaborator, Author) commented Jun 13, 2024

After a bit of research on this PR, here's why it exists:
This PR removes await sleepAsync(200.millis) after await self.peerAvailable.wait(). The problem is that removing it creates a race condition. The asyncEvent peerAvailable is fired when a new peer joins. Once it fires, the loop (in the proc innerRun) restarts from the beginning and runs until these two lines of code:

    var connectedPeers = switch.connectedPeers(Direction.Out)    
    connectedPeers.keepItIf(RelayV2HopCodec in switch.peerStore[ProtoBook][it] and    

We get the newly connected peers and keep them only if RelayV2HopCodec is in the switch.peerStore. The issue here is that the peerStore may not be up to date at this point, hence the sleepAsync(200.millis).

The way it's currently handled is certainly not a perfect solution, but if we want to remove this sleep, we either have to change when the peerStore is filled or when the handlePeerJoined callback is called. Either way, after looking at it, this would change the call order of different asynchronous procedures and could cause other race conditions if done badly.

With that in mind, I'm not sure what to do about this PR... What do you think @diegomrsantos @kaiserd?

Edits:

  • Upgrade flow refactoring #807 should have solved this issue, but apparently didn't.
  • Another solution is to add another callback handlePeerJoinedAndIdentified (or another name) which would be called after the peerStore is filled.
  • The alternative is to keep the sleepAsync(200.millis). Even though it's not perfect, if the identify takes longer than that, it'll fail in the same way. (The race is sketched below.)
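
To make the race concrete, here is a condensed sketch of the loop (hypothetical body: innerRun, peerAvailable, RelayV2HopCodec and ProtoBook come from the snippet above; everything else is assumed and is not the actual autorelayservice.nim code):

    import sequtils
    import chronos
    # libp2p imports omitted; Switch, AutorelayService, RelayV2HopCodec
    # and ProtoBook are assumed to be in scope.

    proc innerRun(self: AutorelayService, switch: Switch) {.async.} =
      while true:
        # peerAvailable fires as soon as a peer joins...
        await self.peerAvailable.wait()
        self.peerAvailable.clear()
        # ...but identify may not have filled the peer store yet, so this
        # lookup can miss a relay that just connected. The 200 ms sleep
        # papered over that gap; an event fired after identify closes it.
        var connectedPeers = switch.connectedPeers(Direction.Out)
        connectedPeers.keepItIf(RelayV2HopCodec in switch.peerStore[ProtoBook][it])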

@lchenut lchenut self-assigned this Jun 13, 2024
@lchenut lchenut changed the title from "Revert sleepAsync" to "feat(peerEvents): add a peerEvent Identified" Jun 14, 2024
@lchenut lchenut marked this pull request as ready for review June 14, 2024 11:28

 PeerEvent* = object
   case kind*: PeerEventKind
-  of PeerEventKind.Joined:
+  of PeerEventKind.Joined, PeerEventKind.Identified:
Collaborator:

why do we need this?

Collaborator (Author):

For the autorelay service, we don't need to know whether we're the initiator or not. But for other protocols using this new event, it could be interesting to know whether the identified peer is Incoming or Outgoing. Identified and Joined are, in the end, really similar; they're just fired at two different times.

Collaborator:

It might be fine as this case is minor, but it's better to add functionality only when needed.

Collaborator (Author):

TBH, on second thought, we could have used initiator for autorelay, since we use the direction here: https://github.com/vacp2p/nim-libp2p/blob/master/libp2p/services/autorelayservice.nim#L106

So clearly, it might be needed.
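
For context, a minimal sketch of what the variant object looks like with this change (layout assumed from the diff above, not copied verbatim from connmanager.nim):

    type
      PeerEventKind* {.pure.} = enum
        Left, Joined, Identified

      PeerEvent* = object
        case kind*: PeerEventKind
        of PeerEventKind.Joined, PeerEventKind.Identified:
          # true when we dialed the peer (Outgoing),
          # false when we accepted the connection (Incoming)
          initiator*: bool
        else:
          discard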

@@ -195,6 +195,10 @@ proc internalConnect(
   try:
     self.connManager.storeMuxer(muxed)
     await self.peerStore.identify(muxed)
+    asyncSpawn self.connManager.triggerPeerEvents(
Collaborator:

why asyncSpawn?

Collaborator (Author):

Because we do not know what the callback will be; it could possibly be an infinite loop, for example. asyncSpawn is safer against this kind of behavior.

Collaborator:

Is it? Running an infinite loop in the background doesn't sound like a good alternative. How about we use await with a timeout?

Collaborator (Author):

What I mean by that is that the peerEvent is public; we do not control what a libp2p user could do here.
Another point to consider is that this triggerPeerEvents(Identified) is called in an important part of the code. Take a case where the callback does a lot of computation (say 200 ms): every time we dial a peer or accept a connection, 200 ms would be spent waiting for the callback to finish before anything else can be done.
And as a last point, if you look at how the other events are triggered, they also use asyncSpawn (https://github.com/vacp2p/nim-libp2p/blob/master/libp2p/connmanager.nim#L310-L317)
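
A toy contrast of the two options being discussed (hypothetical userCallback; plain chronos semantics, not code from this PR):

    import chronos

    proc userCallback() {.async.} =
      await sleepAsync(200.millis)  # stand-in for a slow user-provided handler

    proc dialAwaiting() {.async.} =
      await userCallback()       # the dial path waits for the handler to return

    proc dialSpawning() {.async.} =
      asyncSpawn userCallback()  # the future is detached: the dial continues,
                                 # but a failure in the handler is no longer
                                 # reported at this call site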

Collaborator:

I'm still not convinced. AFAIK, chronos is single-threaded, and once this task starts executing it won't yield control until it stops running. It's not executed in parallel.
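
To illustrate this point (toy code): chronos only switches tasks at suspension points, so a CPU-bound callback blocks the event loop whether or not it was spawned.

    import chronos

    proc cpuBound() {.async.} =
      var acc = 0
      for i in 0 ..< 100_000_000:
        acc += i  # no await inside the loop: nothing else can run meanwhile
      echo acc

    # An async proc runs synchronously up to its first suspension point
    # (here: to completion), so this call still monopolizes the thread.
    asyncSpawn cpuBound()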

Collaborator (Author):

In the 200 milliseconds scenario, it'll slow down every dial by that amount. I think it's really preferable to hand the newly created connection to the user directly and await these events after the dial (or accept) has completed.

Collaborator (Author):

After a private conversation, it's indeed better not to use asyncSpawn here 👍
It's now changed to await instead.
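
The call then roughly takes this shape on the dial path (a sketch; argument names assumed from the diffs above, and the accept path in switch.nim presumably passes initiator: false):

    await self.connManager.triggerPeerEvents(
      muxed.connection.peerId,
      PeerEvent(kind: PeerEventKind.Identified, initiator: true))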

@@ -204,6 +204,9 @@ proc upgrader(switch: Switch, trans: Transport, conn: Connection) {.async.} =
   let muxed = await trans.upgrade(conn, Opt.none(PeerId))
   switch.connManager.storeMuxer(muxed)
   await switch.peerStore.identify(muxed)
+  asyncSpawn switch.connManager.triggerPeerEvents(
Collaborator:

why asyncSpawn?

Collaborator (Author):

Same as above.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.54%. Comparing base (02c96fc) to head (4e2d963).
Report is 25 commits behind head on master.


@@           Coverage Diff            @@
##           master     #843    +/-   ##
========================================
  Coverage   84.53%   84.54%            
========================================
  Files          91       92     +1     
  Lines       15517    16407   +890     
========================================
+ Hits        13118    13872   +754     
- Misses       2399     2535   +136     
Files                                   Coverage Δ
libp2p/connmanager.nim                  92.75% <ø> (-0.18%) ⬇️
libp2p/dialer.nim                       91.25% <100.00%> (+0.53%) ⬆️
libp2p/services/autorelayservice.nim    95.14% <100.00%> (+0.24%) ⬆️
libp2p/switch.nim                       91.48% <100.00%> (-0.23%) ⬇️

... and 74 files with indirect coverage changes

@lchenut lchenut merged commit 100f318 into master Jun 21, 2024
8 checks passed
@lchenut lchenut deleted the revert-autorelay-sleep branch June 21, 2024 11:07
Labels: none
Project status: done
4 participants