Add threshold flexibility to start a level #34

Open
nkeywal opened this issue Dec 28, 2018 · 11 comments


@nkeywal
Contributor

nkeywal commented Dec 28, 2018

Today we start a level only when it has all the signatures.
We could have something smarter when the missing signature comes from a node that should have communicated long ago.

Technically, we could have a module that identifies suspicious nodes. If the missing sigs come from a suspicious node, we start the level. A node would become suspicious if it hasn't sent its signature after a given delay, or if it hasn't responded when we use TCP or QUIC to communicate.

@nikkolasg
Collaborator

Some comments (after New Year's Eve, so bear with me :D):

  • I'm wondering whether detecting suspicious nodes is really possible with our current model of Handel, since (1) a Handel node is supposed to contact nodes at a given level in a random order and (2) it does not necessarily contact all of them, if it finishes first.
  • This issue raises the subject of "asynchronous" level starting, since a level should be startable at any time. Therefore, the timeout strategy should not depend on the periodic update loop, as is the case right now.

@nkeywal
Contributor Author

nkeywal commented Jan 8, 2019

The idea is:
Let's say that you are the first node (id 1).
At level 1 you're supposed to receive 2's signature but you don't: 2 is down.
At level 2 you receive 3&4 => You now have 134
At level 3 you receive 5678 => You now have 1345678
The question here is: you don't have all the signatures for level 4, since you're still waiting for the signature from 2. But you've been waiting for 3 levels now. So should you wait for level 4 to time out, or should you just start now?
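
The ID ranges in this example can be reproduced with a small Go sketch. It assumes the binary-tree partition implied by the numbers above (level 1 is one peer, level 2 is two peers, level 3 is four); `levelPeers` is a made-up helper, not Handel's actual partitioner.

```go
package main

import "fmt"

// levelPeers returns the 1-indexed IDs a node expects signatures from at a
// given level, assuming a binary-tree partition: at level l the node's peers
// are the sibling block of size 2^(l-1).
func levelPeers(id, level int) []int {
	idx := id - 1                              // 0-indexed position of the node
	size := 1 << (level - 1)                   // number of peers at this level
	start := ((idx >> (level - 1)) ^ 1) * size // first index of the sibling block
	peers := make([]int, 0, size)
	for i := start; i < start+size; i++ {
		peers = append(peers, i+1) // back to 1-indexed IDs
	}
	return peers
}

func main() {
	// Node 1, as in the example: level 1 -> [2], level 2 -> [3 4], level 3 -> [5 6 7 8].
	for level := 1; level <= 3; level++ {
		fmt.Println(level, levelPeers(1, level))
	}
}
```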

@nikkolasg
Collaborator

Ok - it goes back to the "threshold per level" idea where we could pass up quickly to another level if we have "enough" signatures.
We could do that by having Handel expose a RegisterPostProcessor or RegisterActor that takes a

type PostProcessor interface {
    OnVerifiedSignature(sp *sigPair)
}

interface. So a TimeoutStrategy can register itself, look at all verified packets, and potentially launch a packet once the "threshold per level" is attained (or under other conditions we have not thought of yet).
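
To make the proposal concrete, here is a rough Go sketch of how such a registration could be wired. Only `PostProcessor`, `OnVerifiedSignature` and `RegisterPostProcessor` come from the comment above; `sigPair`'s fields, `registry`, `thresholdStrategy` and the `startLevel` callback are assumptions made up for the example.

```go
package main

import "fmt"

// sigPair stands in for Handel's internal type; these fields are assumptions.
type sigPair struct {
	level  int // level the signature belongs to
	origin int // ID of the peer that sent it
}

// PostProcessor mirrors the interface proposed above.
type PostProcessor interface {
	OnVerifiedSignature(sp *sigPair)
}

// registry is a hypothetical holder for registered post-processors.
type registry struct{ procs []PostProcessor }

func (r *registry) RegisterPostProcessor(p PostProcessor) { r.procs = append(r.procs, p) }

// notify forwards a verified signature to every registered post-processor.
func (r *registry) notify(sp *sigPair) {
	for _, p := range r.procs {
		p.OnVerifiedSignature(sp)
	}
}

// thresholdStrategy counts distinct verified signatures per level and calls
// startLevel(l+1) once level l has reached its configured count.
type thresholdStrategy struct {
	needed     map[int]int          // level -> signatures considered "enough"
	seen       map[int]map[int]bool // level -> set of origins already counted
	started    map[int]bool         // levels already started by this strategy
	startLevel func(level int)
}

func newThresholdStrategy(needed map[int]int, start func(int)) *thresholdStrategy {
	return &thresholdStrategy{
		needed:     needed,
		seen:       make(map[int]map[int]bool),
		started:    make(map[int]bool),
		startLevel: start,
	}
}

func (t *thresholdStrategy) OnVerifiedSignature(sp *sigPair) {
	need, ok := t.needed[sp.level]
	if !ok {
		return // no threshold configured for this level
	}
	if t.seen[sp.level] == nil {
		t.seen[sp.level] = make(map[int]bool)
	}
	t.seen[sp.level][sp.origin] = true
	next := sp.level + 1
	if !t.started[next] && len(t.seen[sp.level]) >= need {
		t.started[next] = true
		t.startLevel(next)
	}
}

func main() {
	r := &registry{}
	s := newThresholdStrategy(map[int]int{2: 2}, func(l int) {
		fmt.Println("starting level", l)
	})
	r.RegisterPostProcessor(s)
	r.notify(&sigPair{level: 2, origin: 3})
	r.notify(&sigPair{level: 2, origin: 4}) // threshold reached: level 3 starts
}
```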

@nkeywal
Contributor Author

nkeywal commented Jan 8, 2019

Ok - it goes back to the "threshold per level" idea where we could pass up quickly to another level if we have "enough" signatures.

Not exactly, because we're trying to use a technical heuristic (a node that hasn't communicated for 1 second is likely dead) rather than a functional one (on average we will have 80% of the nodes, so let's not wait longer if we have enough sigs). The first could be used even if you would ideally want to have all sigs at all levels.

That said, we don't have to implement this now, especially ( :-) ) if we don't think it makes a difference. I will try it in the simulator and report back.

@nkeywal
Contributor Author

nkeywal commented Jan 9, 2019

It's possible to simplify it with:

  • if at a level we have received all the signatures specific to this level from our peers, we can start the next level, i.e. we don't care about levels n-2 and n-3, only n-1. This way, the fact that we are still waiting for a sig at level n-2 does not prevent us from starting level n if n-1 is OK.
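
A small Go sketch contrasting this simplified rule with a strict "everything below must be complete" rule. The strict variant is my reading of the current behavior described at the top of the thread, and `levelState`, `canStart` and `canStartStrict` are made-up names.

```go
package main

import "fmt"

// levelState is a hypothetical record of how many peer signatures a level
// expects and how many it has received so far.
type levelState struct {
	expected int
	received int
}

// canStart implements the simplified rule: level n may start as soon as
// level n-1 is complete; older levels are ignored.
func canStart(n int, levels map[int]levelState) bool {
	prev, ok := levels[n-1]
	if !ok {
		return false
	}
	return prev.received >= prev.expected
}

// canStartStrict requires every level below n to be complete.
func canStartStrict(n int, levels map[int]levelState) bool {
	for l := 1; l < n; l++ {
		s, ok := levels[l]
		if !ok || s.received < s.expected {
			return false
		}
	}
	return true
}

func main() {
	// Node 1 from the earlier example: node 2 is down, so level 1 never
	// completes, while levels 2 and 3 are complete.
	levels := map[int]levelState{
		1: {expected: 1, received: 0},
		2: {expected: 2, received: 2},
		3: {expected: 4, received: 4},
	}
	fmt.Println(canStart(4, levels))       // true: only level 3 matters
	fmt.Println(canStartStrict(4, levels)) // false: level 1 is still missing node 2
}
```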

In practice, in the simulator it makes no difference for most scenarios.
The reason is simple:

  • to reach the threshold we need to reach the last level. We can't reach a threshold of 51% if we're not at the last level.
  • if there are dead nodes, the last level will be triggered only when the timeout is reached: the mechanism above does not work because there are always dead nodes at the last level. It accelerates only the first levels.
  • but as we rely on a timeout for the last level, the fact that we're faster on the previous levels is not really important
  • things are a little bit different if we don't wait for all the nodes: if, instead of waiting for all the sigs, we're happy with 80%, then we do start the levels sooner, because we bypass the timeout and we don't wait for the dead nodes anyway. Note that in this case we don't rely on the fact that we're sending a lot of messages when we complete a level: we don't consider the level as completed, so we rely only on the periodic update.

WITH this new feature (all sigs)
round=0, GSFSignature, nodes=2048, threshold=1351, pairing=3ms, level timeout=100ms, period=10ms, acceleratedCallsCount=10, dead nodes=100, network=NetworkLatencyByDistance
bytes sent: min: 19207, max:24003, avg:20986
bytes rcvd: min: 12198, max:21453, avg:16490
msg sent: min: 255, max:336, avg:288
msg rcvd: min: 184, max:305, avg:239
done at: min: 1113, max:1278, avg:1172

WITHOUT
bytes sent: min: 19109, max:23292, avg:20483
bytes rcvd: min: 12736, max:21236, avg:15983
msg sent: min: 253, max:330, avg:279
msg rcvd: min: 192, max:300, avg:230
done at: min: 1113, max:1277, avg:1172

WITH (80% of the sigs)
bytes sent: min: 15456, max:22413, avg:18433
bytes rcvd: min: 10220, max:17473, avg:13972
msg sent: min: 211, max:293, avg:249
msg rcvd: min: 150, max:243, avg:200
done at: min: 578, max:812, avg:689

With different settings:
round=0, GSFSignature, nodes=4096, threshold=2703, pairing=3ms, level timeout=200ms, period=20ms, acceleratedCallsCount=10, dead nodes=100, network=NetworkLatencyByDistance
WITH
bytes sent: min: 26484, max:38797, avg:29118
bytes rcvd: min: 19480, max:33269, avg:25751
msg sent: min: 300, max:481, avg:345
msg rcvd: min: 252, max:414, avg:317
done at: min: 2413, max:2672, avg:2483

WITHOUT
bytes sent: min: 26337, max:31661, avg:27996
bytes rcvd: min: 19071, max:33244, avg:24624
msg sent: min: 297, max:383, avg:328
msg rcvd: min: 247, max:425, avg:299
done at: min: 2412, max:2674, avg:2483

WITH (80% of the sigs)
bytes sent: min: 15153, max:22722, avg:19470
bytes rcvd: min: 10820, max:23487, avg:16178
msg sent: min: 183, max:254, avg:221
msg rcvd: min: 151, max:242, avg:193
done at: min: 677, max:1015, avg:819

@nkeywal
Contributor Author

nkeywal commented Jan 9, 2019

In other words, it's interesting. Need to think about it more.

@nikkolasg
Collaborator

The 80% thing makes a huge difference, it's surprising... do you have "failed" nodes in your setup ?? Otherwise it's weird that it makes such a difference, since all signatures are supposed to be complete for a given level no ?

On your first point, "if sig is complete for a level, then start the next one": @bkolad saw last week that Handel took more time with n != 2^x than with a power of two, i.e. 24 nodes took more time than 32 nodes. I fixed this by introducing exactly this rule, and the problem went away. The reason is that when n is not a power of two, the last nodes have missing levels, and they were only relying on timeout to go to the next, but they were waiting a long time since there could be 2 or 3 levels to wait for.
=> the first rule is already there and apparently needed.

@nkeywal
Contributor Author

nkeywal commented Jan 9, 2019

The 80% thing makes a huge difference, it's surprising... do you have "failed" nodes in your setup ??

Of course; if not, this setting is more or less useless.
See the desc above:
round=0, GSFSignature, nodes=4096, threshold=2703, pairing=3ms, level timeout=200ms, period=20ms, acceleratedCallsCount=10, dead nodes=100, network=NetworkLatencyByDistance

=> the first rule is already there and apparently needed.

What I measured in the simulator is a little bit different (but there is an overlap between the two):
We don't start the level if we have 80% of the signatures we would like to send. We start a level if our peers at the previous level (n-1, not n-2) sent us 80% of the sigs. It makes a difference at the first levels when our peers are dead. It's more generic.

@nkeywal
Contributor Author

nkeywal commented Jan 9, 2019

they were only relying on timeout to go to the next

That's more or less the current logic: very aggressive timeouts. You can see in the results above that the improvement is much larger with a timeout of 200ms than with a timeout of 100ms.
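
For reference, the relative gain can be computed from the average "done at" values posted above (WITHOUT vs. 80% of the sigs). A quick Go sketch; `reduction` is just a helper written for this calculation. Note that the two runs also differ in node count and period, not only in timeout.

```go
package main

import "fmt"

// reduction returns the relative drop from before to after, in percent.
func reduction(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Average "done at" values copied from the simulator output above.
	fmt.Printf("nodes=2048, level timeout=100ms: %.1f%%\n", reduction(1172, 689))
	fmt.Printf("nodes=4096, level timeout=200ms: %.1f%%\n", reduction(2483, 819))
}
```

That is roughly a 41% reduction with the 100ms timeout and a 67% reduction with the 200ms timeout.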

@nikkolasg
Collaborator

nikkolasg commented Jan 9, 2019

We start a level if our peers at the previous level (n-1, not n-2) sent us 80% of the sigs. It makes a difference at the first levels when our peers are dead. It's more generic.

OK good, a more static version of this is implemented right now in handel.go, but with the full size. So as a "plan", we should add a ThresholdLevel field (or other name) to the config, and when level i already has ThresholdLevel% of its signatures, we start level i+1. Does that sound good?
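
A possible shape for that config field, sketched in Go. Only the name `ThresholdLevel` comes from the comment above; its type, the 0-100 range, the `Config` struct shown here and the `levelReady` helper are assumptions, not Handel's actual config.

```go
package main

import "fmt"

// Config is a stripped-down, hypothetical stand-in for Handel's config.
type Config struct {
	// ThresholdLevel is the percentage (0-100) of a level's signatures
	// that must be present before the next level is started.
	// 100 reproduces the "full size" behavior.
	ThresholdLevel int
}

// levelReady reports whether a level holding `received` out of `size`
// signatures has reached ThresholdLevel percent. Integer arithmetic avoids
// floating-point rounding at the boundary.
func (c Config) levelReady(received, size int) bool {
	return received*100 >= c.ThresholdLevel*size
}

func main() {
	c := Config{ThresholdLevel: 80}
	fmt.Println(c.levelReady(3, 4)) // false: 75% < 80%
	fmt.Println(c.levelReady(7, 8)) // true: 87.5% >= 80%

	full := Config{ThresholdLevel: 100}
	fmt.Println(full.levelReady(7, 8)) // false: the full level is required
}
```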

@nkeywal
Contributor Author

nkeywal commented Jan 14, 2019

As discussed, let's try with a timeout of 50ms short term.
