mailbox device nondeterminism after kernel restart #3471
warner added a commit that referenced this issue on Jul 19, 2021:
The mailbox device tracked the highest inbound message number and ack for each peer, to de-duplicate repeated messages, so it could reduce the amount of kernel activity. Each call to `deliverInbound` would return a boolean to indicate whether the messages/ack were new, and thus whether the kernel needed to be cycled. However, the device was holding this tracking data in non-durable state, so if/when the kernel was restarted, the state would be lost. A duplicate message/ack arriving in the restarted process would trigger kernel activity that would not have run in the original process. These extra cranks caused divergence between validators when one of them was restarted and the client sent a duplicate message (such as the pre-emptive `ack` all clients send at startup). The extra crank does not get very far, because vattp does its own deduplication, so the divergence was only visible in the slog. But when #3442 is implemented, even a single extra crank will flag the validator as out of consensus. The fix is to remove the mailbox device's dedup code and rely upon vattp for this function. The test was also updated to match, and a new test (comparing two parallel kernels, one restarted, one not) was added. closes #3471
Describe the bug

@dckc, @mhofman, and @warner have been analyzing the slogfiles collected from the agorictest-16 testnet (the "stress test" phase of the incentivized testnet). We found a divergence between two slogfiles, in which one validator observed an additional `deliverInboundAck` delivery (to vat-vattp) that was not in the record of a second validator. As a result, the global `crankNum` values were forevermore off-by-one in that validator, and the vattp `deliveryNum` was also off-by-one.

I think this is a consequence of the mailbox device keeping the `highestInboundAck` state in RAM, rather than in durable storage.

SwingSet Mailbox Device Communication Protocol
The mailbox device follows the Waterken approach to durable delivery of messages: send forever until acked. When the mailbox has a connection-oriented link to some remote, we don't need to deliver a message or an ack more than once per connection, but we also can't afford to not send either when a new connection is established. So the process is: whenever a new connection is established, re-send every not-yet-acked outbound message along with the current inbound ack number, and keep re-sending until the peer acknowledges receipt.

This approach leads to the minimum traffic that still meets the goal of delivering and acking messages as quickly as possible. If promptness weren't the goal, we could reduce the traffic slightly by not preemptively sending the ack number (and just waiting for the next outbound message to include it). The idea was that ack numbers are pretty small, so sending an extra copy at process startup (or TCP connection establishment) isn't expensive. A more drastic change would be to omit the preemptive message send too: don't send anything until the next outbound message is generated. This would run the risk of stalling messages for arbitrary amounts of time, which would be sad.
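As a rough sketch of the sender side of this retransmit-until-acked scheme (all names here, such as `makeSender`, are hypothetical illustrations, not agoric-sdk APIs):

```javascript
// Hypothetical sketch of a Waterken-style sender: retain outbound messages
// until acked, and re-send everything (plus the current ack) on every new
// connection. Not the actual mailbox code.
function makeSender() {
  const outbox = []; // [seqNum, body] pairs, retained until acked
  let nextSeqNum = 1;
  let highestInboundSeqNum = 0; // becomes our outgoing ackNum

  return {
    send(body) {
      outbox.push([nextSeqNum, body]);
      nextSeqNum += 1;
    },
    // the peer acked everything up to `ackNum`: retire those messages
    handleAck(ackNum) {
      while (outbox.length && outbox[0][0] <= ackNum) {
        outbox.shift();
      }
    },
    // record the highest inbound seqNum we've seen, to ack later
    noteInbound(seqNum) {
      if (seqNum > highestInboundSeqNum) {
        highestInboundSeqNum = seqNum;
      }
    },
    // on every new connection (or process restart), preemptively re-send
    // all unacked messages plus the current ack number
    payloadForNewConnection() {
      return { messages: [...outbox], ackNum: highestInboundSeqNum };
    },
  };
}
```

Because `payloadForNewConnection()` runs on every reconnect, a restarted client naturally emits a duplicate `{ messages: [], ackNum }` payload even when it has nothing new to say.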
Mailbox device deduplication
This preemptive/precautionary re-transmission of messages+acks feels necessary but wasteful. I attempted to mitigate the overhead a little bit by having the mailbox device perform an initial deduplication of messages:
agoric-sdk/packages/SwingSet/src/devices/mailbox-src.js, lines 32–41 at 467eb9f
and the ack:
agoric-sdk/packages/SwingSet/src/devices/mailbox-src.js, lines 42–48 at 467eb9f
If the new message and/or ack passes this test, the mailbox device will enqueue a `deliverInboundMessages()` and/or `deliverInboundAck()` delivery to vat-vattp.

The mailbox state that remembers the inbound message and ack numbers is not durable: it is kept in a simple `Map`:

agoric-sdk/packages/SwingSet/src/devices/mailbox-src.js, lines 6–9 at 467eb9f
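A minimal sketch of this volatile dedup pattern (illustrative names, not the actual mailbox-src.js code) shows why it misbehaves across a restart:

```javascript
// Sketch of the device's in-RAM deduplication (hypothetical code, not the
// real mailbox-src.js). The Maps live only in process memory, so a kernel
// restart silently resets them.
function makeMailboxDedup() {
  const highestInboundDelivered = new Map(); // peer -> highest seqNum seen
  const highestInboundAck = new Map(); // peer -> highest ackNum seen

  return {
    // returns true iff this ack is new, i.e. vattp should see it
    ackIsNew(peer, ack) {
      const prev = highestInboundAck.get(peer);
      if (prev !== undefined && ack <= prev) {
        return false; // duplicate: suppress the delivery
      }
      highestInboundAck.set(peer, ack);
      return true;
    },
    // returns only the messages with seqNums not yet delivered
    freshMessages(peer, messages) {
      const prev = highestInboundDelivered.get(peer) ?? 0;
      const fresh = messages.filter(([seqNum]) => seqNum > prev);
      for (const [seqNum] of fresh) {
        highestInboundDelivered.set(peer, seqNum);
      }
      return fresh;
    },
  };
}
```

Because the Maps are rebuilt empty after a restart, a restarted kernel answers "new" for a duplicate ack that a continuously-running kernel would suppress.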
The VatTP vat also performs its own deduplication check, precisely because the mailbox state (which remembers these sequence numbers) is not durable: when the receiving swingset is restarted, the next arrival of a message from the outside world might not be deduplicated, so VatTP must always be prepared to dedup things itself. The comms vat receives a strictly-sequenced, no-duplicates stream of messages (and therefore doesn't even get to see the vattp sequence numbers).
solo-to-chain protocol

The TCP approach only really makes sense for solo-to-solo connections, which run over TCP of some sort. For the solo-to-chain pathway, we don't have connections; instead we have a series of signed messages (cosmos transactions) that contain the same `{ messages: [], ackNum }` that the TCP payloads would have included.

However, the delivery code was written with TCP in mind, so when the solo machine comes up, it sends all its unacked messages (and the `inboundAckNum`) before doing anything else, just in case the last time it sent these, the message didn't get through.

(TODO number 1: find a way to change the solo-to-chain delivery protocol to read the chain's outbound ack num first, from an RPC query that doesn't require a transaction, instead of making an extra transaction per client restart. Or, switch to IBC, which does something very much like this already.)

So when the client restarts, we'll get an extra signed transaction, which will probably contain zero messages, plus an `inboundAckNum`.

chain-side Mailbox device deduplication is not deterministic
Now imagine two validators which are started at time T=0, and a client which starts up some time later and sends a bunch of messages. The responses come back to the client, and the last thing the client does is ack the last response, causing a `{ messages: [], ackNum: 5 }` at time T=10.

Now say the client is restarted at T=20, so (as described above) the client will re-send this same message. Our cosmos signed-transaction stream will contain one txn at T=10 and a second at T=20, both with no mailbox messages and with an `ackNum` of "5".

Let's say the "left" validator remains running the whole time. When the T=10 txn arrives, the Mailbox device will see `ackNum: 5`, compare it to its `highestInboundAck` (which would probably be "4"), conclude that vat-vattp needs to see this, and use `SO(inboundHandler).deliverInboundAck(peer, ack)` to enqueue a message to vattp. Then, later, when the T=20 txn arrives, the Mailbox device still remembers `highestInboundAck=5`, and ignores the message, not enqueueing anything.

But let's say the "right" validator was restarted at T=15. The T=10 txn was already delivered to vattp. But when the T=20 txn arrives, `highestInboundAck` is empty (because the `Map` that held it went away when the validator was restarted). So the mailbox device will enqueue a `deliverInboundAck` to vat-vattp.

This leads to a divergence between the two validators: the "right" validator will see an extra run-queue entry for the extra `deliverInboundAck` message. This will cause all subsequent `crankNum` values to be higher by one, as well as causing the vat-vattp `deliveryNum` values to be higher by one.

Since vat-vattp does its own deduplication, the `deliverInboundAck` delivery should not create any syscalls, and the effects will not propagate to other vats. Our current scheduling policy is strictly FIFO, and our current block-filling policy is effectively "drain the run-queue entirely" (there's a 1000-crank limit, but we haven't observed it being reached yet), so the extra delivery is performed at the beginning of the block and doesn't cause any other deliveries to be pushed out of the block. So the divergence is "hidden": it doesn't affect SwingSet's behavior enough for the chain to notice. We only discovered it by comparing slogfiles between validators that experienced different patterns of restarts, in particular a restart that occurred between two duplicate client-driven transactions.

The Fix
We have two approaches we could use.

The first, and simplest, is to just remove the deduplication code from the Mailbox device. The code should simply sort the inbound messages and deliver all of them to vattp, then deliver the ackNum to vattp as well. The two `Map`s should be deleted. The code will always return `didSomething=true`, which means the host will always be told to run the kernel after the potential delivery. This causes more work to be done in the case of duplicate deliveries, but the swingset work will be a deterministic function of the Mailbox input, which is important.

The second approach would be to track the Mailbox device's state durably. Devices get a `getDeviceState`/`setDeviceState` authority for this purpose, and other devices (e.g. the timer device) use it properly, but the Mailbox device is older and does not. We'd need to store the full contents of `highestInboundDelivered` and `highestInboundAck` in durable storage, instead of the Maps. At device startup, we'd call `getDeviceState()` and populate an in-memory map. After every inbound message, we'd update the map with the new sequence/ack numbers, convert the Map back into something that `marshal` could serialize, and then write it all back with `setDeviceState()`.

(This is messier than what the comms vat must do, because vats have `syscall.vatStoreSet(key, value)`, which lets them store many separate keys. Devices are much more primitive. We could make the second approach easier and faster by introducing `vatStore` syscalls for devices. Then the device could record each seqnum/acknum pair in a separate key, instead of needing to smoosh everything together into the single `setDeviceState()` body.)

I'm pretty sure we'll do the first approach: deleting code is always more reliable than adding it.
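Under the first approach, the device's inbound path reduces to something like the following (a sketch under assumed names, not the actual patch; only `SO(inboundHandler).deliverInboundAck` and `deliverInboundMessages` appear in the real code):

```javascript
// Sketch of a dedup-free deliverInbound (illustrative, not the real patch):
// sort inbound messages, pass everything to vattp, and always report that
// the kernel should run. `SO` and `inboundHandler` are stand-ins for the
// device's actual authorities.
function deliverInbound(peer, messages, ack, SO, inboundHandler) {
  // sort by seqNum so vattp sees messages in order
  const sorted = [...messages].sort((a, b) => a[0] - b[0]);
  if (sorted.length) {
    SO(inboundHandler).deliverInboundMessages(peer, sorted);
  }
  SO(inboundHandler).deliverInboundAck(peer, ack);
  // no Maps are consulted, so the behavior is a pure function of the input:
  // a restarted kernel enqueues exactly what a long-running one would
  return true; // didSomething: always cycle the kernel
}
```

Duplicates now always reach vattp, which dedups them with durable vat state, so both validators in the scenario above enqueue the same deliveries in the same order regardless of restarts.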