Add Holocene design doc #72
Conversation
protocol/strict-derivation.md
Outdated
Writing this, I realize that this mechanism could even be used to encode large spans of empty
batches, as long as the sequencer is creating unsafe blocks that follow the same L1 origins as the
auto-derivation would for gaps. However, this would need to be investigated more deeply.
We will need to very carefully specify the rules for L1 origin selection when inserting a deposit-only block prior to the sequencing window elapsing. There's quite a lot of potential corner cases around that.
Yes this is also a concern I have, and is probably the most underspecified bit. Random thoughts around that:
- A simple default rule could be to generate blocks in a way to maintain a steady L1/L2 block ratio, e.g. bumping the L1 origin selection every 6 blocks in the case of mainnet.
- Edge case: we were already very near the sequencer drift limit, and need to select a new origin faster.
- Edge case: L1 missed a slot, and the L1 origin cannot advance as expected.
- So maybe a better rule is to first eagerly advance the L1 origin as quickly as possible, and only if a newer L1 origin isn't available, keep it. This solves for missed slots and will implicitly and automatically maintain a good L1/L2 block ratio. We then just need a clear definition of "L1 origin is/isn't available".
- To mimic sequencer behavior, and to avoid being hit by shallow L1 reorgs, we could add an in-protocol L1 validation depth: eagerly advance the L1 origin while maintaining a timestamp distance of this validation depth times the L1 block time.
I'll add this to the design doc as an open design question and proposal for Steady Batch Derivation.
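The eager-advancement rule from the bullets above could be sketched roughly as follows. All type and function names here are hypothetical illustrations, not the actual op-node API:

```go
package main

import "fmt"

// l1Block is a minimal stand-in for an L1 block reference (hypothetical type,
// not the actual op-node types).
type l1Block struct {
	Number uint64
	Time   uint64
}

// nextOrigin sketches the eager-advancement rule: advance the L1 origin to the
// newest known L1 block that is at least confDepth*l1BlockTime behind the L1
// head (the in-protocol validation depth) and not newer than the next L2
// block's timestamp. Signature and logic are illustrative only.
func nextOrigin(current l1Block, known []l1Block, l1Head l1Block, confDepth, l1BlockTime, nextL2Time uint64) l1Block {
	best := current
	for _, b := range known {
		if b.Number != best.Number+1 {
			continue // only advance in order, one block at a time
		}
		// respect the in-protocol validation depth against the L1 head
		if b.Time+confDepth*l1BlockTime > l1Head.Time {
			break
		}
		// an L2 block's timestamp must not be older than its origin's
		if b.Time > nextL2Time {
			break
		}
		best = b
	}
	return best
}

func main() {
	cur := l1Block{Number: 10, Time: 100}
	known := []l1Block{{Number: 11, Time: 112}, {Number: 12, Time: 124}}
	head := l1Block{Number: 14, Time: 148}
	fmt.Println(nextOrigin(cur, known, head, 2, 12, 200))
}
```

Note how a missed L1 slot is handled implicitly: the next origin simply isn't in `known` yet, so the current origin is kept, and the ratio recovers automatically once newer origins become available.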
protocol/strict-derivation.md
Outdated
invalid batches will be derived as deposit-only blocks. So in case of a reorg, the batcher should
e.g. wait on the sequencer it is connected to until it has derived all blocks from L1 in order to
only start batching new blocks on top of the possibly derived deposit-only chain segment.
It seems very problematic that an L1 reorg might cause deposit-only blocks to be included. That would trigger a reorg of the entire unsafe L2 chain and break full nodes. If the batcher is faulty, we don't need to give it a "second chance" to submit a valid batch, but are there cases where an L1 reorg could cause previously submitted batches to become invalid, such that these new rules make the resulting reorg larger than the one caused by the L1 reorg changing block origins alone?
Thinking through this a bit, I'm having trouble deciding if this is or isn't a problem.
If an L1 reorg occurs, it only affects the L2 if a batch is wiped out.
If a batch is wiped out, then the batcher's nonce is reverted. Per Seb's proposal "a fixed nonce to block-range assignment", the batcher wants to resubmit on the same range as before, so there isn't a risk of the batcher posting a batch-gap (which would create empty blocks).
🤔 🤔 🤔 but I'm not sure that that's the only way your edge case would present
This is a valid concern, and has been flagged by @tynes in his original design doc's risk section.
The current rules of Steady Batch Derivation can indeed cause a previously valid submitted batch to become invalid, and then immediately be derived as deposit-only blocks, causing a long L2 unsafe reorg. The same batcher tx may still be included on L1. However, given that span batches are only forward-invalidated with Holocene, I think the L2 unsafe reorg would be limited to the L2 section that references the reorged-out L1 section. That said, further batcher txs that have already landed on L1 would cause more deposit-only blocks to be derived.
I think this is the tradeoff of Steady Batch Derivation.
One solution I can think of that may alleviate this problem is to reference the last L1 origin in the channel metadata (in a new channel format), and then drop the channel directly in the channel bank, before even decoding any batches from it that would otherwise at some point be derived as deposit-only blocks. This way, the batcher could get a "second chance" to submit a channel that includes the correct reorged-to L1 origin chain. This is very similar to how span batches contain the last L1 origin as `l1_origin_check` in their prefix, just moved one layer up to the channel container. With such a new channel format, the L1 origin check could arguably be dropped from span batches. Having the channel, rather than any span or singular batch, contain such an L1 origin check has the advantage that the derivation pipeline wouldn't too eagerly derive deposit-only blocks, and could recover from L1 reorgs. I think this solution would also still maintain the nice properties of Strict Batch Ordering: there's only one staging channel, and we don't buffer out-of-order frames or batches.
If Proofs and Interop experts can confirm that such a solution would still lead to the sought-after improvements for Proofs and Interop, respectively, we could consider including it as part of Holocene.
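A minimal sketch of that channel-level check, assuming a hypothetical new channel format whose header carries a truncated L1 origin hash (all names invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// channelHeader is a hypothetical new channel format header carrying the last
// L1 origin, analogous to the span batch l1_origin_check prefix, but one
// layer up in the channel container.
type channelHeader struct {
	L1OriginCheck [20]byte // truncated hash of the last referenced L1 origin
}

// checkChannelOrigin sketches dropping a channel in the channel bank before
// decoding any batches from it, if its recorded L1 origin doesn't match the
// canonical chain. This is an illustration, not the actual derivation API.
func checkChannelOrigin(h channelHeader, canonicalOriginHash [20]byte) error {
	if h.L1OriginCheck != canonicalOriginHash {
		// drop the channel; the batcher gets a "second chance" to resubmit
		return errors.New("channel references reorged-out L1 origin: dropping")
	}
	return nil
}

func main() {
	var canon [20]byte
	canon[0] = 0xaa
	var stale [20]byte
	stale[0] = 0xbb
	fmt.Println(checkChannelOrigin(channelHeader{L1OriginCheck: canon}, canon))
	fmt.Println(checkChannelOrigin(channelHeader{L1OriginCheck: stale}, canon))
}
```

Because the check happens in the channel bank, a mismatching channel is discarded whole, without ever decoding batches that would later be derived as deposit-only blocks.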
Added this to the design doc in a slightly modified way.
Instead of doing this channel-level L1 origin check, just throw away span batches with a failing L1 origin check, give them a second chance to be replaced, and don't generate deposit-only blocks for the whole span batch range.
To fully benefit from this, decode only the prefix first, then do the check.
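A sketch of the prefix-first check, using a simplified stand-in for the span batch prefix (the real wire encoding differs; this only illustrates validating `l1_origin_check` before decoding the potentially large payload):

```go
package main

import (
	"bytes"
	"fmt"
)

// spanBatchPrefix is a simplified stand-in for the span batch prefix, which
// contains (among other fields) an l1_origin_check: a truncation of the hash
// of the last referenced L1 origin block. The field layout here is
// illustrative, not the exact wire encoding from the specs.
type spanBatchPrefix struct {
	L1OriginNum   uint64
	L1OriginCheck []byte // first 20 bytes of the origin block hash
}

// prefixCheck validates the l1_origin_check against the canonical L1 chain
// before the span batch payload is decoded at all.
func prefixCheck(p spanBatchPrefix, canonicalHash []byte) bool {
	return bytes.Equal(p.L1OriginCheck, canonicalHash[:20])
}

func main() {
	canon := make([]byte, 32)
	canon[0] = 0x01
	p := spanBatchPrefix{L1OriginNum: 42, L1OriginCheck: canon[:20]}
	fmt.Println(prefixCheck(p, canon))
}
```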
protocol/strict-derivation.md
Outdated
I see two options on how to handle a non-empty batch queue at this point:
- Option 1: Drop future batches, continue resolving undecided batches, if any are left, and apply
new Holocene rules.
- Option 2: The Batch Queue will just start applying Holocene rules from this moment onwards.
Option 3: When the L1 origin reaches the Holocene activation block, discard all batches in the batch queue. Could also be applied to the Channel Bank to give us a nice clean starting point.
op-batcher would have to be aware of that rule and consider any blocks it submitted in channels that didn't close prior to the Holocene activation block as needing to be resubmitted.
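Option 3's clean slate could be modeled as below. This is a toy model of the pipeline buffers, not the actual op-node stage structure:

```go
package main

import "fmt"

// pipelineState is a toy model of the derivation pipeline buffers discussed
// here (Frame Queue, Channel Bank, Batch Queue); the real op-node stages
// differ, this only illustrates Option 3's "clean slate".
type pipelineState struct {
	Frames   []string
	Channels []string
	Batches  []string
}

// resetAtHolocene discards all buffered frames, channels and batches when the
// L1 origin reaches the Holocene activation block (block numbers are a
// simplifying assumption; activation is really time-based).
func resetAtHolocene(s *pipelineState, l1Origin, holoceneActivation uint64) {
	if l1Origin >= holoceneActivation {
		s.Frames, s.Channels, s.Batches = nil, nil, nil
	}
}

func main() {
	s := pipelineState{
		Frames:   []string{"f0"},
		Channels: []string{"c0"},
		Batches:  []string{"b0"},
	}
	resetAtHolocene(&s, 100, 100)
	fmt.Println(len(s.Frames), len(s.Channels), len(s.Batches))
}
```

The appeal is that derivation starts from a well-defined empty state at activation, at the cost of the batcher having to resubmit blocks from unclosed channels.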
Thanks for flagging, I also thought about this option, but deemed it too drastic. Hearing that you support it makes me consider it! I like its simplicity.
We just need to be careful that the batcher doesn't end up in a very unlikely situation where we're batching with calldata and have a single block that needs to be sent over two calldata frame txs that span across the Holocene activation.
Even if we did find ourselves in that situation, discarding that in-progress submission would work right? It would need to be retried after the threshold.
Or, we could follow Adrian's suggestion and drop all batches, but do it some time before Holocene activation, holding a ban on batches until Holocene passes. This would allow the unlikely big block to resolve prior to passing the threshold.
Yes, I think it is reasonable to implement some sort of safety behavior for the batcher close to the Holocene activation. We didn't do a good job of adding special hardfork activation logic to the batcher in the past, but operating experience has shown that it may be worthwhile to add some.
Made this the new and preferred Option 1.
protocol/strict-derivation.md
Outdated
Another open question is how to handle span batches that come from pre-Holocene channels.
I propose that, for simplicity of implementation, if they are found to be valid as a span batch, to
just apply the new Partial Span Batch Validity rules even though those span batches were derived
from pre-Holocene L1 blocks.
This would only apply to channels that span the Holocene activation, right? Anything fully pre-Holocene would use the old rules and anything fully post-Holocene would use the new rules. If so, I agree, though it wouldn't be an issue if we discarded any incomplete channels at Holocene activation.
It would technically also apply to full pre-Holocene span batches, if the channel that contained this span batch was already closed before Holocene activation, but included on L1 only after Holocene activation.
Since correct batcher implementations don't violate this property anyway, I just want to pick the rule with the lowest implementation lift.
protocol/strict-derivation.md
Outdated
## Out of order frames

There's an open design space around how to handle some scenarios for missing or out of order frames:
I think the current cause of missing and out-of-order frames is batcher restarts, is that correct? The batcher submits some frames of a channel, but then crashes. Upon restart, a new channel is created.
Yes, batcher restarts can in theory cause the batcher to not close an open channel and then submit a new one.
Then there's also the scenario where the batcher cannot get a tx on-chain, requeues the blocks, and then attempts to send them again in a new channel. Depending on the specifics of the batcher implementation, this can also lead to out-of-order frames. All of this is quite unlikely and almost never happens, but we need to take extra care to harden the batcher against such behavior in the future.
protocol/strict-derivation.md
Outdated
invalid batches will be derived as deposit-only blocks. So in case of a reorg, the batcher should | ||
e.g. wait on the sequencer it is connected to until it has derived all blocks from L1 in order to | ||
only start batching new blocks on top of the possibly derived deposit-only chain segment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through this a bit, I'm having trouble deciding if this is or isn't a problem.
If an L1 reorg occurs, it only affects the L2 if a batch is wiped out.
If a batch is wiped out, then the batcher's nonce is reverted. Per Seb's proposal "a fixed nonce to block-range assignment", the batcher wants to resubmit on the same range as before, so there isn't a risk of the batcher posting a batch-gap (which would create empty blocks).
🤔 🤔 🤔 but I'm not sure that that's the only way your edge case would present
protocol/strict-derivation.md
Outdated
Note that the new strict ordering rules of the batch queue will always lead to an empty batch queue
when the origin of the derivation pipeline progresses to the next L1 block (what about
`undecided` though?).
I am not familiar with `undecided`, and I see another reference to it below. Can you briefly explain how these behave?
The `undecided` status is still not 💯 clear to me, which is why I added those questions in brackets. What I understand from the implementation (files `batches.go` and `batch_queue.go`) is that if there's missing L1 or L2 data (e.g. due to temporarily broken L1 or L2 connections?), the batch is re-queued for later checking and no batch is processed at that point. I think we will just retain this behavior.
The difference to `future` is that a future batch is already determined to be out of order and to lie in the future. The `future` case (a gap) will be treated differently in Holocene: the gap will immediately be derived as deposit-only blocks. Notably, the future batch can then only be valid if it was built on top of a gap of empty batches.
Added this to the design doc.
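The distinction described above might be sketched as follows; the status constants and return strings are illustrative, not the actual values in `batches.go`:

```go
package main

import "fmt"

// batchValidity mirrors the batch statuses discussed above; the values are
// illustrative, not the exact constants from op-node's batches.go.
type batchValidity int

const (
	batchDrop      batchValidity = iota
	batchAccept                  // valid and next in order
	batchUndecided               // missing L1/L2 data; re-check later
	batchFuture                  // a gap; under Holocene it auto-derives
)

// handleBatch sketches the Holocene treatment: undecided batches are retained
// for a later re-check (as pre-Holocene), while a future batch triggers
// immediate deposit-only derivation for the gap. Return values are for
// illustration only.
func handleBatch(v batchValidity) string {
	switch v {
	case batchUndecided:
		return "requeue"
	case batchFuture:
		return "derive-deposit-only-gap"
	case batchAccept:
		return "process"
	default:
		return "drop"
	}
}

func main() {
	fmt.Println(handleBatch(batchUndecided), handleBatch(batchFuture))
}
```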
protocol/strict-derivation.md
Outdated
I see two options on how to handle a non-empty batch queue at this point:
- Option 1: Drop future batches, continue resolving undecided batches, if any are left, and apply
new Holocene rules.
- Option 2: The Batch Queue will just start applying Holocene rules from this moment onwards.
One thing to consider here is that we have generally assumed that the batcher will not be malicious and submit batches that are expensive for derivation to process. This assumption works well for a single chain with a single sequencer, but in the world of interop, the finality of one chain becomes tied to the finality of another. A batcher of a remote chain could thus influence the cost of proof generation for the local chain, so we need to start thinking about untrusted batchers in the context of interop. There will be some reputation at play, since the interop set will be managed by governance, but we should generally strive to minimize the amount of reputation required for security. In other words, we don't need to ship the absolute most denial-of-service-proof thing in the world as the first iteration.
* spellcheck
* define forwards/backwards invalidation
* define principle of fastest derivation
* define "foreign frame"
* Update protocol/strict-derivation.md

Co-authored-by: Sebastian Stammler <stammler.s@gmail.com>
protocol/strict-derivation.md
Outdated
# Partial Span Batch Validity

## Problem Statement + Context
One aspect of the problem that is worth mentioning, although solved by the same partial-validity idea, is that interop fault proofs require a block to be processed optimistically, and the proof might later abort on cross-L2 interop dependencies. After aborting, the alternative chain continuation should be as straightforward and minimal as possible, to avoid looping interop dependency checks. Falling back to a deposit-only block when a batch is invalid would serve this well.
protocol/strict-derivation.md
Outdated
- Option 1: When the L1 origin reaches the Holocene activation block, discard all frames in the
Frame Queue, channels in the Channel Bank, and batches in the Batch Queue.
This gives us a nice clean starting point.
👍 clean slate during upgrade is nice. We should add some warnings to the batch-submitter about this though, to prevent operational mistakes.
protocol/strict-derivation.md
Outdated
The batcher would have to be aware of those rules and consider any blocks it submitted in channels
that didn't close prior to the Holocene activation block as needing to be resubmitted.
We also need to make sure that there aren't queued-up batcher txs that are included afterwards. These would cause gaps that would auto-derive and then cause an L2 reorg. We should add special behavior to the batcher: e.g., don't parallelize in the hour before Holocene, stay on blobs, etc.
protocol/strict-derivation.md
Outdated
The design space and some proposed solutions will be discussed, together with practical implications
for batcher implementations that have to satisfy the stricter rules.

# Partial Span Batch Validity
TODO: span batch prefix invalidation doesn't lead to auto-derivation; instead, give a new span batch with a correct L1 origin check a chance. This protects against L1-reorg-induced deep L2 reorgs.
Further public discussion in Discord made us reconsider the idea of deriving invalid batches as empty batches. The new proposal is to simply drop invalid and future batches, and instead only derive invalid payloads at the engine stage as deposit-only payloads. An invalid payload wouldn't trigger the generation of future empty batches, but would instead just forward-invalidate any remaining batches and the origin channel.
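The revised proposal could be sketched like this; the types and names are invented for illustration, not the actual engine-stage API:

```go
package main

import "fmt"

// derivationState is a toy model of the engine-stage state: the remaining
// batches of the channel currently being derived.
type derivationState struct {
	PendingBatches []string
}

// onInvalidPayload sketches the revised proposal: when the engine reports an
// invalid payload, replace just that payload with a deposit-only payload and
// forward-invalidate the remaining batches of the origin channel. No future
// empty batches are generated.
func onInvalidPayload(s *derivationState, depositTxs []string) []string {
	// forward-invalidate: drop the rest of the channel's batches
	s.PendingBatches = nil
	// the invalid payload is replaced by a deposit-only payload
	return depositTxs
}

func main() {
	s := derivationState{PendingBatches: []string{"b1", "b2"}}
	payload := onInvalidPayload(&s, []string{"deposit0"})
	fmt.Println(payload, len(s.PendingBatches))
}
```

Compared to the earlier Steady Batch Derivation idea, only the single invalid payload becomes deposit-only; dropped batches and channels get no empty-block continuation.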
I think this design doc is now in a "good-enough" state and that we can merge it, so we can move towards the actual development work for Holocene.
@BlocksOnAChain
@sebastianst Got it, makes sense to me. I was just reviewing the current state, before the last edits and after we chatted on Discord. Fully agree.
Description
Holocene design doc, to align on open questions.
Additional context
After getting alignment, the specs can be completed. They are currently in draft at ethereum-optimism/specs#357
This has been reviewed in a public design review session on 2024-09-11, with a public recording.