Add Holocene design doc #72

Merged
merged 8 commits into from
Oct 4, 2024

Member
@sebastianst sebastianst commented Sep 4, 2024

Description

Holocene design doc, to align on open questions.

Additional context

After getting alignment, the specs can be completed. They are currently in draft at ethereum-optimism/specs#357

This has been reviewed in a public design review session on 2024-09-11, with a public recording.

protocol/strict-derivation.md
Comment on lines 143 to 145
Writing this, I realize that this mechanism could even be used to encode large spans of empty
batches, as long as the sequencer is creating unsafe blocks that follow the same L1 origins as the
auto-derivation would for gaps. However, this would need to be investigated more deeply.
Contributor

We will need to very carefully specify the rules for L1 origin selection when inserting a deposit-only block prior to the sequencing window elapsing. There are quite a lot of potential corner cases around that.

Member Author

@sebastianst sebastianst Sep 9, 2024

Yes this is also a concern I have, and is probably the most underspecified bit. Random thoughts around that:

  • A simple default rule could be to generate blocks in a way to maintain a steady L1/L2 block ratio, e.g. bumping the L1 origin selection every 6 blocks in the case of mainnet.
  • Edge case: we were already very near the sequencer drift limit, and need to select a new origin faster.
  • Edge case: L1 missed a slot, and the L1 origin cannot advance as expected.
  • So maybe a better rule is to first eagerly advance the L1 origin as quickly as possible, and only if a newer L1 origin isn't available, keep it. This solves for missed slots and will implicitly and automatically maintain a good L1/L2 block ratio. We then just need a clear definition of "L1 origin is/isn't available".
    • To mimic sequencer behavior and avoid being hit by shallow L1 reorgs, we could add an in-protocol L1 validation depth: eagerly advance the L1 origin while maintaining a timestamp distance of this validation depth times the L1 block time.

I'll add this to the design doc as an open design question and proposal for Steady Batch Derivation.
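
A minimal sketch of the "eagerly advance, otherwise keep" rule from the list above, assuming that "available" means the next L1 block exists and is at least `validation_depth * l1_block_time` seconds older than the L2 block being generated. The names `L1Block` and `select_l1_origin` and the depth of 4 are hypothetical and not taken from the design doc. The sequencer drift edge case is not handled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class L1Block:
    number: int
    timestamp: int  # seconds


def select_l1_origin(
    current: L1Block,
    next_l1: Optional[L1Block],
    l2_timestamp: int,
    validation_depth: int = 4,  # hypothetical value, not from the thread
    l1_block_time: int = 12,
) -> L1Block:
    """Advance to the next L1 origin as soon as it is "available",
    otherwise keep the current one."""
    if next_l1 is None:
        # No newer L1 block is known (e.g. a missed slot): keep the origin.
        return current
    min_distance = validation_depth * l1_block_time
    if next_l1.timestamp + min_distance <= l2_timestamp:
        return next_l1
    return current


# Missed-slot example: block 101 arrives 24s after block 100 instead of 12s.
origin = L1Block(100, 1000)
candidate = L1Block(101, 1024)
origins = []
for l2_ts in range(1060, 1080, 2):  # L2 block time of 2s
    nxt = candidate if origin.number == 100 else None
    origin = select_l1_origin(origin, nxt, l2_ts)
    origins.append(origin.number)
```

With this rule the origin simply stays put while no sufficiently old L1 block exists, so a missed slot needs no special casing.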

Comment on lines 171 to 173
invalid batches will be derived as deposit-only blocks. So in case of a reorg, the batcher should
e.g. wait on the sequencer it is connected to until it has derived all blocks from L1 in order to
only start batching new blocks on top of the possibly derived deposit-only chain segment.
Contributor

It seems very problematic that an L1 reorg might cause deposit-only blocks to be included. That would trigger a reorg of the entire unsafe L2 chain and break full nodes. If the batcher is faulty, we don't need to give it a "second chance" to submit a valid batch. But are there cases where an L1 reorg could invalidate previously submitted batches, so that these new rules make the L2 reorg larger than it would be from the changed block origins alone?

@axelKingsley axelKingsley Sep 5, 2024

Thinking through this a bit, I'm having trouble deciding if this is or isn't a problem.

If an L1 reorg occurs, it only affects the L2 if a batch is wiped out.

If a batch is wiped out, then the batcher's nonce is reverted. Per Seb's proposal "a fixed nonce to block-range assignment", the batcher wants to resubmit on the same range as before, so there isn't a risk of the batcher posting a batch-gap (which would create empty blocks).

🤔 🤔 🤔 but I'm not sure that that's the only way your edge case would present

Member Author

This is a valid concern, and has been flagged by @tynes in his original design doc's risk section.

The current rules of Steady Batch Derivation can indeed cause a previously valid, already submitted batch to become invalid and then immediately be derived as deposit-only blocks, causing a long L2 unsafe reorg. The same batcher tx may still be included on L1. However, given that span batches are only forward-invalidated with Holocene, I think the L2 unsafe reorg would be limited to the L2 section that references the reorged-out L1 section. That said, further batcher txs that have already landed on L1 would cause more deposit-only blocks to be derived.

I think this is the tradeoff of Steady Batch Derivation.

One solution that may alleviate this problem is to reference the last L1 origin in the channel metadata (in a new channel format). If that check fails, the channel is dropped directly in the channel bank, before any batches are decoded from it that would otherwise at some point be derived as deposit-only blocks. This way, the batcher gets a "second chance" to submit a channel that includes the correct reorged-to L1 origin chain. This is very similar to how span batches contain the last L1 origin as l1_origin_check in their prefix, just moved one layer up to the channel container. With such a new channel format, the L1 origin check could arguably be dropped from span batches. Having the channel, rather than any span or singular batch, contain the L1 origin check has the advantage that the derivation pipeline (DP) wouldn't derive deposit-only blocks too eagerly and could recover from L1 reorgs. I think this solution would also still maintain the nice properties of Strict Batch Ordering: there's only one staging channel, and we don't buffer out-of-order frames or batches.

If Proofs and Interop experts can confirm that such a solution would still lead to the sought after improvements for Proofs and Interop, resp., we could consider including it as part of Holocene.

Member Author

@sebastianst sebastianst Sep 9, 2024

Added this to the design doc in a slightly modified way.

Member Author

Instead of doing this channel-level L1 origin check, just throw away span batches whose L1 origin check fails, give the batcher a second chance to replace them, and don't generate deposit-only blocks for the whole span batch range.

Member Author

To fully benefit from this, decode only the prefix first and run the check before decoding the rest of the span batch.
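
The prefix-first idea could be sketched as follows. This is an illustration under assumptions, not the op-node implementation: it assumes the prefix carries the L1 origin number of the last block plus a truncated hash (`l1_origin_check`), and the names `SpanBatchPrefix`, `prefix_origin_matches`, `handle_span_batch` and the 20-byte length are hypothetical. An unknown L1 origin is treated as a mismatch here for simplicity; a real pipeline might instead wait for more L1 data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

CHECK_LEN = 20  # assumed length of the truncated hash in the prefix


@dataclass(frozen=True)
class SpanBatchPrefix:
    l1_origin_num: int      # L1 origin number of the last block in the span
    l1_origin_check: bytes  # truncated hash of that L1 origin


def prefix_origin_matches(prefix: SpanBatchPrefix,
                          canonical_hashes: Dict[int, bytes]) -> bool:
    """Compare the prefix check against the canonical L1 chain."""
    canonical = canonical_hashes.get(prefix.l1_origin_num)
    if canonical is None:
        return False
    return (len(prefix.l1_origin_check) == CHECK_LEN
            and canonical[:CHECK_LEN] == prefix.l1_origin_check)


def handle_span_batch(prefix: SpanBatchPrefix,
                      canonical_hashes: Dict[int, bytes],
                      decode_payload: Callable[[], object]) -> Optional[object]:
    """Decode the (expensive) payload only if the prefix check passes.

    On a mismatch the span batch is dropped without deriving any
    deposit-only blocks, so a replacement batch can still be submitted.
    """
    if not prefix_origin_matches(prefix, canonical_hashes):
        return None
    return decode_payload()
```

Passing the payload decoder as a callable keeps the check cheap: after an L1 reorg, a stale span batch is rejected before any of its blocks are decoded.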

I see two options on how to handle a non-empty batch queue at this point:
- Option 1: Drop future batches, continue resolving undecided batches, if any are left, and apply
new Holocene rules.
- Option 2: The Batch Queue will just start applying Holocene rules from this moment onwards. This will then
Contributor

Option 3: When the L1 origin reaches the Holocene activation block, discard all batches in the batch queue. Could also be applied to the Channel Bank to give us a nice clean starting point.

op-batcher would have to be aware of that rule and consider any blocks it submitted in channels that didn't close prior to the holocene activation block as needing to be resubmitted.
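
A rough sketch of this clean-slate rule, assuming activation is detected by comparing the L1 origin timestamp against a configured activation time. `DerivationStages`, `on_l1_origin_advance`, and the plain lists standing in for the Channel Bank and Batch Queue are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DerivationStages:
    channel_bank: List[str] = field(default_factory=list)
    batch_queue: List[str] = field(default_factory=list)
    holocene_active: bool = False


def on_l1_origin_advance(stages: DerivationStages,
                         origin_timestamp: int,
                         holocene_time: int) -> bool:
    """Return True exactly once, when the L1 origin first reaches the
    activation block; all buffered pre-Holocene data is discarded then."""
    if not stages.holocene_active and origin_timestamp >= holocene_time:
        stages.channel_bank.clear()
        stages.batch_queue.clear()
        stages.holocene_active = True
        return True
    return False
```

Anything the batcher had in flight in those buffers is gone after the call returns True, which is why it would need to resubmit blocks from channels that had not closed before activation.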

Member Author

@sebastianst sebastianst Sep 5, 2024

Thanks for flagging, I also thought about this option, but deemed it too drastic. Hearing that you support it makes me consider it! I like its simplicity.

We just need to be careful that the batcher doesn't end up in a very unlikely situation where we're batching with calldata and have a single block that needs to be sent over two calldata frame txs that span across the Holocene activation.

(commenter not shown)

Even if we did find ourselves in that situation, discarding that in-progress submission would work, right? It would need to be retried after the threshold.

Or, we could follow Adrian's suggestion and drop all batches, but do it some time before holocene activation, holding a ban on batches until holocene passes. This would allow for the unlikely big-block to resolve prior to passing the threshold.

Member Author

Yes I think it is reasonable to implement some sort of safety behavior for the batcher close to the Holocene activation. We didn't do a good job of adding special hardfork activation logic to the batcher in the past but I think past batcher operating experience has shown that it may be worthwhile adding some.

Member Author

Made this the new and preferred Option 1.

Comment on lines 237 to 240
Another open question is how to handle span batches that come from pre-Holocene channels.
I propose that, for simplicity of implementation, if they are found to be valid as a span batch, to
just apply the new Partial Span Batch Validity rules even though those span batches were derived
from pre-Holocene L1 blocks.
Contributor

This would only apply to channels that span the Holocene activation, right? Anything fully pre-Holocene would use the old rules and anything fully post-Holocene would use the new rules. If so, I agree, though it wouldn't be an issue if we discarded any incomplete channels at Holocene activation.

Member Author

It would technically also apply to fully pre-Holocene span batches, if the channel that contained the span batch was already closed before Holocene activation but included on L1 only after Holocene activation.

Since correct batcher implementations don't violate this property anyways, I just want to pick the rule with the lowest implementation lift.

@BlocksOnAChain BlocksOnAChain added the documentation and specs labels Sep 5, 2024

## Out of order frames

There's an open design space around how to handle some scenarios for missing or out of order frames:

(commenter not shown)

I think missing and out-of-order frames are currently caused by batcher restarts, is that correct? The batcher submits some frames of a channel, but then crashes. Upon restart, a new channel is created.

Member Author

Yes, batcher restarts can in theory cause the batcher to not close an open channel and then submit a new one.
Then there's also the scenario where a batcher cannot get a tx on-chain and requeues the blocks and then attempts sending them again in a new channel. Depending on the specifics of the batcher implementation, this can also lead to out of order frames. All quite unlikely and almost never happening, but we need to take extra care to harden the batcher against such behavior in the future.


Note that the new strict ordering rules of the batch queue will always lead to an empty batch queue
when the origin of the derivation pipeline progresses to the next L1 block (what about
`undecided` though?).

(commenter not shown)

I am not familiar with undecided, and I see another reference to it below. Can you explain how these behave briefly?

Member Author

@sebastianst sebastianst Sep 9, 2024

The undecided status is still not 💯 clear to me, which is why I added those questions in brackets. What I understand from the implementation (files batches.go and batch_queue.go) is that if there's missing L1 or L2 data (e.g. due to temporarily broken L1 or L2 connections?), the batch is re-queued for later checking and no batch is processed at this point. I think we will just retain this behavior.

The difference to future is that a future batch is already determined to be out of order and to lie in the future. The future case (a gap) will be treated differently in Holocene: the gap will immediately be derived as deposit-only blocks. It is noteworthy that the future batch can then only be valid if it was built on top of a gap of empty batches.

Added this to the design doc.
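
To make the distinction between the two statuses concrete, here is a heavily simplified classification sketch. It is not the logic in batches.go: the real checks cover many more conditions, and `classify_batch` and its parameters are hypothetical.

```python
from enum import Enum


class BatchValidity(Enum):
    ACCEPT = "accept"
    DROP = "drop"
    UNDECIDED = "undecided"  # not enough L1/L2 data yet: re-queue, retry later
    FUTURE = "future"        # known to be out of order: lies ahead of the safe head


def classify_batch(batch_timestamp: int,
                   next_timestamp: int,
                   l1_data_available: bool) -> BatchValidity:
    """Classify a batch relative to the next expected L2 timestamp."""
    if not l1_data_available:
        # Nothing can be decided without the data; this is a transient state.
        return BatchValidity.UNDECIDED
    if batch_timestamp > next_timestamp:
        # There is a gap between the safe head and this batch.
        return BatchValidity.FUTURE
    if batch_timestamp < next_timestamp:
        return BatchValidity.DROP
    return BatchValidity.ACCEPT
```

The point of the sketch is that `UNDECIDED` says nothing about the batch itself, only about missing inputs, whereas `FUTURE` is a definite statement about ordering.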

Contributor

tynes commented Sep 5, 2024

One thing to consider here is that we have generally assumed that the batcher will not be malicious and submit batches that are expensive for derivation to process. This assumption can work well for a single chain that has a single sequencer, but in the world of interop the finality of a single chain becomes tied to the finality of another chain. This means that it could be possible for a batcher of a remote chain to influence the cost of proof generation for the local chain. This means that we need to start thinking about untrusted batchers in the context of interop.

There will be some reputation at play since the interop set will be managed by governance, but we should generally strive to minimize the amount of reputation required for security. The reputation at play just means that we don't need to ship the absolute most denial-of-service-proof design as the first iteration.

geoknee and others added 4 commits September 9, 2024 20:18
* spellcheck

* define forwards/backwards invalidation

* define principle of fastest derivation

* define "foreign frame"

* Update protocol/strict-derivation.md

---------

Co-authored-by: Sebastian Stammler <stammler.s@gmail.com>

# Partial Span Batch Validity

## Problem Statement + Context
Contributor

One aspect of the problem that is worth mentioning, although solved by the same partial-validity idea, is that interop fault proofs require a block to be processed optimistically, and the proof might later abort on cross-L2 interop dependencies. After having aborted, the alternative chain continuation should be as straightforward and minimal as possible, to avoid looping interop dependency checks. Falling back to a deposit-only block when a batch is invalid would serve this well.

Comment on lines 269 to 271
- Option 1: When the L1 origin reaches the Holocene activation block, discard all frames in the
Frame Queue, channels in the Channel Bank, and batches in the Batch Queue.
This gives us a nice clean starting point.
Contributor

👍 clean slate during upgrade is nice. We should add some warnings to the batch-submitter about this though, to prevent operational mistakes.

Comment on lines 292 to 293
The batcher would have to be aware of those rules and consider any blocks it submitted in channels
that didn't close prior to the Holocene activation block as needing to be resubmitted.
Member Author

@sebastianst sebastianst Sep 11, 2024

We also need to make sure that there aren't queued-up batcher txs that get included after the activation. These would cause gaps, which would be auto-derived and then cause an L2 reorg. We should add special behavior to the batcher: don't parallelize in the hour before Holocene, stay on blobs, etc.
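
One way such batcher safety behavior could look, purely as an illustration: a time-based policy switch in the hour before activation. `BatcherPolicy`, `policy_at`, and all values are hypothetical and do not correspond to existing op-batcher flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatcherPolicy:
    max_pending_txs: int   # 1 means no parallel tx submission
    allow_da_switch: bool  # False means "stay on blobs"


# Hypothetical defaults for normal operation and for the safety window.
NORMAL = BatcherPolicy(max_pending_txs=10, allow_da_switch=True)
CAUTIOUS = BatcherPolicy(max_pending_txs=1, allow_da_switch=False)


def policy_at(now: int, holocene_time: int,
              safety_window: int = 3600) -> BatcherPolicy:
    """Use the cautious policy during the window before activation."""
    if holocene_time - safety_window <= now < holocene_time:
        return CAUTIOUS
    return NORMAL
```

Submitting one tx at a time in the window means no queued tx can land out of order across the activation boundary.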

The design space and some proposed solutions will be discussed, together with practical implications
for batcher implementations that have to satisfy the stricter rules.

# Partial Span Batch Validity
Member Author

TODO: span batch prefix invalidation doesn't lead to auto-derivation, give a new span batch with correct l1 origin check a chance. This protects against L1 reorg induced deep L2 reorgs.

@sebastianst
Member Author

Further public discussion in Discord made us reconsider the idea of deriving invalid batches as empty batches. The new proposal is to simply drop invalid, and future, batches, and instead only derive invalid payloads at the engine stage as deposit-only payloads. An invalid payload wouldn't trigger the generation of future empty batches, but instead just forward-invalidate any remaining batches and the origin channel.
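
The engine-stage behavior in this revised proposal might be sketched like this, assuming a payload is just a list of deposits plus user transactions and that validity is decided by some callback. `Payload`, `process_payload`, and `is_valid` are hypothetical names, and invalidation of the origin channel is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Payload:
    deposits: List[str]
    user_txs: List[str] = field(default_factory=list)


def process_payload(payload: Payload,
                    is_valid: Callable[[Payload], bool],
                    remaining_batches: List[str]) -> Payload:
    """Return the payload to apply at the engine stage.

    A valid payload is applied unchanged. An invalid one is replaced by a
    deposit-only payload, and the batches that would have followed it are
    forward-invalidated. No empty batches are generated for later blocks.
    """
    if is_valid(payload):
        return payload
    remaining_batches.clear()  # forward-invalidate what came after
    return Payload(deposits=list(payload.deposits))
```

Because the fallback only strips user transactions from the one invalid payload, the chain continuation after a failure stays minimal.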

Contributor

@BlocksOnAChain BlocksOnAChain left a comment

I think this design doc is now in a "good-enough" state and that we can merge it, so we can move towards the actual development work for Holocene.

@sebastianst
Member Author

> I think this design doc is now in a "good-enough" state and that we can merge it, so we can move towards the actual development work for Holocene.

@BlocksOnAChain
The design changed significantly due to the async discussions after the design session. I want to first finish the spec, then adapt the design doc so it can serve as a historical reference, before we merge it.

@BlocksOnAChain
Contributor

BlocksOnAChain commented Sep 18, 2024

> > I think this design doc is now in a "good-enough" state and that we can merge it, so we can move towards the actual development work for Holocene.
>
> @BlocksOnAChain The design changed significantly due to the async discussions after the design session. I want to first finish the spec, then adapt the design doc so it can serve as a historical reference, before we merge it.

@sebastianst Got it, that makes sense to me. I was just reviewing the current state, before the last edits and after we chatted on Discord. Fully agree.

@sebastianst sebastianst merged commit 85194ab into main Oct 4, 2024
@sebastianst sebastianst deleted the seb/holocene-derivation branch October 4, 2024 18:53
Labels
documentation Improvements or additions to documentation specs
Projects
Status: Done
9 participants