
BABE skipped epochs #681

Closed
andresilva opened this issue Jul 24, 2023 · 13 comments
@andresilva

Whenever there is a period of inactivity (i.e. no blocks being produced) lasting more than one epoch, BABE breaks: since epochs are announced in advance, by the time the inactivity is resolved we are in a future epoch that was never announced (i.e. no authorities or randomness are available). The security model of BABE is also broken in this scenario, but for a long time the Substrate implementation would additionally brick the chain completely, making it impossible to author new blocks. As of paritytech/substrate#11727 this was changed so that empty epochs can be skipped over, i.e. after a period of inactivity the runtime will appropriately update the epoch index based on the given slot. We are still left with the problem of which epoch data to use, and the natural thing to do is to re-use the data of the latest announced epoch. As an example, assume the last block was authored in a slot of epoch 3, so the data for epoch 4 had already been announced. Given a period of inactivity spanning 3 epochs, a block is then authored in a slot belonging to epoch 6. In order to author this block, the client should re-use the data that was previously published for epoch 4.
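The skipped-epoch recovery described above can be sketched as follows. This is a minimal illustration, not the actual Substrate code: names like `EpochData`, `epoch_for_slot`, and `epoch_data_for_authoring` are assumptions for the example, and the real epoch data carries authorities as well.

```rust
// Illustrative sketch only; names and types are hypothetical, not Substrate's.
#[derive(Clone, Debug, PartialEq)]
struct EpochData {
    epoch_index: u64,
    randomness: [u8; 32],
    // authorities elided for brevity
}

/// Epoch index implied by a slot, given the chain's genesis slot and epoch length.
fn epoch_for_slot(slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (slot - genesis_slot) / epoch_duration
}

/// When authoring at `slot`, if the slot falls into an epoch past the latest
/// announced one (i.e. epochs were skipped), re-use the announced data,
/// re-labelled with the epoch index derived from the slot.
fn epoch_data_for_authoring(
    slot: u64,
    genesis_slot: u64,
    epoch_duration: u64,
    latest_announced: &EpochData,
) -> EpochData {
    let current = epoch_for_slot(slot, genesis_slot, epoch_duration);
    if current > latest_announced.epoch_index {
        EpochData {
            epoch_index: current,
            randomness: latest_announced.randomness,
        }
    } else {
        latest_announced.clone()
    }
}

fn main() {
    // Last block authored in epoch 3, epoch 4 already announced; after 3 empty
    // epochs, a slot in epoch 6 arrives and re-uses epoch 4's data.
    let announced = EpochData { epoch_index: 4, randomness: [0xaa; 32] };
    let data = epoch_data_for_authoring(600 * 6 + 5, 0, 600, &announced);
    assert_eq!(data.epoch_index, 6);
    assert_eq!(data.randomness, announced.randomness);
}
```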

@andresilva
Author

andresilva commented Jul 24, 2023

This problem is also described by the Ouroboros team in this paper, in chapter 2. They describe a manual mitigation, which is the same one we employed in early 2020 on Kusama (we never had to do it on Polkadot mainnet). They do not provide any description of an automatic mitigation method, though.

@Noc2
Collaborator

Noc2 commented Jul 24, 2023

Thanks a lot for creating the issue! We will look into it as soon as possible.

@hndnklnc

Do we need a solution for only a single empty epoch, or for multiple sequential empty epochs?

@tomaka
Collaborator

tomaka commented Jul 25, 2023

Does the first header after a period of inactivity contain an epoch change digest? My understanding is yes.

For example, let's say that we're in the middle of epoch 3, and all of a sudden all the validators die. Two weeks later, they come back online.
Is it correct that the first authored block is in epoch 9 (or whatever the number is) and announces epoch 10 (or whatever the number is)? And epoch 9 uses the authorities and randomness of epoch 4?

@andresilva
Author

@tomaka Everything you said is correct.

@hndnklnc Yes, we need to be able to recover from any kind of disaster scenario. The most likely scenario for this to happen is due to a client bug that prevents block production, and e.g. with epochs on Polkadot being 6 hours, it's not unlikely that a fix would take longer than this time to be developed and rolled out.

Not sure if it's obvious but this issue is about updating the spec to reflect the solution that was already implemented and deployed. AFAIK the current SASSAFRAS implementation is also employing the same solution, so if there is a better solution to handle this problem I suggest we use it there.

@hndnklnc

I think the current solution makes sense, as far as I understand it. I just want to verify this: assume that epochs e+1 to e+n are skipped. The validators in epoch e+n+1 use the randomness from epoch e-1, right?

@andresilva
Author

andresilva commented Jul 27, 2023

I think we might be using slightly different definitions (i.e. randomness collection vs announcement), let me expand on your example. If epochs E+1 to E+n are skipped then the last block was authored at epoch E, at which point the randomness to be used in E+1 had already been announced. When coming back online, the validators at epoch E+n+1 will use the randomness that was meant to be used in epoch E+1 (which was collected during epoch E-1).
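The announcement timeline described above can be made concrete with a tiny sketch. The helper names here are purely illustrative (not from any real codebase); they just encode the rule that randomness collected during epoch X is announced during X+1 for use in X+2.

```rust
// Illustrative helpers only; names are hypothetical.

/// Under normal operation, the randomness used in `use_epoch` was collected
/// two epochs earlier.
fn collection_epoch(use_epoch: u64) -> u64 {
    use_epoch - 2
}

/// With epochs e+1..=e+n skipped, the first post-gap epoch (e+n+1) falls back
/// to the randomness that had been announced for e+1.
fn fallback_randomness_epoch(last_authored_epoch: u64) -> u64 {
    last_authored_epoch + 1
}

fn main() {
    let e = 10u64; // last epoch with blocks
    let n = 3u64; // number of skipped epochs
    let post_gap_epoch = e + n + 1; // epoch 14, first epoch after the gap
    let announced_for = fallback_randomness_epoch(e); // epoch 11
    assert_eq!(announced_for, 11);
    // That randomness was collected during epoch e-1 = 9.
    assert_eq!(collection_epoch(announced_for), e - 1);
    let _ = post_gap_epoch;
}
```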

@bhargavbh
Contributor

Hi @andresilva, thanks for pointing this out. I created a short PR addressing it here: #685. Can you please take a look?

@tomaka
Collaborator

tomaka commented Aug 8, 2023

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

It's not a bad thing that on the runtime side the epoch index is still tracked in case it is needed in the future, but on the client side there really isn't any need for this concept anymore.

IMO the modification to the spec should follow a similar approach, because adding a little addendum somewhere creates confusion more than anything else.
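The slot-based calculation described in this comment can be sketched as follows. This is a minimal illustration under the stated formula; the function name and test values are assumptions, not smoldot's actual code.

```rust
// Illustrative sketch of deriving the transcript's epoch value purely from
// slots, with no stored epoch counter on the client side.
fn transcript_epoch(slot_of_block: u64, first_slot_ever: u64, slots_per_epoch: u64) -> u64 {
    (slot_of_block - first_slot_ever) / slots_per_epoch
}

fn main() {
    // With 600-slot epochs starting at slot 1000, slot 2800 falls in epoch 3.
    assert_eq!(transcript_epoch(2800, 1000, 600), 3);
    // A gap of empty epochs changes nothing: the index follows the slot.
    assert_eq!(transcript_epoch(1000 + 600 * 7 + 1, 1000, 600), 7);
}
```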

@tomaka
Collaborator

tomaka commented Aug 8, 2023

Thinking about it more, the interface between the runtime and the client is part of the spec as well and returns an epoch index. It feels weird to not mention that the epoch index returned by the runtime isn't at least partially tied to the actual epoch index calculated in the transcript.

So basically I'm opinion-less.

@andresilva
Author

@tomaka I agree that as it stands the epoch index is superfluous, since it can be trivially calculated. I think this is mainly due to the fact that right now we are relying on unix time to calculate the current slot rather than the "median algorithm". Also, for the reason you mentioned, I think we should keep the epoch index mention in the spec (regardless of whether client implementations actually need to keep track of it or just calculate it on-the-fly).

@andresilva
Author

Closing after #685.

@tomaka
Collaborator

tomaka commented Aug 9, 2023

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

For future reference, tracking only the "first slot ever" has a drawback: it's not possible to detect when a block contains an epoch-change digest at a point where it isn't supposed to.

This can be solved by tracking both the first slot ever and also the starting slot of the current epoch.
But in the end I've kept the concept of epoch_index in smoldot, as it didn't require a lot of code changes.
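The dual-tracking idea mentioned above can be sketched roughly as follows. This is a hedged illustration of the check, not smoldot's actual implementation; the struct and method names are invented for the example.

```rust
// Hypothetical sketch: by tracking both the first slot ever and the starting
// slot of the current epoch, a client can tell whether a block is allowed
// (and required) to carry an epoch-change digest.
struct EpochTracker {
    first_slot_ever: u64,
    current_epoch_start_slot: u64,
    slots_per_epoch: u64,
}

impl EpochTracker {
    /// A block must contain an epoch-change digest exactly when its slot
    /// falls past the end of the current epoch.
    fn expects_epoch_change(&self, block_slot: u64) -> bool {
        block_slot >= self.current_epoch_start_slot + self.slots_per_epoch
    }

    /// On an epoch change, align the new epoch's start slot to the grid
    /// derived from `first_slot_ever`, implicitly jumping over any skipped
    /// (empty) epochs.
    fn apply_epoch_change(&mut self, block_slot: u64) {
        let index = (block_slot - self.first_slot_ever) / self.slots_per_epoch;
        self.current_epoch_start_slot =
            self.first_slot_ever + index * self.slots_per_epoch;
    }
}

fn main() {
    let mut tracker = EpochTracker {
        first_slot_ever: 0,
        current_epoch_start_slot: 1800, // epoch 3 of 600-slot epochs
        slots_per_epoch: 600,
    };
    assert!(!tracker.expects_epoch_change(2000)); // still inside epoch 3
    assert!(tracker.expects_epoch_change(4201)); // past a gap: digest required
    tracker.apply_epoch_change(4201);
    assert_eq!(tracker.current_epoch_start_slot, 4200); // start of epoch 7
}
```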
