Stage 1 block verification failed - DifficultyOutOfBounds #9926
Comments
It would be helpful to have more logs from around the time this happened, especially things like system time on your nodes, and any sealing logs. However, I'd like to note that there aren't any more backports to 1.11.x; you'll mostly be on your own in fixing this issue. |
@mirek are all 4 nodes running the exact same version and spec file? |
Yes except signers, which are different. |
@mirek do you have empty steps enabled? The authorities should re-organize to longer chain anyway, but seems you have a 2/2 split. Did you try stopping the "faulty" nodes that produced the incorrect block(s) and restart them after a while to check if they re-organize? |
Can you also make sure that time on all authorities is perfectly in sync? |
We have:
Clocks seem to be in sync atm, I don't have reasons to believe they were out of sync when this happened, but I can't say for sure. |
The split is 2/1/1 – but yes, there's no majority out of 4 – would that be a reason for the missing reorg? We haven't restarted the nodes yet so that we can still investigate. |
We have 4 nodes - eth1, eth2, eth3 and eth4. The last block in sync was block number 13723, hash 3ce7...824f. The following block split:
Then the nodes continued for hundreds of blocks, each keeping its own history with this split and no reorg. Here are logs from each node around this time (transaction data payloads replaced with a ...bytes... placeholder):
|
Yes, that's expected: once an invalid block is detected, every child of that block is marked invalid as well. Only restarting the node clears that cache. It also seems that you are running with a quite small block time (1s?); that definitely contributes to the issue and requires the clocks to be in sync at all times. Could you tell me what the order of authorities is? Whose turn was it to seal block 13724? I have a theory of how it happened, the main issue being that locally-sealed blocks are short-cut onto the chain (so we don't perform validation), plus time drift. It seems that the blocks were just created with an incorrect step.
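Very roughly, the behaviour looks like this (a toy sketch of the idea, not the actual client code; the names are made up):

```rust
use std::collections::HashSet;

// Once a block fails verification, its hash goes into an in-memory "bad" set,
// and every descendant that points at a bad parent is refused as well.
// Because the set only lives in memory, restarting the node clears it and
// lets the node re-evaluate (and possibly reorg onto) the other branch.
#[derive(Default)]
struct BadBlockCache {
    bad: HashSet<[u8; 32]>,
}

impl BadBlockCache {
    fn mark_bad(&mut self, hash: [u8; 32]) {
        self.bad.insert(hash);
    }

    fn check_import(&mut self, hash: [u8; 32], parent: [u8; 32]) -> bool {
        if self.bad.contains(&parent) {
            // The parent was rejected, so this child is marked bad too.
            self.bad.insert(hash);
            return false;
        }
        true
    }
}
```
|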
@mirek would you also be so kind to provide the seal fields of the relevant blocks? |
Signers are:
They are listed in the contract in that order (eth1, eth2, eth3, eth4). Authors of the previous blocks were:
|
Hi, I'm working with @mirek on the project that's experiencing the issues; we spoke about it again earlier today. @tomusdrw here are the decoded seals from the blocks:
However, I think our main problem is the following: from the code we gather that the difficulty must never be greater than or equal to the max value for uint128. However, the difficulty of the (forking) block 13724 in the chains of eth2 and eth3 is equal to that max value for uint128.
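For illustration, this is roughly how we understand the check (a simplified sketch, not the actual client code):

```rust
use ethereum_types::U256;

// Simplified version of the Stage 1 header check that, as far as we can tell,
// produces DifficultyOutOfBounds: the difficulty must stay below u128::MAX.
fn check_difficulty_bounds(difficulty: U256) -> Result<(), String> {
    let max = U256::from(u128::MAX); // 2^128 - 1
    if difficulty >= max {
        return Err(format!("DifficultyOutOfBounds: found {}, max {}", difficulty, max));
    }
    Ok(())
}

fn main() {
    // A difficulty equal to u128::MAX (as on the forking block 13724) is rejected...
    assert!(check_difficulty_bounds(U256::from(u128::MAX)).is_err());
    // ...while anything below the bound passes.
    assert!(check_difficulty_bounds(U256::from(u128::MAX - 1)).is_ok());
}
```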
So could it be that eth1 and eth4 correctly rejected those blocks, while eth2 and eth3 just included the (locally) sealed blocks as described above ("short-cut without validation")? How can we keep the difficulty down? We see that our chain starts with a low difficulty but quickly rises to near the max value minus 1:
It fluctuates a bit after the fork, but still stays high:
Is there a way of keeping the difficulty constant or at least in a lower range? How is it calculated (btw we're using gasprice 0)? I guess the difficulty needs to be in the block because it's part of the Ethereum spec, but it doesn't really have a meaning in PoA, does it? |
@mariogemoll thanks for the details, didn't have time yet to look into the seals, but let me answer questions from the last part of your post.
Yes, eth2 and eth3 produced invalid blocks (with a wrong difficulty/score), but accepted them because they were produced locally. So the main issue here is to figure out how we got into this situation. My current guess is that one authority issued both a block and an empty step statement in its turn (say turn N).
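To answer the earlier question about how the difficulty is calculated: in Aura the difficulty field carries the chain score. Roughly like this (a sketch from memory, so the exact expression may differ between versions):

```rust
use ethereum_types::U256;

// Approximate shape of the Aura chain score that ends up in the header's
// difficulty field: it starts from u128::MAX, goes down for every step skipped
// since the parent, and back up for every empty step included in the block.
fn calculate_score(parent_step: u64, current_step: u64, empty_steps: u64) -> U256 {
    U256::from(u128::MAX) + U256::from(parent_step) - U256::from(current_step)
        + U256::from(empty_steps)
}

fn main() {
    // A block produced exactly one step after its parent, with no empty steps,
    // scores u128::MAX - 1, so a healthy chain hovers just below the 2^128 bound.
    assert_eq!(calculate_score(100, 101, 0), U256::from(u128::MAX - 1));
    // Counting one empty step too many for the same step difference pushes the
    // score up to u128::MAX, which is exactly the DifficultyOutOfBounds value.
    assert_eq!(calculate_score(100, 101, 1), U256::from(u128::MAX));
}
```

In practice the score only matters as a relative weight when comparing branches, so there isn't really a knob to keep it low or constant.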
The issue here arises from the fact that you operate in really tight conditions: a super low block time, a random (but seemingly uniform) distribution of transactions, and empty steps enabled at the same time. So the authorities have a hard time deciding whether to send an empty step message or to seal a full block.
Thanks for the seal data, I'm going to analyze it closely to figure out if my theory holds. |
Hi @tomusdrw, thanks for the quick and detailed response! It all makes much more sense now. The documentation on Aura talks about the score, but I wasn't aware the difficulty field is being "repurposed" for that, thanks for the clarification (maybe this should be added to the documentation?). It seems EmptyStep is not really production-ready yet, then? We introduced it to avoid having so many empty blocks (we reasoned it would give roughly the same finality guarantees), but I guess it would be better to disable it again for now in our case. Maybe we can increase the block time to 2 seconds, but we'd rather not. The nodes are actually in the same datacentre, so we thought 1 second would be feasible. |
I thought about how the block time plays into this and whether increasing it would solve the problem. If I understood correctly, you said that (with a one-second block time) a validator would, when it's its turn and there are no transactions, issue an EmptyStep message; if a transaction comes in afterwards (during the same second) it would still seal a block in addition to the EmptyStep. But wouldn't that be possible with any block time? I guess a node would always wait until near the end of the step before sending the EmptyStep, but it could still seal a block if it sees a transaction afterwards within that same step. I guess EmptyStep creation and block sealing just need to be made mutually exclusive?
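Very roughly, something like this is what I have in mind (a hypothetical sketch with made-up names, not the engine's real API):

```rust
// Remember what was already emitted for the current step and refuse to take
// the conflicting action within that same step.
#[derive(Clone, Copy, PartialEq)]
enum StepAction {
    Nothing,
    EmptyStepSent,
    BlockSealed,
}

struct StepGuard {
    step: u64,
    action: StepAction,
}

impl StepGuard {
    fn new() -> Self {
        StepGuard { step: 0, action: StepAction::Nothing }
    }

    // Called whenever the step (i.e. the wall-clock slot) advances.
    fn begin_step(&mut self, step: u64) {
        if step != self.step {
            self.step = step;
            self.action = StepAction::Nothing;
        }
    }

    fn try_send_empty_step(&mut self) -> bool {
        if self.action == StepAction::Nothing {
            self.action = StepAction::EmptyStepSent;
            return true;
        }
        false
    }

    fn try_seal_block(&mut self) -> bool {
        // Once an empty step went out for this step, a transaction that arrives
        // late has to wait for our next turn instead of producing a second,
        // conflicting message for the same step.
        if self.action == StepAction::Nothing {
            self.action = StepAction::BlockSealed;
            return true;
        }
        false
    }
}
```
|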
@tomusdrw do you think changing the block time would help? Btw, of course we can provide config values if it helps. |
I did some digging into this; you can find a visualisation of what happened here. It seems that there are at least 3 issues:
I already spoke with @andresilva and he proposed some solutions to all issues:
@mirek The best workaround for now would be to disable empty steps - note you don't need a hardfork for that, I'll look into patching that. |
We've got an authority round setup with 4 nodes via a safe contract. Everything was working fine until recently, when we ended up with 3 different histories: only 2 nodes are in sync, the other two diverged into their own histories.
Related log entry at the moment they diverged is:
We have quite verbose/trace logs available on all 4 nodes if needed to investigate it further.