Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the difference between SYNCING and ACCEPTED #215

Merged
merged 2 commits into from
May 13, 2022
Merged

Conversation

djrtwo
Copy link
Contributor

@djrtwo djrtwo commented May 3, 2022

The spec does not adequately explain the difference between SYNCING and ACCEPTED for non-canonical branches.

For a block for a non-canonical branch to be ACCEPTED, it must be a "well-formed" chain of known ancestors.
Such a block with unknown ancestors cannot be ACCEPTED and must instead be signaled as SYNCING.

The purpose of ACCEPTED is to signal that there is in fact a well-formed chain but it has not yet been executed. ACCEPTED is not intended to also be for non-canonical branches missing requisite data.

The spec does not adequately explain the difference between SYNCING and ACCEPTED for non-canonical branches.

For a block for a non-canonical branch to be ACCEPTED, it must be a "well-formed" chain of known ancestors.
Such a block with unknown ancestors cannot be ACCEPTED and must instead be signalled as SYNCING.
@@ -259,7 +259,7 @@ The payload build process is specified as follows:
5. Client software **MUST** respond to this method call in the following way:
* `{status: INVALID_BLOCK_HASH, latestValidHash: null, validationError: errorMessage | null}` if the `blockHash` validation has failed
* `{status: INVALID_TERMINAL_BLOCK, latestValidHash: null, validationError: errorMessage | null}` if terminal block conditions are not satisfied
* `{status: SYNCING, latestValidHash: null, validationError: null}` if the payload extends the canonical chain and requisite data for its validation is missing
* `{status: SYNCING, latestValidHash: null, validationError: null}` if requisite data for the payload's acceptance or validation is missing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it worth adding something like "all ancestors of the payload are known and comprise a well-formed chain" to {status: ACCEPTED ... statement below. Otherwise, it's not clear that ACCEPTED may only be returned when a chain of payload's ancestors is known.

@sauliusgrigaitis
Copy link

I'm not sure this solves the problem. My understanding was that these different statuses should give root cause information for CLs. AFAIK SYNCING also may mean that requisite data is not missing but EL estimates the call to take significant time to process so it returns SYNCING instead of VALID/INVALID. So CL after SYNCING should wait for a while (how long?) and then if it's still SYNCING assume that EL is not busy processing and actually misses some requisite data. This guessing game is something developers want to avoid.

However, I doubt that these different statuses actually are useful. Seems that CL teams have general consensus that it's not worth handling these different statuses accordingly (because it's just too much of cases and even some ELs act differently in the same situation). Instead, it's just better to hope that optimistic sync eventually will resolve the situation. Actually, at Grandine we use the same philosophy and we try to handle the minimum number of statuses and just ignore the rest because otherwise we just bump into new edge cases.

A more general discussion - CL should drive EL. However, these different statuses actually make EL drive CL. In other words, if CL thinks that some payload is not needed to be validated (say because it's on some deeply forked chain with a low chance to reorg into) then it simply doesn't send the payload into EL. That's CL driving EL. But currently, it's the opposite, EL tells when it decides to not validate a given payload and there isn't much for CL to do with such a response.

@mkalinin
Copy link
Collaborator

mkalinin commented May 4, 2022

A more general discussion - CL should drive EL. However, these different statuses actually make EL drive CL. In other words, if CL thinks that some payload is not needed to be validated (say because it's on some deeply forked chain with a low chance to reorg into) then it simply doesn't send the payload into EL. That's CL driving EL. But currently, it's the opposite, EL tells when it decides to not validate a given payload and there isn't much for CL to do with such a response.

Only CL knows what is the head of canonical chain and EL must attempt to catch up with canonical chain no matter what. EL has liberty of not doing so for non-canonical branches but it doesn't give EL an ability to drive CL.

@sauliusgrigaitis
Copy link

EL has liberty of not doing so for non-canonical branches but it doesn't give EL an ability to drive CL.

I see this slightly differently. EL is able to respond with an ambiguous status that it's even not clear is it just taking time to answer CL's question on the validity of the non-canonical chain (the one that CL is considering switching to) or CL is actually missing the requisite information and actually is not doing anything. So this is the indirect driving of CL decisions.

@djrtwo
Copy link
Contributor Author

djrtwo commented May 4, 2022

AFAIK SYNCING also may mean that requisite data is not missing but EL estimates the call to take significant time to process so it returns SYNCING instead of VALID/INVALID

SYNCING doesn't mean that it will execute the data once it is found. It just means that the requisite data is not available.

If it is in the canonical chain, then it also will be executed but there is no guarantee that EL will execute non-canonical branches. This is purely optional

@sauliusgrigaitis
Copy link

SYNCING doesn't mean that it will execute the data once it is found. It just means that the requisite data is not available.

Maybe my English is broken but I didn't state the opposite. I wanted to say that EL may return SYNCING if it's actually executing the payload (requisites are not missing), but estimates processing to take longer than sub-second (or some close period IIRC from the discussions).

Anyway, this thread convinced me even more that there isn't much motivation to differently handle ACCEPTED and SYNCING in CLs. At least for the short term seems the best strategy is to simply handle VALID/INVALID and otherwise just move forward optimistically.

If the intention is that CLs should use these statuses, then these statuses would need to be reworked significantly, some ideas:

  1. Status should have extra information that precisely tells why the payload is not verified so that CL would be able to act accordingly (for example, if EL is missing some ancestor then it simply tells it to CL and maybe CL can supply it).
  2. If it's actually processing the payload then it may return with an estimate when processing should be completed or even would send an extra message to CL when processing is finished (in order to avoid polling).

@djrtwo
Copy link
Contributor Author

djrtwo commented May 8, 2022

I wanted to say that EL may return SYNCING if it's actually executing the payload (requisites are not missing), but estimates processing to take longer than sub-second (or some close period IIRC from the discussions).

This is not specified (EL aborting execution due to taking too long and saying "SYNCING).

EL aborting on execution on canonical chain is bad (could cause network splits depending on time and behaviour of various clients). CL should instead timeout and either insert children (if they come or retry if deemed valuable).

Whereas aborting on a-non canonical-branch would be more correctly as ACCEPTED -- that is, it is well formed but EL decided it wasn't worth executing.

@sauliusgrigaitis
Copy link

This is not specified (EL aborting execution due to taking too long and saying "SYNCING).

I find it really interesting how much miscommunication and confusion are in these statuses discussions :)

I did not mean aborting. I meant returning SYNCING immediately to the CL and actually proceeding with the payload validation on the EL side in order for CL to get a response later (by polling or making a blocking call that would return whenever validation is completed).

I think that easy to understand semantics would be:

  • SYNCING - EL is actually proceeding with the payload processing, but the estimate is too high so it returns immediately and CL can get the response later;
  • ACCEPTED - EL doesn't process the payload for whatever reason.

In this case, CL would easily benefit from these two basic responses even without any extra info. CL could poll EL again before the sensitive tasks if it got SYNCING. So this way CL has a chance to correctly vote on the latest head.

@djrtwo
Copy link
Contributor Author

djrtwo commented May 13, 2022

I'm going to go ahead and get this merged. It is sufficient for our current purposes and how the clints are using return values. If we want to further refine, lets take it to an issue

Copy link
Collaborator

@mkalinin mkalinin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@djrtwo djrtwo merged commit e69420f into main May 13, 2022
@djrtwo djrtwo deleted the djrtwo-patch-1 branch May 13, 2022 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants