Skip to content

Don't create child lookup if parent is faulty #7118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 5, 2025

Conversation

dapplion
Copy link
Collaborator

@dapplion dapplion commented Mar 12, 2025

Issue Addressed

Issue discovered on PeerDAS devnet (node lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io). Summary:

  • A lookup is created for block root 0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12
  • That block or a parent is faulty and 0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12 is added to the failed chains cache
  • We later receive a block that is a child of a child of 0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12
  • We create a lookup, which attempts to process the child of 0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12 and hit a processor error UnknownParent, hitting this line

lookup.set_awaiting_parent(parent_root);
debug!(self.log, "Marking lookup as awaiting parent"; "id" => lookup.id, "block_root" => ?block_root, "parent_root" => ?parent_root);
self.search_parent_of_child(parent_root, block_root, &peers, cx);

search_parent_of_child does not create a parent lookup because the parent root is in the failed chain cache. However, we have already marked the child as awaiting the parent. This results in an inconsistent state of lookup sync, as there's a lookup awaiting a parent that doesn't exist.

Now we have a lookup (the child of 0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12) that is awaiting a parent lookup that doesn't exist: hence stuck.

Impact

This bug can affect Mainnet as well as PeerDAS devnets.

This bug may stall lookup sync for a few minutes (up to LOOKUP_MAX_DURATION_STUCK_SECS = 15 min) until the stuck prune routine deletes it. By that time the root will be cleared from the failed chain cache and sync should succeed. During that time the user will see a lot of WARN logs when attempting to add each peer to the inconsistent lookup. We may also sync the block through range sync if we fall behind by more than 2 epochs. We may also create the parent lookup successfully after the failed cache clears and complete the child lookup.

This bug is triggered if:

  • We have a lookup that fails and its root is added to the failed chain cache (much more likely to happen in PeerDAS networks)
  • We receive a block that builds on a child of the block added to the failed chain cache

Proposed Changes

Ensure that we never create (or leave existing) a lookup that references a non-existing parent.

I added must_use lints to the functions that create lookups. To fix the specific bug we must recursively drop the child lookup if the parent is not created. So if search_parent_of_child returns false now return LookupRequestError::Failed instead of LookupResult::Pending.

As a bonus I have a added more logging and reason strings to the errors

@dapplion dapplion requested a review from jxs as a code owner March 12, 2025 15:57
@dapplion dapplion added ready-for-review The code is ready for review syncing labels Mar 12, 2025
@dapplion
Copy link
Collaborator Author

dapplion commented Mar 12, 2025

I found other instances of the same issue. For example, lookup id: 62124, block_root: 0x48d5c54417da73dbec50da
63f8b4e04c206e6c62a2ff3ddf40e594d1732f1a24 also got stuck for the same reason and was eventually drop by the prune routine

beacon/logs/beacon.log:Mar 12 11:03:34.017 WARN Notify the devs a sync lookup is stuck, lookup: SingleBlockLookup { id: 62124, block_request_state: BlockRequestState { state: SingleLookupRequestState { state: AwaitingProcess(PeerGroup { peers
: {PeerId("16Uiu2HAm2vA4QzJmuKX6DmSHpDveaN4PbLxuTaCL2qM5gFBbPoYT"): [0]} }), failed_processing: 0, failed_downloading: 0 } }, component_requests: NotNeeded("no data"), peers: 2, block_root: 0x48d5c54417da73dbec50da63f8b4e04c206e6c62a2ff3ddf40
e594d1732f1a24, awaiting_parent: Some(0x5d0592ba3b1d4337ab5794a572ceaf57b4bfc7a94e76c5bd2ebe5d83eadd2123), created: Instant { tv_sec: 1735392, tv_nsec: 301631719 } }, block_root: 0x48d5c54417da73dbec50da63f8b4e04c206e6c62a2ff3ddf40e594d1732f1
a24, service: lookup_sync, service: sync, module: network::sync::block_lookups:897

@dapplion
Copy link
Collaborator Author

In lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io there are a lot of instances of Parent lookup chain too long, justifying the addition to failed_chains

@jimmygchen jimmygchen requested a review from Copilot April 9, 2025 04:23
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

beacon_node/network/src/sync/manager.rs:920

  • The variable 'parent_root' is not defined in this scope. Consider extracting it from 'block_component' before using it in the debug log.
debug!(?block_root, ?parent_root, "No lookup created for child and parent");

beacon_node/network/src/sync/manager.rs:928

  • The variable 'parent_root' is not available in this context. Extract and define it (e.g. via block_component.parent_root()) to ensure the debug log has access to the correct parent identifier.
debug!(%block_root, %parent_root, reason, "Ignoring unknown parent request");

@jimmygchen jimmygchen self-assigned this May 22, 2025
Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Would you mind adding a regression test for this?

I made an attempt but didn't end up finishing it (and it didn't work), but I think a test would like this?

    // Regression test for https://github.com/sigp/lighthouse/pull/7118
    // Trigger an unknown parent block that references a previously failed chain and assert
    // that the lookup is not created.
    let mut child_block = rig.rand_block();
    *child_block.message_mut().parent_root_mut() = chain_hash;
    rig.trigger_unknown_parent_block(peer_id, child_block.into());
    rig.expect_no_active_lookups();

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels May 27, 2025
Copy link

mergify bot commented May 27, 2025

All required checks have passed and there are no merge conflicts. This pull request may now be ready for another review.

@mergify mergify bot added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels May 27, 2025
@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels May 28, 2025
@jimmygchen
Copy link
Member

@dapplion I'll add a test for this - would like to get this merged.

@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Jun 5, 2025
@jimmygchen
Copy link
Member

jimmygchen commented Jun 5, 2025

@dapplion Added a test that fails on unstable but passes with this fix.
Let me know if you're happy with this - and we can merge.

@jimmygchen jimmygchen assigned dapplion and unassigned jimmygchen Jun 5, 2025
// At this point, the chain should have been deemed too deep and pruned.
// The tip root should have been inserted into failed chains.
rig.assert_failed_chain(tip_root);
rig.expect_no_penalty_for(peer_id);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test follows current behaviour where we do NOT penalize peers for sending us long chains that maybe be canonical (honest behaviour).

// THEN: The extending block should not create a lookup because the tip was inserted into failed chains.
rig.expect_no_active_lookups();
// AND: The peer should be penalized for extending a failed chain.
rig.expect_single_penalty(peer_id, "failed_chain");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, we then penalize peers who do the same action. This is inconsistent and can lead to penalties for honest peers. We should not penalize peers who extend known long chains. We could have to hashsets

  • failed_chains: reject descendant lookups and penalize
  • long_chains: reject descendant lookups and don't penalize

We should fix this in a future PR

@dapplion
Copy link
Collaborator Author

dapplion commented Jun 5, 2025

✅ good to merge @jimmygchen

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jun 5, 2025
mergify bot added a commit that referenced this pull request Jun 5, 2025
@mergify mergify bot merged commit d457cee into sigp:unstable Jun 5, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge. syncing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants