
Conversation

realbigsean
Member

Issue Addressed

I observed our forward sync on devnet 9 would stall when we would hit this log:

```
250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036
```

Proposed Changes

`range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. Later, however, if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to still exist so that we can handle the retry logic. So this PR simply re-inserts the request (without any accumulated blobs or blocks) when we hit an error here.

The issue is unique to block + blob sync because the error here is only possible from mismatches between blocks and blobs after we've downloaded both; there's no equivalent error in block-only sync.
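
To make the failure mode concrete, here's a minimal, self-contained sketch of the pattern (toy types and names only, not the actual `network::sync::manager` code): the response handler removes the request from its tracking map, so if the coupled blocks + blobs fail validation, the request has to be put back before the error handler runs, otherwise the retry is silently dropped.

```rust
use std::collections::HashMap;

// Illustrative only: a toy request tracker, not Lighthouse's real sync manager.
struct RequestTracker {
    // request id -> accumulated (blocks, blobs) for that range request
    active: HashMap<u64, (Vec<String>, Vec<String>)>,
}

impl RequestTracker {
    /// Called once both the block and blob streams for a request have finished.
    fn on_blocks_and_blobs_response(&mut self, id: u64) {
        // The request is removed here so the happy path can consume it.
        let (blocks, blobs) = self.active.remove(&id).expect("request exists");

        if let Err(err) = couple_blocks_and_blobs(&blocks, &blobs) {
            // Without this re-insert, `inject_error` finds no request and the
            // batch is never retried, stalling forward sync. Re-insert an
            // empty request (no accumulated blocks or blobs) first.
            self.active.insert(id, (Vec::new(), Vec::new()));
            self.inject_error(id, err);
        }
    }

    /// The retry logic expects the request id to still be tracked.
    fn inject_error(&mut self, id: u64, err: String) {
        if self.active.remove(&id).is_some() {
            println!("retrying request {id} after error: {err}");
        } else {
            println!("unknown request {id}; error dropped and the batch stalls");
        }
    }
}

/// Stand-in for the KZG-commitment consistency check between blocks and blobs.
fn couple_blocks_and_blobs(blocks: &[String], blobs: &[String]) -> Result<(), String> {
    if blocks.len() == blobs.len() {
        Ok(())
    } else {
        Err("KzgCommitmentMismatch".to_string())
    }
}

fn main() {
    let mut tracker = RequestTracker { active: HashMap::new() };
    // One block downloaded but no blob for it -> mismatch, exercising the retry path.
    tracker.active.insert(42, (vec!["block".into()], vec![]));
    tracker.on_blocks_and_blobs_response(42);
}
```

Re-inserting an empty request should be enough here, since the retry re-downloads the batch anyway and the previously accumulated blocks and blobs aren't needed.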

@realbigsean realbigsean added the ready-for-review and deneb labels Oct 20, 2023
@realbigsean realbigsean added the bug label Oct 20, 2023
bors bot pushed a commit that referenced this pull request Oct 25, 2023
1. Add commitments to logs and update the `Display` implementation of `KzgCommitment` so that it is truncated, similar to block roots.

I've been finding it difficult to debug scenarios involving multiple blobs for the same `(index, block_root)`. Logging the commitment will help with this; we can match it against what exists in the block.

Example output:

```
Oct 20 21:13:36.700 DEBG Successfully verified gossip blob       commitment: 0xa3c1…1cd8, index: 0, root: 0xf31e…f9de, slot: 154568
Oct 20 21:13:36.785 DEBG Successfully verified gossip block      commitments: [0xa3c1…1cd8, 0x8655…02ff, 0x8d6a…955a, 0x84ac…3a1b, 0x9752…629b, 0xb9fc…20fb], root: 0xf31eeb732702e429e89057b15e1c0c631e8452e09e03cb1924353f536ef4f9de, slot: 154568, graffiti: teku/besu, service: beacon
```

Example output in a block with no blobs (this will show up pre-deneb):
```
426734:Oct 20 21:15:24.113 DEBG Successfully verified gossip block, commitments: [], root: 0x619db1360ba0e8d44ae2a0f2450ebca47e167191feecffcfac0e8d7b6c39623c, slot: 154577, graffiti: teku/nethermind, service: beacon, module: beacon_chain::beacon_chain:2765
```
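
For illustration, a truncated `Display` along these lines could look like the sketch below (a stand-in newtype, not the exact Lighthouse definition):

```rust
use std::fmt;

// Hypothetical stand-in: the real `KzgCommitment` wraps the 48-byte
// compressed G1 commitment.
pub struct KzgCommitment(pub [u8; 48]);

impl fmt::Display for KzgCommitment {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print "0x" + first two bytes + "…" + last two bytes, matching the
        // truncated block-root style (e.g. `0xa3c1…1cd8`).
        write!(
            f,
            "0x{:02x}{:02x}…{:02x}{:02x}",
            self.0[0], self.0[1], self.0[46], self.0[47]
        )
    }
}
```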

2. Remove `strum::IntoStaticStr` from `AvailabilityCheckError`. This is because `IntoStaticStr` ends up dropping the information inside the enum, so the KZG commitments in this error are dropped, making it more difficult to debug:
```
AvailabilityCheckError::KzgCommitmentMismatch {
    blob_commitment: KzgCommitment,
    block_commitment: KzgCommitment,
},
```
which is output as just `AvailabilityCheckError`.
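
To show why the derive was removed, here's a self-contained sketch with simplified stand-in types (not the real Lighthouse definitions): strum's `IntoStaticStr` converts a variant into just its name, so the commitment fields are lost, while `Debug` keeps them.

```rust
use strum::IntoStaticStr; // assumes strum with the "derive" feature enabled

// Simplified stand-ins for illustration only.
#[derive(Debug)]
pub struct KzgCommitment(pub [u8; 48]);

#[derive(Debug, IntoStaticStr)]
pub enum AvailabilityCheckError {
    KzgCommitmentMismatch {
        blob_commitment: KzgCommitment,
        block_commitment: KzgCommitment,
    },
}

fn main() {
    let err = AvailabilityCheckError::KzgCommitmentMismatch {
        blob_commitment: KzgCommitment([0xa3; 48]),
        block_commitment: KzgCommitment([0x86; 48]),
    };

    // `Debug` keeps the fields, which is what we want in the logs.
    println!("{err:?}");

    // The static string carries only the variant name; both commitments are lost.
    let s: &'static str = err.into();
    assert_eq!(s, "KzgCommitmentMismatch");
}
```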

3. Some additional misc sync logs I found useful in debugging #4869

4. This downgrades "Block returned for single block lookup not present" to debug because I don't think we can fix the scenario that causes this unless we can cancel in-flight RPC requests.

Co-authored-by: realbigsean <seananderson33@gmail.com>
Member

@ethDreamer ethDreamer left a comment


LGTM! Not that I understand sync all that well, but this seems to make sense. Lighthouse was continuously stalling yesterday when syncing devnet-10, and then Pawan pointed me to this PR. Unfortunately, devnet-10 was shut down by the time I tried this PR this morning, but Lighthouse does seem to sync fine on devnet-11. This devnet is much newer, though, so we might have to wait a bit before we can hit this bug again on devnet-11 and retry.

@realbigsean
Member Author

bors r+

bors bot pushed a commit that referenced this pull request Oct 31, 2023
## Issue Addressed

I observed our forward sync on devnet 9 would stall when we would hit this log:
```
250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036
```

## Proposed Changes

`range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. Later, however, if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to still exist so that we can handle the retry logic. So this PR simply re-inserts the request (without any accumulated blobs or blocks) when we hit an error here.

The issue is unique to block + blob sync because the error here is only possible from mismatches between blocks and blobs after we've downloaded both; there's no equivalent error in block-only sync.



Co-authored-by: realbigsean <seananderson33@gmail.com>

bors bot commented Oct 31, 2023

Build failed:

@realbigsean
Member Author

bors retry

bors bot pushed a commit that referenced this pull request Oct 31, 2023

bors bot commented Oct 31, 2023

Pull request successfully merged into unstable.

Build succeeded!


@bors bors bot changed the title from fix deneb sync bug to [Merged by Bors] - fix deneb sync bug Oct 31, 2023
@bors bors bot closed this Oct 31, 2023
@realbigsean realbigsean deleted the fix-deneb-sync-bug branch November 21, 2023 16:14