
Expensive fork-choice queued attestation mutation #6206

Closed

dapplion opened this issue Jul 30, 2024 · 7 comments
Labels

bug (Something isn't working), optimization (Something to make Lighthouse run more efficiently), v5.3.0 (Q3 2024 release with database changes!)

Comments

@dapplion (Collaborator)

Description

Debugging a Holesky node, we noted that the `split_off` here was moving ~1 GB of data every time we process an attestation:

let remaining = queued_attestations.split_off(
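For context, the pattern around that `split_off` looks roughly like the sketch below (a simplified approximation, not the exact Lighthouse source). Because `split_off` copies the entire remaining tail into a fresh allocation, a single attestation stuck near the front of the queue with a far-future slot means the whole, ever-growing Vec gets copied on every call:

```rust
// Simplified sketch of the dequeue pattern; names and shape are approximate.
struct QueuedAttestation {
    slot: u64,
}

fn dequeue_attestations(
    current_slot: u64,
    queued_attestations: &mut Vec<QueuedAttestation>,
) -> Vec<QueuedAttestation> {
    // Everything before the first not-yet-eligible attestation is dequeued.
    let remaining = queued_attestations.split_off(
        queued_attestations
            .iter()
            .position(|a| a.slot >= current_slot)
            .unwrap_or(queued_attestations.len()),
    );

    // The tail is copied into a new allocation; if that tail is ~1 GB, this
    // copy happens on every attestation processed.
    std::mem::replace(queued_attestations, remaining)
}
```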

Questions:

  1. Why did the queued attestations Vec grow so large?
  2. Why is it being mutated every call?
  3. Do we have to use a Vec for this?

Tackling 2 and 3, we could switch to a `HashMap<Slot, Vec<QueuedAttestation>>`: since time strictly advances forward, each per-slot Vec is strictly append-only and whole slots can be dropped once they are processed.

WIP of this approach: https://github.com/sigp/lighthouse/compare/stable...dapplion:lighthouse:fork-choice-queued-attestations?expand=1
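As a rough illustration of the per-slot bucketing idea (placeholder names only; the WIP branch above is the authoritative version), queueing becomes a cheap append and dequeueing removes whole buckets instead of copying one large Vec:

```rust
use std::collections::HashMap;

// Placeholder type for illustration; the real struct has more fields.
struct QueuedAttestation {
    slot: u64,
}

#[derive(Default)]
struct AttestationQueue {
    by_slot: HashMap<u64, Vec<QueuedAttestation>>,
}

impl AttestationQueue {
    /// Queueing is an append into the bucket for the attestation's slot.
    fn queue(&mut self, attestation: QueuedAttestation) {
        self.by_slot
            .entry(attestation.slot)
            .or_default()
            .push(attestation);
    }

    /// Dequeue everything whose slot is now in the past by removing whole
    /// buckets; nothing else in the map is touched or copied.
    fn dequeue(&mut self, current_slot: u64) -> Vec<QueuedAttestation> {
        let ready: Vec<u64> = self
            .by_slot
            .keys()
            .copied()
            .filter(|slot| *slot < current_slot)
            .collect();

        ready
            .into_iter()
            .flat_map(|slot| self.by_slot.remove(&slot).unwrap_or_default())
            .collect()
    }
}
```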

However, we should answer question 1 too, as there may be some other lingering issue.

@michaelsproul (Member)

Nice detective work @jimmygchen!

michaelsproul added the bug, v5.3.0, and optimization labels on Jul 30, 2024
@jimmygchen (Member)

I've done some further investigation and noticed an invalid attestation (bad `data.slot`) in fork choice causing this strange behaviour. However, I think it shouldn't be possible for this to happen, as the validation here prevents such an invalid attestation from being queued at all.

I think it might be worth running a memtest on the machine.

@jimmygchen (Member)

jimmygchen commented Jul 31, 2024

Further investigation shows that this cannot happen (see the link above), and @michaelsproul found that the bad attestation's slot actually has 1 bit flipped compared to its correct slot, so it could very well be a memory/disk issue.
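As an aside, confirming a single-bit flip between two slot values is straightforward: XOR them and check the population count. The values below are made up purely for illustration:

```rust
fn is_single_bit_flip(a: u64, b: u64) -> bool {
    // Exactly one differing bit <=> the XOR has exactly one bit set.
    (a ^ b).count_ones() == 1
}

fn main() {
    // Hypothetical slots: the "corrupted" value is the correct slot with bit 30
    // flipped, which pushes it far into the future.
    let correct_slot: u64 = 2_000_000;
    let corrupted_slot = correct_slot ^ (1 << 30);
    assert!(is_single_bit_flip(correct_slot, corrupted_slot));
}
```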

I couldn't get anything from memtests on that machine. I think we could deploy v5.3.0 to more testnet nodes and see if it happens elsewhere (including slasher nodes); if it doesn't, we can probably conclude this bug is likely a hardware issue and move this issue out of v5.3.0.

@michaelsproul (Member)

If this happens more often with nodes that have been running slashers, maybe it is some UB in the slasher code, potentially the C database code!

We could try running a slasher node under Valgrind.

@jimmygchen (Member)

This undefined behaviour might have surfaced since the recent refactor in #4529, which revealed a bug in the LMDB bindings. See #6211 for a workaround fix.

Under normal circumstances the queued attestations vec should be quite small, as there are validations to prevent attestations for future slots from being queued. It would be useful to add a metric for this, though.
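A minimal sketch of what such a gauge could look like, using the generic `prometheus` crate API for illustration (Lighthouse has its own metrics wrappers, and the eventual metric name may differ):

```rust
use prometheus::{register_int_gauge, IntGauge};

// Hypothetical metric name, for illustration only.
fn queued_attestations_gauge() -> IntGauge {
    register_int_gauge!(
        "beacon_fork_choice_queued_attestations",
        "Number of attestations currently queued in fork choice"
    )
    .expect("gauge can be registered")
}

// Update the gauge whenever the queue is mutated (enqueue/dequeue).
fn record_queue_length(gauge: &IntGauge, queue_len: usize) {
    gauge.set(queue_len as i64);
}
```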

@dapplion (Collaborator, Author)

dapplion commented Aug 1, 2024

@jimmygchen this PR adds a metric for the queued attestations vec:

Closing this issue as won't fix

dapplion closed this as not planned on Aug 1, 2024
@michaelsproul (Member)

For reference, the metric to check for this issue is beacon_fork_choice_process_attestation_seconds_*. Processing attestations in fork choice should be very quick (<5ms per attestation), but if this issue is occurring then that metric will blow way up (we've seen it >100ms per attestation).

cc @chong-he
