
implement getBlobs and bypass blob gossip validation on successful blob retrievals from EL #6913

Open · agnxsh wants to merge 20 commits into unstable from getBlobs

Conversation

agnxsh
Contributor

@agnxsh agnxsh commented Feb 10, 2025

ethereum/execution-apis#559

If the EL supports engine_getBlobsV1:

  • makes the block available as soon as the CL gets all the required blobs from the EL blob pool
  • populates the blob quarantine with those blobs (should probably verify proofs and inclusion once, like gossip?)
  • pops from the blob quarantine and adds the blobs to the block processor

else:

  • does everything the conventional way

Note: getBlobs in nimbus-eth2 is made backward compatible from Electra onwards, NOT from Deneb (Deneb is what the spec suggests).
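
For illustration, here is a self-contained toy sketch (Nim) of the decision flow described above; getBlobsFromEl, importDirectly and importViaGossip are made-up names for this example, not nimbus-eth2 APIs:

import std/options

type Blob = string

proc getBlobsFromEl(wanted: int): Option[seq[Blob]] =
  # stand-in for engine_getBlobsV1; a real EL may return none, or a partial set
  some(@["blob0", "blob1"])

proc importDirectly(blobs: seq[Blob]) =
  echo "block available, importing ", blobs.len, " blobs from the EL blob pool"

proc importViaGossip() =
  echo "conventional path: wait for blob sidecars on gossip"

let kzgCommitmentCount = 2
let resp = getBlobsFromEl(kzgCommitmentCount)
if resp.isSome and resp.get().len == kzgCommitmentCount:
  # populate the blob quarantine and hand the block to the block processor
  importDirectly(resp.get())
else:
  importViaGossip()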

Pros

  • skips KZG proof verification and inclusion proof verification for up to 9 blobs per slot (assuming the EL can be trusted)
  • faster block import (in happy cases)

Cons

  • relying heavily on the EL mempool can expose us to threats of various sorts; however, our block processor runs another KZG proof check before persisting the block and blobs. In case of a failure, the RequestManager should be able to fetch the blobs again, replacing the otherwise malicious blob/proof data sent by the bad EL; the EL can be quickly swapped, and Nimbus would rely on blob gossip for that entire duration before switching back to optimistically pulling blobs from the EL (a toy sketch of this safety net follows below).
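
As a rough illustration of that safety net (toy Nim with made-up names, not nimbus-eth2 code):

type ElBlob = object
  data: string
  validProof: bool

proc verifyKzgProofs(blobs: seq[ElBlob]): bool =
  # stand-in for the block processor's KZG batch verification before persisting
  for b in blobs:
    if not b.validProof:
      return false
  true

proc persistBlockAndBlobs(blobs: seq[ElBlob]) =
  echo "persisted block with ", blobs.len, " blobs"

proc refetchViaRequestManagerOrGossip() =
  echo "dropping EL-provided blobs, re-fetching via gossip/req-resp"

let elBlobs = @[ElBlob(data: "a", validProof: true), ElBlob(data: "b", validProof: false)]
if verifyKzgProofs(elBlobs):
  persistBlockAndBlobs(elBlobs)
else:
  # a misbehaving EL is caught here instead of poisoning the database
  refetchViaRequestManagerOrGossip()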

CI is failing because of some nim-web3 upstream incompatibilities.


github-actions bot commented Feb 10, 2025

Unit Test Results

15 files ±0   2 630 suites ±0   1h 15m 11s ⏱️ +1m 10s
6 436 tests ±0   5 915 ✔️ ±0   521 💤 ±0   0 ±0
44 766 runs ±0   44 048 ✔️ ±0   718 💤 ±0   0 ±0

Results for commit 36c848b. ± Comparison against base commit f1e6eeb.

♻️ This comment has been updated with latest results.

@agnxsh
Contributor Author

agnxsh commented Feb 11, 2025

for devnet testing locally, use ethpandaops/nimbus-eth2:getBlobs

when consensusFork >= ConsensusFork.Electra:
  # Pull blobs and proofs from the EL blob pool
  let blobsFromElOpt = await node.elManager.sendGetBlobs(forkyBlck)
  debugEcho "pulled blobs from el"
Contributor

Obligatory comment about debugEcho (but yeah, it's a draft PR and useful for now)

Contributor Author

Just for now, until I see things are OK in the local devnet.

@agnxsh agnxsh force-pushed the getBlobs branch 3 times, most recently from 6ad1b3d to 03e11ee on February 14, 2025 06:00
@agnxsh
Contributor Author

agnxsh commented Feb 14, 2025


2025-02-14 13:24:27 INF 2025-02-14 07:54:27.006+00:00 Slot end                                   topics="beacnde" slot=234 nextActionWait=1s993ms123us121ns nextAttestationSlot=235 nextProposalSlot=236 syncCommitteeDuties=current head=4932c763:234
2025-02-14 13:24:28 WRN 2025-02-14 07:54:28.713+00:00 Peer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=1 wanted_peers=160
2025-02-14 13:24:29 INF 2025-02-14 07:54:28.999+00:00 Slot start                                 topics="beacnde" nextFork=Fulu:100000001 head=4932c763:234 delay=137us922ns finalized=5:5177fdfd peers=1 slot=235 sync=synced epoch=7
2025-02-14 13:24:29 pulled blobs from el
2025-02-14 13:24:29 9
2025-02-14 13:24:31 INF 2025-02-14 07:54:31.295+00:00 Attestation sent                           topics="beacval" attestation="(committee_index: 0, attester_index: 28, data: (slot: 235, index: 0, beacon_block_root: \"61632bd0\", source: \"6:f956b1f3\", target: \"7:a67fac0a\"), signature: \"929f5919\")" delay=-1s703ms988us327ns subnet_id=11
2025-02-14 13:24:31 INF 2025-02-14 07:54:31.295+00:00 Attestation sent                           topics="beacval" attestation="(committee_index: 0, attester_index: 52, data: (slot: 235, index: 0, beacon_block_root: \"61632bd0\", source: \"6:f956b1f3\", target: \"7:a67fac0a\"), signature: \"b5d6aa5c\")" delay=-1s703ms537us202ns subnet_id=11
2025-02-14 13:24:31 INF 2025-02-14 07:54:31.296+00:00 Sync committee message sent 

We're finalising with Nimbus running getBlobs in a devnet.

@agnxsh
Contributor Author

agnxsh commented Feb 14, 2025

We could make better decisions on whether to validate blob gossip or pull blobs from the EL every slot by caching the engine_exchangeCapabilities response; however, a good refresh interval needs to be chosen, since ELs can be swapped while the node is running.
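
For illustration, a minimal sketch of such a cache (hypothetical names, plain std/times instead of the codebase's own primitives, and a synchronous fetch callback standing in for the actual RPC):

import std/[times, sets]

type
  CapabilityCache = object
    capabilities: HashSet[string]
    lastRefresh: Time
    refreshInterval: Duration

proc supportsGetBlobsV1(cache: var CapabilityCache,
                        fetch: proc (): HashSet[string]): bool =
  # re-query engine_exchangeCapabilities only when the cached entry is stale;
  # the interval has to stay short enough to notice a swapped EL reasonably fast
  if getTime() - cache.lastRefresh > cache.refreshInterval:
    cache.capabilities = fetch()
    cache.lastRefresh = getTime()
  "engine_getBlobsV1" in cache.capabilities

# example: refresh at most once a minute
var cache = CapabilityCache(refreshInterval: initDuration(minutes = 1))
echo cache.supportsGetBlobsV1(proc (): HashSet[string] = toHashSet(["engine_getBlobsV1"]))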

@agnxsh agnxsh marked this pull request as ready for review February 16, 2025 21:03
@agnxsh agnxsh changed the title from "first draft of getBlobs" to "implement getBlobs and bypass blob gossip validation on successful blob retrievals from EL" Feb 16, 2025
@agnxsh
Contributor Author

agnxsh commented Mar 3, 2025

https://discordapp.com/channels/595666850260713488/1252403418941624532/1341996653149945927

An interesting piece of mainnet research from Jimmy (Lighthouse) suggests that the EL misses 0-8 blobs per minute; hence we shall only bypass gossip validation for blobs completely when the EL can serve a complete response (i.e. blobs from EL == KZG commitments length).

An alternative would be to fetch most blobs from the EL for every block and the rest via gossip validation, but in that case we'd likely spend some time working out what we missed from the EL, which isn't very ideal.

[Grafana screenshot for reference]
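
To illustrate the bookkeeping the comment above wants to avoid, a small self-contained Nim sketch (the values are arbitrary): engine_getBlobsV1 returns one entry per requested versioned hash, with null (here: none) for blobs the EL does not have.

import std/options

let elResponse = @[some("blob0"), none(string), some("blob2")]

var missing: seq[int]
for i, entry in elResponse:
  if entry.isNone:
    missing.add i

if missing.len == 0:
  echo "complete EL response: safe to bypass gossip validation for this block"
else:
  # partial response: these indices still have to be validated via gossip,
  # which is the extra tracking the comment calls "not very ideal"
  echo "still waiting on gossip for blob indices: ", missing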

let block_root = hash_tree_root(block_header)

let vEl =
  await self.validateBlobSidecarFromEL(block_root)
Contributor

it seems like in the case of a pathological EL, this could delay things by 1 second?
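
For illustration, a minimal chronos-style sketch of bounding such a wait with a deadline; getBlobsFromEl and the numbers are made up, and this is not the PR's code:

import chronos

proc getBlobsFromEl(): Future[int] {.async.} =
  # stand-in for the engine_getBlobsV1 round-trip; pretend the EL is slow
  await sleepAsync(2.seconds)
  return 9

proc demo() {.async.} =
  let fut = getBlobsFromEl()
  if await fut.withTimeout(500.milliseconds):
    echo "EL answered in time, got ", fut.read(), " blobs"
  else:
    # a pathological EL no longer stalls the pipeline; fall back to gossip
    echo "EL too slow, relying on blob gossip for this block"

waitFor demo()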

@agnxsh agnxsh force-pushed the getBlobs branch 2 times, most recently from 7a4060a to f1d736e on March 31, 2025 08:00
el_blob_loss = forkyBlck.message.body.blob_kzg_commitments.len - blobsEl.len

debug "Time taken to receive partially response from EL",
  received_percent = float((blobsEl.len div forkyBlck.message.body.blob_kzg_commitments.len) * 100),
Contributor

div is integer division; this doesn't do what it seems to want, because the float conversion is pointless. It will return 0 all the time (or 100)
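
For reference, a tiny sketch of computations that would give the intended percentage (the values are arbitrary):

let
  blobsFromEl = 7
  kzgCommitments = 9

# as written above: `div` truncates first, so float((7 div 9) * 100) == 0.0
echo float((blobsFromEl div kzgCommitments) * 100)      # 0.0

# either divide as floats, or multiply before the integer division
echo blobsFromEl.float / kzgCommitments.float * 100.0   # 77.77...
echo blobsFromEl * 100 div kzgCommitments               # 77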


el_blob_loss = forkyBlck.message.body.blob_kzg_commitments.len - blobsEl.len

debug "Time taken to receive partially response from EL",
Contributor

partial response (or partially receive response, whichever seems more appropriate, but the "receive" here seems done and not in progress, so I'd lean toward former)

if blobsEl.len == forkyBlck.message.body.blob_kzg_commitments.len:

  # we have got all blobs from EL, now we can
  # conveniently the blobless block from quarantine
Contributor

verb?

@@ -149,9 +157,41 @@ type

ValidationRes* = Result[void, ValidationError]


Contributor

?

@@ -2095,6 +2098,33 @@ proc installMessageValidators(node: BeaconNode) =
  await node.processor.processBlsToExecutionChange(
    MsgSource.gossip, msg)))

when consensusFork >= ConsensusFork.Electra:
  # blob_sidecar_{subnet_id}
  # https://github.com/ethereum/consensus-specs/blob/v1.4.0-beta.5/specs/deneb/p2p-interface.md#blob_sidecar_subnet_id
Contributor

Newly introduced URLs should be current version (currently v1.5.0-beta.3)


for idx, req in requests:
  if not(req.finished()):
    warn "Timeout while getting blob and proof",
Contributor

Is this really warn? It's not necessarily a Nimbus error, and it shouldn't disrupt Nimbus functioning because of the race. So one could end up with a scary-looking WRN but no malfunction.

second = if raced == f1b: f2 else: f1

try:
  let res1 = await first
Contributor

  ## Returns a future which will complete and return finished FutureBase,
  ## when one of the given futures will be completed, failed or canceled.
  ##
  ## On success returned Future will hold finished FutureBase.
  ##
  ## On cancel futures in ``futs`` WILL NOT BE cancelled.

under what circumstance should first have to be awaited?

Explicitly:

Returns a future which will complete and return finished FutureBase

That is, one should just be able to call complete() on it? cc @cheatfate

This might not apply to await second, because that could still be in progress, but by definition the winner of the race should be completeable.
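
A self-contained chronos sketch of that point, assuming a race() with the quoted semantics (the winner is already finished, so read() suffices; only the loser may still need an await):

import chronos

proc demo() {.async.} =
  let
    f1 = sleepAsync(10.milliseconds)
    f2 = sleepAsync(20.milliseconds)
    raced = await race(f1, f2)   # completes once either f1 or f2 finishes

  # the raced winner is finished by definition, so there is nothing to await;
  # read() returns its value (or re-raises its stored error) synchronously
  if raced == FutureBase(f1):
    f1.read()
    await f2                     # the other future may still be pending
  else:
    f2.read()
    await f1

waitFor demo()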

if (let o = self.quarantine[].getBlobless(block_root); o.isSome):
  let blobless = o.get()
  withBlck(blobless):
    when consensusFork >= ConsensusFork.Electra:
Contributor

When this isn't true, it's a bug, but also kind of falls off the edge of the control flow of withBlck(blobless) and implicitly returns some default thing. Better to handle this case explicitly somehow and flag it as an error (doAssert, even, if it's really a logic error to enter this function with a pre-Electra block, but only if that can be actually guaranteed, otherwise some usual runtime error).
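
Continuing the quoted snippet, the shape being suggested looks roughly like the following; processBlobsFromEl is a placeholder and this is not compilable on its own:

if (let o = self.quarantine[].getBlobless(block_root); o.isSome):
  let blobless = o.get()
  withBlck(blobless):
    when consensusFork >= ConsensusFork.Electra:
      processBlobsFromEl(forkyBlck)   # placeholder for the Electra-only work
    else:
      # make the impossible case loud instead of silently returning a default;
      # doAssert only if callers truly guarantee Electra+, otherwise a normal error
      doAssert false, "getBlobs path entered with a pre-Electra block"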

else:
  node.dag.cfg.BLOB_SIDECAR_SUBNET_COUNT
# It's safe to try as many times to fetch EL blobs as
# there are blob subnets.
Contributor

There might be more blobs than blob subnets, so for any given block, each subnet might be triggered multiple times. This doesn't happen in Dencun and won't in Pectra, but in theory it could, so the comment at least is misleading.

The code structure works fine for this -- it will just trigger the toValidationRace(...) per blob, not per blob subnet.

toValidationResult(
node.processor[].processBlobSidecar(
): Future[ValidationResult] {.async: (raises: [CancelledError]).} =
return toValidationResult(
Contributor

Chronos doesn't need the explicit return anymore for async functions.
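
For illustration, a minimal sketch assuming a recent chronos, where the trailing expression is the result and the `return` above can simply be dropped:

import chronos

proc answer(): Future[int] {.async.} =
  42   # trailing expression is the result; no explicit `return` needed

echo waitFor(answer())   # 42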
