-
Notifications
You must be signed in to change notification settings - Fork 906
[Merged by Bors] - Reduce false positive logging for late builder blocks #4073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, nice improvement.
Will need a few sprinklings of into_response()
to resolve the conflict with my http_api
refactor PR.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
3826037
to
bcecf95
Compare
Oops, I messed up an attempted sneaky-rebase and had to do a public rebase! |
bors r+ |
## Issue Addressed NA ## Proposed Changes When producing a block from a builder, there are two points where we could consider the block "broadcast": 1. When the blinded block is published to the builder. 2. When the un-blinded block is published to the P2P network (this is always *after* the previous step). Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder. This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards. ## Additional Info One could argue that the builder *should* return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.
Build failed (retrying...): |
## Issue Addressed NA ## Proposed Changes When producing a block from a builder, there are two points where we could consider the block "broadcast": 1. When the blinded block is published to the builder. 2. When the un-blinded block is published to the P2P network (this is always *after* the previous step). Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder. This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards. ## Additional Info One could argue that the builder *should* return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.
Build failed (retrying...): |
I think this might have broken block unblinding (based on the CI failures). Haven't looked in detail yet |
## Issue Addressed NA ## Proposed Changes When producing a block from a builder, there are two points where we could consider the block "broadcast": 1. When the blinded block is published to the builder. 2. When the un-blinded block is published to the P2P network (this is always *after* the previous step). Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder. This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards. ## Additional Info One could argue that the builder *should* return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.
Build failed: |
.try_into_full_block(Some(full_payload)) | ||
.map(Arc::new) | ||
.map(ProvenancedBlock::Builder), | ||
None => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be what's causing issues in CI, with pre-merge blocks this will return None
here causing the Unable to add payload to block
error where before calling block.try_into_full_block(None)
on a pre-merge block would still return the block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, super shoddy oversight on my behalf. Thanks for point this out and thanks tests!
Fixed in 408af7f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lookin' good!
bors r+ |
## Issue Addressed NA ## Proposed Changes When producing a block from a builder, there are two points where we could consider the block "broadcast": 1. When the blinded block is published to the builder. 2. When the un-blinded block is published to the P2P network (this is always *after* the previous step). Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder. This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards. ## Additional Info One could argue that the builder *should* return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as #4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as #4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as #4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as sigp#4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as sigp#4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
NA When producing a block from a builder, there are two points where we could consider the block "broadcast": 1. When the blinded block is published to the builder. 2. When the un-blinded block is published to the P2P network (this is always *after* the previous step). Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder. This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards. One could argue that the builder *should* return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.
## Issue Addressed NA ## Proposed Changes Downgrade a `CRIT` to an `ERRO` when there's an `Irrecoverable` error whilst publishing a blinded block. It's quite common for builders successfully broadcast a block to the network whilst failing to respond to the BN when it publishes a signed, blinded block. The VC is currently raising a `CRIT` when this happens and I think that's excessive. These changes have the same intent as sigp#4073. In that PR I only managed to remove the `CRIT`s in the BN but missed this one in the VC. I've also tidied the log messages to: - Give them all the same title (*"Error whilst producing block"*) to help with grepping. - Include the `block_slot` so it's easy to look up the slot in an explorer and see if it was actually skipped. ## Additional Info This PR should not change any logic beyond logging.
Issue Addressed
NA
Proposed Changes
When producing a block from a builder, there are two points where we could consider the block "broadcast":
Our logging for late block broadcasts was using (2) for builder-blocks, which was creating a lot of false-positive logs. This is because the builder publishes the block on the P2P network themselves before returning it to us and we perform (2). For clarity, the logs were false-positives because we claim that the block was published late by us when it was actually published earlier by the builder.
This PR changes our logging behavior so we do our logging at (1) instead. It also updates our metrics for block broadcast to distinguish between local and builder blocks. I believe the metrics change will be natively compatible with existing Grafana dashboards.
Additional Info
One could argue that the builder should return the block to us faster, however that's not the case. I think it's more important that we don't desensitize users with false-positives.