[Consensus 2.0] Additional metrics & logs #18075

Merged: 3 commits, Jun 6, 2024

Conversation

akichidis
Contributor

Description

More metrics and logs related to block receipt, acceptance, and commit, as well as subscriptions, etc.

Test plan

CI


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:


.subscriber_connections
.with_label_values(&[peer_hostname, "inbound"])
.set(0);
}
Contributor Author

I would like to test how reliable this approach is, but overall it would be nice to have metrics for the subscribers to "our" node.

Member

Good idea. I think we should use separate metrics for inbound and outbound though so the metric as a whole makes sense.

Contributor Author

Splitting them makes sense 👍

@@ -57,6 +58,9 @@ pub(crate) struct BlockManager {
/// Keeps all the blocks that we actually miss and haven't fetched them yet. That set will basically contain all the
/// keys from the `missing_ancestors` minus any keys that exist in `suspended_blocks`.
missing_blocks: BTreeSet<BlockRef>,
/// A vector that holds a tuple of (lowest_round, highest_round) of received blocks per authority.
/// This is used for metrics reporting purposes and resets during restarts.
received_block_rounds: Vec<(Round, Round)>,
Member

Updated to use `Vec<Option<(Round, Round)>>` here. Initializing the lowest round to 0 disables the field, and initializing the highest round to 0 can lead to metric anomalies as well.
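
A minimal sketch of the idea in this comment, not the actual `BlockManager` code. The `ReceivedRounds` struct and its `record` method are hypothetical names, and `Round` is assumed to be a `u32` alias for illustration. With a `(0, 0)` starting value, `min(0, round)` would keep the lowest round stuck at 0, and an authority with no received blocks would report a highest round of 0; starting from `None` avoids both cases.

```rust
/// Stand-in for the consensus crate's `Round` type (assumed to be `u32` here).
type Round = u32;

/// Hypothetical tracker used only for this illustration.
struct ReceivedRounds {
    /// One entry per authority: `None` until the first block is received,
    /// then `Some((lowest_round, highest_round))`.
    rounds: Vec<Option<(Round, Round)>>,
}

impl ReceivedRounds {
    fn new(num_authorities: usize) -> Self {
        Self {
            rounds: vec![None; num_authorities],
        }
    }

    /// Records a received block and returns the updated (lowest, highest) pair.
    fn record(&mut self, authority: usize, round: Round) -> (Round, Round) {
        let updated = match self.rounds[authority] {
            // First block from this authority: both bounds start at its round.
            None => (round, round),
            // Later blocks widen the range in either direction.
            Some((lowest, highest)) => (lowest.min(round), highest.max(round)),
        };
        self.rounds[authority] = Some(updated);
        updated
    }
}
```

Metrics would then be set only for authorities whose entry is `Some(..)`, so an authority that has sent nothing since the restart reports no value instead of a misleading 0.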

let block = match self.try_accept_one_block(block) {
TryAcceptResult::Accepted(block) => block,
TryAcceptResult::Suspended(ancestors_to_fetch) => {
debug!(
Member

We may need to check the log volume in private testnet. Maybe this can be `trace!` instead.

Contributor Author

Agreed

self.context
.metrics
.node_metrics
.last_committed_authority_round
Member

Switching to use last committed round per authority from DagState. That should report more consistent data across restarts, especially for validators that stop creating blocks.

self.context
.metrics
.node_metrics
.committed_blocks
Member

`consensus_committed_messages` should report the same data, and it is consistent within an epoch across restarts.

@@ -119,6 +112,13 @@ impl<C: NetworkClient, S: NetworkService> Subscriber<C, S> {
let mut retries: i64 = 0;
let mut delay = INITIAL_RETRY_INTERVAL;
'subscription: loop {
context
Member

Good fix.

@mwtian mwtian force-pushed the akichidis/add-metrics-and-logs-1 branch from 373e833 to ff2ac36 Compare June 5, 2024 22:32
@mwtian mwtian force-pushed the akichidis/add-metrics-and-logs-1 branch from ff2ac36 to 36252ad Compare June 5, 2024 22:43
@mwtian mwtian force-pushed the akichidis/add-metrics-and-logs-1 branch from 36252ad to 837f4a5 Compare June 5, 2024 22:47
synchronizer_fetched_blocks_by_peer: register_int_counter_vec_with_registry!(
"synchronizer_fetched_blocks_by_peer",
"Number of fetched blocks per peer authority via the synchronizer and also by block authority",
&["peer", "type"],
Member

The cardinality of # peers * # authorities could be just too large (>10000). We can test in private testnet but for now I'm splitting them into two metrics.
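
The cardinality concern can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the figure of 100 peers and 100 authorities is an assumed example rather than a number from this PR. A single metric labeled by both peer and block authority produces one time series per pair, while two separate metrics produce one series per peer plus one per authority. Any additional label on the same metric, such as `type` here, multiplies the count further.

```rust
/// Time series count for one metric labeled by (peer, block authority).
fn combined_series(peers: u64, authorities: u64) -> u64 {
    peers * authorities
}

/// Total time series count for two metrics: one labeled by peer,
/// the other labeled by block authority.
fn split_series(peers: u64, authorities: u64) -> u64 {
    peers + authorities
}
```

Under that assumption, `combined_series(100, 100)` is 10,000 series per node, while `split_series(100, 100)` is 200. The trade-off is that the split metrics can no longer answer which peer served blocks from which authority.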

@akichidis akichidis merged commit 12aa1d1 into main Jun 6, 2024
47 checks passed
@akichidis akichidis deleted the akichidis/add-metrics-and-logs-1 branch June 6, 2024 17:45
arun-koshy pushed a commit that referenced this pull request Jun 6, 2024
Co-authored-by: MW Tian <mingwei@mystenlabs.com>
tx-tomcat pushed a commit to tx-tomcat/sui-network that referenced this pull request Jul 29, 2024
3 participants