Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

report udp stats from validator #20587

Merged
merged 12 commits into from
Oct 15, 2021
Merged

report udp stats from validator #20587

merged 12 commits into from
Oct 15, 2021

Conversation

jbiseda
Copy link
Contributor

@jbiseda jbiseda commented Oct 11, 2021

Problem

UDP stats are collected from net-stats.sh which isn't commonly run by validators.

Summary of Changes

Collect UDP stats from the validator process.

Screen Shot 2021-10-11 at 4 02 51 AM

Fixes #

@jbiseda jbiseda marked this pull request as ready for review October 11, 2021 11:57
@codecov
Copy link

codecov bot commented Oct 11, 2021

Codecov Report

Merging #20587 (5ec0d9a) into master (ad0a88f) will increase coverage by 0.0%.
The diff coverage is 84.5%.

@@           Coverage Diff            @@
##           master   #20587    +/-   ##
========================================
  Coverage    81.8%    81.9%            
========================================
  Files         495      496     +1     
  Lines      138041   138138    +97     
========================================
+ Hits       113049   113168   +119     
+ Misses      24992    24970    -22     

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the idea looks good to me. I have some minor comments on tweaks that I've left below. I haven't looked at getting these stats before, so my review assumes that part is correct. Lastly, I'll defer to @carllin for things like (1) the process/module name, and (2) how frequently this task should run (i.e. sleep & sample interval).

thread_hdl: JoinHandle<()>,
}

#[allow(dead_code)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few spots where #[allow(dead_code)] has been added, but I wasn't sure why. Is it needed for the non-linux case due to the #[cfg(not(target_os = "linux"))] spots?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is due to the #[cfg(not(target_os = "linux"))] spots. I didn't want to remove these for non-linux so that the unit test can run on non-linux platforms. This seemed cleaner than prefixing with "_". Is there a better way to annotate this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Unfortunately I don't know of a better way to handle this.

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. I think someone else should give a final Approve, since I don't have a good picture of the whole system to know if there are any unknown-to-me negative interactions due to adding a new service.

You know, maybe adding a CLI arg like --no-system-monitor to disable this might be a good idea? From what I understand, most voting/staked validators turn off metrics, so it seems like turning off this service might be a reasonable request.

@carllin What do you think?


#[cfg(target_os = "linux")]
fn report_net_stats(udp_stats: &mut Option<UdpStats>) {
match read_udp_stats("/proc/net/snmp") {
Copy link
Contributor

@carllin carllin Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Extract /proc/net/snmp into a constant string above

Copy link
Contributor

@carllin carllin Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On startup of this service in new(), could we verify that we have access to this path, and if not, return an error and fail validator startup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! This path could be a CLI arg too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin would we want to fail startup? It will currently log a WARN message. I don't think access would typically be a problem. /proc/net is a symlink to /proc/self/net which is owned by the process.

@brooksprumo I don't think this path would change unless you hacked and recompiled your kernel. Changing this path would break things like netstat.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbiseda Gotcha. I guess if we ever need to support a different path it'll be easy to add it then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbiseda yeah people have a tendency to not check their logs :P, I think a hard failure so that validators don't silently underreport is a good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin made this a fatal error or startup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbiseda, hmm I dont see the change, is this pushed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin pushed now

@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 12, 2021

You know, maybe adding a CLI arg like --no-system-monitor to disable this might be a good idea? From what I understand, most voting/staked validators turn off metrics, so it seems like turning off this service might be a reasonable request.

I'm adding this here because we really would like to have these stats and validators aren't running the net-stats.sh script. I think from the last incident there were only 2 validators that had reported these OS level stats. I would suspect that if you didn't want to report stats you would just leave SOLANA_METRICS_CONFIG unset.

carllin
carllin previously approved these changes Oct 15, 2021
@mergify mergify bot dismissed carllin’s stale review October 15, 2021 00:46

Pull request has been modified.

Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@jbiseda jbiseda merged commit 4cac662 into solana-labs:master Oct 15, 2021
@jbiseda jbiseda added the v1.8 label Oct 19, 2021
mergify bot pushed a commit that referenced this pull request Oct 19, 2021
(cherry picked from commit 4cac662)

# Conflicts:
#	core/src/validator.rs
mergify bot added a commit that referenced this pull request Oct 20, 2021
* report udp stats from validator (#20587)

(cherry picked from commit 4cac662)

# Conflicts:
#	core/src/validator.rs

* resolve merge conflicts

Co-authored-by: Jeff Biseda <jbiseda@gmail.com>
dankelleher pushed a commit to identity-com/solana that referenced this pull request Nov 24, 2021
frits-metalogix added a commit to identity-com/solana that referenced this pull request Nov 24, 2021
@jbiseda jbiseda deleted the net-stats branch January 31, 2023 23:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants