-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20587 +/- ##
========================================
Coverage 81.8% 81.9%
========================================
Files 495 496 +1
Lines 138041 138138 +97
========================================
+ Hits 113049 113168 +119
+ Misses 24992 24970 -22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the idea looks good to me. I have some minor comments on tweaks that I've left below. I haven't looked at getting these stats before, so my review assumes that part is correct. Lastly, I'll defer to @carllin for things like (1) the process/module name, and (2) how frequently this task should run (i.e. sleep & sample interval).
core/src/system_monitor_service.rs
Outdated
thread_hdl: JoinHandle<()>, | ||
} | ||
|
||
#[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few spots where #[allow(dead_code)]
has been added, but I wasn't sure why. Is it needed for the non-linux case due to the #[cfg(not(target_os = "linux"))]
spots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is due to the #[cfg(not(target_os = "linux"))]
spots. I didn't want to remove these for non-linux so that the unit test can run on non-linux platforms. This seemed cleaner than prefixing with "_". Is there a better way to annotate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Unfortunately I don't know of a better way to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me. I think someone else should give a final Approve, since I don't have a good picture of the whole system to know if there are any unknown-to-me negative interactions due to adding a new service.
You know, maybe adding a CLI arg like --no-system-monitor
to disable this might be a good idea? From what I understand, most voting/staked validators turn off metrics, so it seems like turning off this service might be a reasonable request.
@carllin What do you think?
core/src/system_monitor_service.rs
Outdated
|
||
#[cfg(target_os = "linux")] | ||
fn report_net_stats(udp_stats: &mut Option<UdpStats>) { | ||
match read_udp_stats("/proc/net/snmp") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Extract /proc/net/snmp
into a constant string above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On startup of this service in new()
, could we verify that we have access to this path, and if not, return an error and fail validator startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! This path could be a CLI arg too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin would we want to fail startup? It will currently log a WARN message. I don't think access would typically be a problem. /proc/net
is a symlink to /proc/self/net
which is owned by the process.
@brooksprumo I don't think this path would change unless you hacked and recompiled your kernel. Changing this path would break things like netstat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbiseda Gotcha. I guess if we ever need to support a different path it'll be easy to add it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbiseda yeah people have a tendency to not check their logs :P, I think a hard failure so that validators don't silently underreport is a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin made this a fatal error or startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbiseda, hmm I dont see the change, is this pushed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin pushed now
I'm adding this here because we really would like to have these stats and validators aren't running the net-stats.sh script. I think from the last incident there were only 2 validators that had reported these OS level stats. I would suspect that if you didn't want to report stats you would just leave |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
(cherry picked from commit 4cac662) # Conflicts: # core/src/validator.rs
This reverts commit 46133b6.
Problem
UDP stats are collected from net-stats.sh which isn't commonly run by validators.
Summary of Changes
Collect UDP stats from the validator process.
Fixes #