-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Initial checkin of resource monior plugin #9098
Conversation
|
I'm not sure if 95% is a good default value for file system space. By default some file systems reserve 5% for superuser. So if nodeos is running as a non-privileged user, then writes will fail before the new check will fail. (eg https://ma.ttias.be/change-reserved-blocks-ext3-ext4-filesystem-linux/) Of course the check is customizable, but maybe more conservative would be better as a default? |
|
I'm a bit concerned this isn't easy enough for users to make use of: it requires them to manually configure the directories to monitor. IMO we should explore having other plugins register their directories. For example, maybe trace plugin does something like if(auto resmon_plugin = app().find_plugin<resource_monitor_pligin>())
resmon_plugin->monitor_directory(trace_dir);This way the directories that are used are automatically monitored (we could maybe even make resource_monitor_plugin default enabled). Be aware in the above example we would need to be mindful about plugin initialization and startup order -- trace_api_plugin configures trace_dir during its initialize() but that may be before resource_monitor_plugin's initialize(). |
|
I was thinking of always monitoring anything under "data" directory but then thought the user |
b1bart
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, this system is not that complex however, the description mentions a path for future additions. As this grows it will need a strategy for testability.
In other parts of the code we have tried to abstract dependencies (in this case boost::filesystem calls) from the code so that unit tests can create very simple "mocks" for nominal and edge cases. Particularly as we start to measure other system stats we may have a very difficult time triggering edge cases safely.
As a model, you can look at the trace-api plugin and the template-based "provider pattern" or pull from other sources of patterns. The primary point is to allow for controlled repeatable testing of the logic that will ultimately product a production environment. How this is accomplished is a secondary or tertiary concern.
The specifics will have to be investigated and studied. I believe the plugin object gets instantiated as a global constructor so all plugins -- even plugins not enabled -- will be available via if(auto resmon_plugin = app().find_plugin<resource_monitor_pligin>())
resmon_plugin->monitor_directory(trace_dir);may call in to resource_monitor_plugin before |
|
What you do know is at time of |
|
I think I can add a function to be called by resource monitor's plugin_startup. |
… and integration test cases, terminate thread right after plugin shutdown, and make apps register monitored directories
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
| EOS_ASSERT(threshold > 0 && threshold < 100, chain::plugin_config_exception, | ||
| "\"resource-monitor-space-threshold\" must be greater than 0 and less than 100"); | ||
|
|
||
| space_handler.set_threshold(threshold); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving the assert into the set_threshold method so you don't have to guard against overflow in your calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to capture invalid input as early as possible.
|
Should the plugin be default enabled? Seems like now that it is monitoring all the directories automatically that would be desirable? |
…nge bash script to python tomorrow)
…n test from Bash to Python next), make shutdown on exceeding threshold as default, increase interval upper bound to 5 minutes
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
plugins/resource_monitor_plugin/include/eosio/resource_monitor_plugin/file_space_handler.hpp
Outdated
Show resolved
Hide resolved
…e_monitor_plugin_impl, changed integration test script from Bash to Python
…and relax timing requirements for test cases
Change Description
This PR resolves https://blockone.atlassian.net/projects/EPE/issues/EPE-43.
Background
It was reported from a customer that when file system which
data/blocksbelongs to isrunning out of space, the producer continued to produce blocks and update state but
blocks.logwas corrupted in a sense that it no longer contained all the irreversible blocks.It was also observed that when file system which
data/statebelongs is running out of space, nodeos will crash with SIGBUS as the state file is unable to acquire new pages.Design
The solution is to have a dedicated plugin to monitor resource usages (file system space for now,
CPU, memory, and networking bandwidth in the future). The plugin uses a thread to periodically
check space usage of file systems of directories being monitored. If space used
is over a predefined threshold, a graceful shutdown is initiated.
The directories to be monitored are configured. They can be any paths for of blocks, state, trace, state history, and others.
Alternative Solution
Alternative solution to blocks directory space exhaustion was considered: starting from
block_log::append, make sure all exceptions was handled properly. But it was deemed to betoo tedious and error prone to review all possible exceptions, and might likely break consensus. The plugin approach is more generic, monitoring any file systems and without modifying
existing code.
Additional Bug Fixed
This PR also fixes a bug in
log_irreversibleoflibraries/chain/controller.cpp.db.commitwas done beforeblog.append. This could be a problem ifblog.appendthrew.Now
db.commitis done afterblog.append. Even ifblog.appendthrows, DB can be rolled back.Consensus Changes
API Changes
Documentation Additions
resource_monitor_plugin is a new plugin. It is not must have but recommended to be used.
Configurable Parameters
resource-monitor-interval-seconds
Time in seconds between two consecutive checks of resource usage. It should be between 1 and 300 (5 minutes). This is optional; the default value is 2 seconds.
resource-monitor-space-threshold
Threshold in terms of percentage of used space vs total space. If used space is above (threshold - 5%), a warning is generated. If used space is above the threshold and resource-monitor-not-shutdown-on- threshold-exceeded is enabled, a graceful shutdown is initiated. The value should be between 6 and 99
resource-monitor-not-shutdown-on-threshold-exceeded
Used to indicate nodeos will not shutdown when threshold is exceeded.