-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud_storage: Add failure injection to cloud storage API #9664
base: dev
Are you sure you want to change the base?
Conversation
137c7e9
to
952f13a
Compare
9134d5d
to
442b3f6
Compare
static std::optional<ss::sstring> | ||
get_env_var(const char* name, const char* warn_message) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we expose it through admin /debug or /finjector endpoints etc...?
void failure_injector::load_from_string(const char* ctrl_string) { | ||
std::stringstream str(ctrl_string); | ||
std::string t; | ||
if (std::getline(str, t, ':')) { | ||
_freq = boost::lexical_cast<int32_t>(t); | ||
while (std::getline(str, t, ':')) { | ||
auto res = string_switch<injected_failure_type>(t) | ||
.match("failure", injected_failure_type::failure) | ||
.match("none", injected_failure_type::none) | ||
.match( | ||
"no_such_key", injected_failure_type::no_such_key) | ||
.match("throttle", injected_failure_type::throttle) | ||
.default_match(injected_failure_type::none); | ||
_types.push_back(res); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a bit about how to generalize this, I wonder if it makes sense to add these to e.g. the node_config as a set of properties that encode these same errors.
Something like:
// Failure rates.
property<std::optional<float>> cloud_storage_remote_failure_rate;
property<std::optional<float>> cloud_storage_remote_no_such_key_rate;
// Defines some bounds with which to induce random latencies.
property<std::optional<int>> cloud_storage_remote_inject_latency_min_ms;
property<std::optional<int>> cloud_storage_remote_inject_latency_max_ms;
We should still be able to seed the RNG in a way that is somewhat deterministic, and I guess it wouldn't be all too different from what you have here, but reusing properties in this way may be more intuitive given we use them widely already today.
I'm fine proceeding with the current style since it's a net win for code coverage, but it's worth discussing how to lower the barrier to entry of more failure injection and increase intuitiveness if we can to begin using these more widely across the codebase.
Add failure injection mechanism into the cloud_storage::remote. The mechanism can be enabled using the RP_INJECT_CLOUD_STORAGE_API_ERRORS environment variable. The format of the env var follows: <frequency>:<failure-type1>:<failure-type2> .. The value starts with frequency parameter. It's an integer greater than 0. Value 1 means that every cloud storage api call triggers a failure, 3 means every third, 10 means 1 call out of 10 etc. The 'frequency' is followed by a list of failure types. There are four types of failures. - throttle - no_such_key - failure - none The 'throttle' triggers 'SlowDown' responses. The 'no_such_key' the retryable responses, and 'failure' triggers not retryable responses. The 'none' type doesn't trigger an error response but still generates a sleep. Example: "100:failiure:throttle:none" The failure types can be repeated to make one of them more frequent like this: "100:throttle:throttle:failure none" The failure is implemented in the following way. First, the RNG is used to decide if the failure has to be injected. If RNG() % Frequency is equal to 0 then the failure is injected. Then the RNG is used to select a failure type. Then the same approach is used to select a timeout value. The upper limit for timeout is a timeout of the current api call. When the generated failure is used it invokes 'ss::sleep_abortable' and then returns a selected error. This is done in 'generate_get_failure' and 'generate_put_failure' methods. The method may return nullopt if the failure type is 'none' but it will still invoke 'ss::sleep_abortable'. This is needed to simulate situation when the actual failure is not happening but the operation times out anyway. The 'no_such_key' error can only be triggered by the 'get' request.
Add new parameter to the settings and pass failure injection env var if it's set.
Run basic E2E test for tiered storage with failure injection turned on. The test is similar to the existing one but produces 10x more data to guarante some meaningful number of injected failures to appear.
442b3f6
to
358511a
Compare
Add failure injection mechanism into the cloud_storage::remote. The mechanism can be enabled using the RP_INJECT_CLOUD_STORAGE_API_ERRORS environment variable. The format of the env var follows:
The value starts with frequency parameter. It's an integer greater than 0. Value 1 means that every cloud storage api call triggers a failure, 3 means every third, 10 means 1 call out of 10 etc.
The 'frequency' is followed by a list of failure types. There are four types of failures.
Example:
The failure types can be repeated to make one of them more frequent like this:
The failure is implemented in the following way. First, the RNG is used to decide if the failure has to be injected. If RNG() % Frequency is equal to 0 then the failure is injected. Then the RNG is used to select a failure type. Then the same approach is used to select a timeout value. The upper limit for timeout is a timeout of the current api call.
When the generated failure is used it invokes 'ss::sleep_abortable' and then returns a selected error. This is done in 'generate_get_failure' and 'generate_put_failure' methods. The method may return nullopt if the failure type is 'none' but it will still invoke 'ss::sleep_abortable'. This is needed to simulate situation when the actual failure is not happening but the operation times out anyway.
Backports Required
Release Notes
Improvements