Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: Add failure injection to cloud storage API #9664

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Mar 27, 2023

Add failure injection mechanism into the cloud_storage::remote. The mechanism can be enabled using the RP_INJECT_CLOUD_STORAGE_API_ERRORS environment variable. The format of the env var follows:

	<frequency>:<failure-type1>:<failure-type2> ..

The value starts with frequency parameter. It's an integer greater than 0. Value 1 means that every cloud storage api call triggers a failure, 3 means every third, 10 means 1 call out of 10 etc.

The 'frequency' is followed by a list of failure types. There are four types of failures.

  • throttle
  • no_such_key
  • failure
  • none The 'throttle' triggers 'SlowDown' responses. The 'no_such_key' the retryable responses, and 'failure' triggers not retryable responses. The 'none' type doesn't trigger an error response but still generates a sleep.

Example:

"100:failiure:throttle:none"

The failure types can be repeated to make one of them more frequent like this:

"100:throttle:throttle:failure:none"

The failure is implemented in the following way. First, the RNG is used to decide if the failure has to be injected. If RNG() % Frequency is equal to 0 then the failure is injected. Then the RNG is used to select a failure type. Then the same approach is used to select a timeout value. The upper limit for timeout is a timeout of the current api call.

When the generated failure is used it invokes 'ss::sleep_abortable' and then returns a selected error. This is done in 'generate_get_failure' and 'generate_put_failure' methods. The method may return nullopt if the failure type is 'none' but it will still invoke 'ss::sleep_abortable'. This is needed to simulate situation when the actual failure is not happening but the operation times out anyway.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Add failure injection for cloud storage API

@Lazin Lazin added the area/cloud-storage Shadow indexing subsystem label Mar 27, 2023
@Lazin Lazin force-pushed the feature/inject-cloud-errors branch from 137c7e9 to 952f13a Compare March 27, 2023 14:59
@Lazin Lazin requested review from VladLazar and andrwng March 27, 2023 15:01
@Lazin Lazin force-pushed the feature/inject-cloud-errors branch 2 times, most recently from 9134d5d to 442b3f6 Compare March 27, 2023 19:44
Comment on lines +47 to +48
static std::optional<ss::sstring>
get_env_var(const char* name, const char* warn_message) {
Copy link
Member

@dotnwat dotnwat Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we expose it through admin /debug or /finjector endpoints etc...?

Comment on lines +149 to +173
void failure_injector::load_from_string(const char* ctrl_string) {
std::stringstream str(ctrl_string);
std::string t;
if (std::getline(str, t, ':')) {
_freq = boost::lexical_cast<int32_t>(t);
while (std::getline(str, t, ':')) {
auto res = string_switch<injected_failure_type>(t)
.match("failure", injected_failure_type::failure)
.match("none", injected_failure_type::none)
.match(
"no_such_key", injected_failure_type::no_such_key)
.match("throttle", injected_failure_type::throttle)
.default_match(injected_failure_type::none);
_types.push_back(res);
}
}
}
Copy link
Contributor

@andrwng andrwng Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit about how to generalize this, I wonder if it makes sense to add these to e.g. the node_config as a set of properties that encode these same errors.

Something like:

// Failure rates.
property<std::optional<float>> cloud_storage_remote_failure_rate;
property<std::optional<float>> cloud_storage_remote_no_such_key_rate;

// Defines some bounds with which to induce random latencies.
property<std::optional<int>> cloud_storage_remote_inject_latency_min_ms;
property<std::optional<int>> cloud_storage_remote_inject_latency_max_ms;

We should still be able to seed the RNG in a way that is somewhat deterministic, and I guess it wouldn't be all too different from what you have here, but reusing properties in this way may be more intuitive given we use them widely already today.

I'm fine proceeding with the current style since it's a net win for code coverage, but it's worth discussing how to lower the barrier to entry of more failure injection and increase intuitiveness if we can to begin using these more widely across the codebase.

Lazin added 3 commits March 28, 2023 08:04
Add failure injection mechanism into the cloud_storage::remote. The
mechanism can be enabled using the RP_INJECT_CLOUD_STORAGE_API_ERRORS
environment variable. The format of the env var follows:

	<frequency>:<failure-type1>:<failure-type2> ..

The value starts with frequency parameter. It's an integer greater than
0. Value 1 means that every cloud storage api call triggers a failure,
3 means every third, 10 means 1 call out of 10 etc.

The 'frequency' is followed by a list of failure types. There are four
types of failures.
  - throttle
  - no_such_key
  - failure
  - none
The 'throttle' triggers 'SlowDown' responses. The 'no_such_key' the
retryable responses, and 'failure' triggers not retryable responses.
The 'none' type doesn't trigger an error response but still generates
a sleep.

Example:

	"100:failiure:throttle:none"

The failure types can be repeated to make one of them more frequent like
this:

	"100:throttle:throttle:failure none"

The failure is implemented in the following way. First, the RNG is used
to decide if the failure has to be injected. If RNG() % Frequency is
equal to 0 then the failure is injected. Then the RNG is used to select
a failure type. Then the same approach is used to select a timeout
value. The upper limit for timeout is a timeout of the current api call.

When the generated failure is used it invokes 'ss::sleep_abortable' and
then returns a selected error. This is done in 'generate_get_failure' and
'generate_put_failure' methods. The method may return nullopt if the
failure type is 'none' but it will still invoke 'ss::sleep_abortable'.
This is needed to simulate situation when the actual failure is not
happening but the operation times out anyway.

The 'no_such_key' error can only be triggered by the 'get' request.
Add new parameter to the settings and pass failure injection env var if
it's set.
Run basic E2E test for tiered storage with failure injection turned on.
The test is similar to the existing one but produces 10x more data to
guarante some meaningful number of injected failures to appear.
@Lazin Lazin force-pushed the feature/inject-cloud-errors branch from 442b3f6 to 358511a Compare March 28, 2023 12:05
@Lazin Lazin self-assigned this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants