
roachprod/failure-injection: add support for a failure injection library #138970

Open
@DarrylWong

Description

The work described here supports the creation of a failure injection framework for roachprod; see the parent issue for details.


The current failure injection story in roachprod consists of ad hoc implementations scattered across roachtests and DRT. For example:

The network/authentication roachtest creates a network partition around a leaseholder node. It does so directly in the test itself using iptables.

netConfigCmd := fmt.Sprintf(`
# ensure any failure fails the entire script.
set -e;
# Setting default filter policy
sudo iptables -P INPUT ACCEPT;
sudo iptables -P OUTPUT ACCEPT;
# Drop any node-to-node crdb traffic.
sudo iptables -A INPUT -p tcp --dport {pgport%s} -j DROP;
sudo iptables -A OUTPUT -p tcp --dport {pgport%s} -j DROP;
sudo iptables-save
`,

This is not ideal as:

  • It cannot be reused in other roachtests or DRT. DRT has its own very similar implementation found in cockroach/pkg/cmd/roachtest/operations/network_partition.go.
  • It requires the test to understand iptables implementation details, even though the test should only care that a network partition has been created.

Disk stall failure injection is a better example of reusability: it has been refactored into a roachtestutil helper found in cockroach/pkg/cmd/roachtest/roachtestutil/disk_stall.go, which allows it to be reused across roachtests as well as DRT. However, this still falls short of ideal, as it cannot be used on roachprod clusters directly through the CLI.

Additionally, there is a lack of unit testing to verify that failures are actually injected. In the past, a small mistake in the iptables command caused the wrong port to be partitioned, so the test was effectively not testing anything, and this went unnoticed for a while.
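One way to make the wrong-port class of mistake testable is to build the iptables rules in a pure function and assert on its output. The sketch below is illustrative only: `partitionRules` and `targetsPort` are hypothetical helpers that do not exist in roachprod, and the port value is just an example. It checks only the generated command text; verifying that traffic is really dropped would still require probing connectivity on a live cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// partitionRules is a hypothetical helper (not part of roachprod). It returns
// the iptables commands that drop TCP traffic to the given destination port
// on both the INPUT and OUTPUT chains, mirroring the script in the roachtest.
func partitionRules(port int) []string {
	rules := make([]string, 0, 2)
	for _, chain := range []string{"INPUT", "OUTPUT"} {
		rules = append(rules,
			fmt.Sprintf("sudo iptables -A %s -p tcp --dport %d -j DROP", chain, port))
	}
	return rules
}

// targetsPort reports whether there is at least one rule and every rule
// matches exactly the expected destination port. The trailing space in the
// pattern prevents 2625 from matching a rule written for 26257.
func targetsPort(rules []string, port int) bool {
	want := fmt.Sprintf("--dport %d ", port)
	for _, r := range rules {
		if !strings.Contains(r, want) {
			return false
		}
	}
	return len(rules) > 0
}

func main() {
	// 26257 is used purely as an example port.
	rules := partitionRules(26257)
	for _, r := range rules {
		fmt.Println(r)
	}
	fmt.Println("rules target expected port:", targetsPort(rules, 26257))
}
```

Keeping command construction separate from command execution means a unit test can catch a mis-specified port without needing a cluster.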


Instead, roachprod should support a Failure Injection (FI) library. The FI library should supply a set of high-level failures that can be injected into a roachprod cluster. The implementation details should be hidden from the caller, such that only the type of failure and a Run() function need to be specified.

An example of such an interface:

type FailureMode interface {
    Setup(func()) error
    Attack(Run func(Args ...[]interface{})) error
    Restore(Run func(Args ...[]interface{})) error
}
  • Setup initializes any dependencies, could be a noop.
  • Attack applies the requested failure.
  • Restore reverts the requested failure.
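To make the Setup/Attack/Restore lifecycle concrete, here is a minimal sketch of a type satisfying the example interface. The interface is copied from above; `fakePartition` is a hypothetical in-memory stand-in that only tracks state and passes a description to the supplied Run function, so it does not touch iptables or any real cluster. The ordering checks (Attack requires Setup, Restore requires an active failure) are an assumption of this sketch, not something the issue specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureMode is the example interface from the issue, reproduced verbatim.
type FailureMode interface {
	Setup(func()) error
	Attack(Run func(Args ...[]interface{})) error
	Restore(Run func(Args ...[]interface{})) error
}

// fakePartition is a hypothetical failure used only for illustration.
// It records whether it has been set up and whether the failure is active.
type fakePartition struct {
	ready  bool
	active bool
}

// Setup runs the optional initializer and marks the failure as ready.
func (p *fakePartition) Setup(init func()) error {
	if init != nil {
		init()
	}
	p.ready = true
	return nil
}

// Attack injects the failure by handing a description to run.
func (p *fakePartition) Attack(run func(args ...[]interface{})) error {
	if !p.ready {
		return errors.New("Attack called before Setup")
	}
	if p.active {
		return errors.New("failure already injected")
	}
	run([]interface{}{"inject partition"})
	p.active = true
	return nil
}

// Restore reverts the failure; it is an error if nothing was injected.
func (p *fakePartition) Restore(run func(args ...[]interface{})) error {
	if !p.active {
		return errors.New("no active failure to restore")
	}
	run([]interface{}{"remove partition"})
	p.active = false
	return nil
}

func main() {
	// The Run function here just logs; a real caller would execute
	// commands against the cluster instead.
	run := func(args ...[]interface{}) {
		for _, a := range args {
			fmt.Printf("run: %v\n", a)
		}
	}

	var f FailureMode = &fakePartition{}
	if err := f.Setup(nil); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	if err := f.Attack(run); err != nil {
		fmt.Println("attack failed:", err)
		return
	}
	if err := f.Restore(run); err != nil {
		fmt.Println("restore failed:", err)
	}
}
```

Because the caller interacts only with the interface, a test can swap in a fake like this one to exercise its own logic, while the real implementation stays inside the library.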

The primary goal of such a library is to eventually be used in the aforementioned Failure Injection Test Framework. In the interim, it will also help consolidate and improve the existing failure injection techniques used in roachtest/DRT. Additionally, we will be able to test more rigorously that these failures work as expected, avoiding regressions like the one described above.


The list of supported failures can grow endlessly long if we let it. Instead, an initial attempt should focus on consolidating existing failures, plus a few targeted FI techniques, e.g. Jepsen network FI.

Once initial support of a FI framework is complete, we can revisit adding more complex failure injection scenarios.

Jira issue: CRDB-46442


    Labels

    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-testeng: TestEng Team
    target-release-25.2.0
