roachprod/failure-injection: add support for a failure injection library

This work supports the creation of a failure injection framework for roachprod; [see the parent issue for details](https://github.com/cockroachdb/cockroach/issues/138958).

----

The current failure injection story in roachprod consists of ad hoc implementations scattered across roachtests and DRT. For example:

The `network/authentication` roachtest creates a network partition around a leaseholder node. It does so directly in the test itself using `iptables`.

https://github.com/cockroachdb/cockroach/blob/049c30ac6956fa8f307ba2adae8b4388f5907859/pkg/cmd/roachtest/tests/network.go#L253-L266

This is not ideal as:

- It cannot be reused in other roachtests or DRT. DRT has its own very similar implementation found in `cockroach/pkg/cmd/roachtest/operations/network_partition.go`.
- It requires the test to understand `iptables` implementation details, even though the test should only care that, at a high level, a network partition has been created.

`disk stall` failure injection is a better example of reusability: it has been refactored into a `roachtestutil` helper found in `cockroach/pkg/cmd/roachtest/roachtestutil/disk_stall.go`, which allows it to be reused across roachtests as well as DRT. However, this still falls short of ideal, as it cannot be used on roachprod clusters directly through the CLI.

Additionally, there is a lack of unit testing to verify that failures are actually injected. [In the past, a papercut](https://github.com/cockroachdb/cockroach/commit/e040cd1fb8d9b52d5e9056053982e2be8869b099) in the `iptables` command caused the wrong port to be partitioned, so the test was effectively not testing anything, and this went unnoticed for a while.

----

Instead, roachprod should support a Failure Injection (FI) library. The FI library should supply a set of high-level failures that can be injected into a roachprod cluster. The implementation details should be hidden from the caller, such that only the type of failure and a `Run()` function need to be specified.

An example of such an interface:

```
type FailureMode interface {
    Setup(func()) error
    Attack(run func(args ...[]interface{})) error
    Restore(run func(args ...[]interface{})) error
}
```

- `Setup` initializes any dependencies; it may be a no-op.
- `Attack` applies the requested failure.
- `Restore` reverts the requested failure.
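
The lifecycle above could be implemented for a port-level network partition roughly as follows. This is a minimal sketch, not the library's actual design: `RunFunc`, `PortPartition`, and `Inject` are hypothetical names, and the interface is simplified to take a command runner so that the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// RunFunc is a hypothetical stand-in for whatever executes a shell command
// on the target nodes.
type RunFunc func(cmd string) error

// FailureMode is a simplified variant of the proposed interface, written
// against RunFunc (an assumption made for this sketch).
type FailureMode interface {
	Setup(run RunFunc) error
	Attack(run RunFunc) error
	Restore(run RunFunc) error
}

// PortPartition drops TCP traffic destined for a single port using iptables.
type PortPartition struct {
	Port int
}

// rule builds one iptables command; action is "-A" (append) or "-D" (delete).
func (p PortPartition) rule(action, chain string) string {
	return fmt.Sprintf("sudo iptables %s %s -p tcp --dport %d -j DROP", action, chain, p.Port)
}

// Setup only validates the configuration; no dependencies are installed.
func (p PortPartition) Setup(run RunFunc) error {
	if p.Port <= 0 || p.Port > 65535 {
		return errors.New("port out of range")
	}
	return nil
}

// Attack appends DROP rules to the INPUT and OUTPUT chains.
func (p PortPartition) Attack(run RunFunc) error {
	return runAll(run, p.rule("-A", "INPUT"), p.rule("-A", "OUTPUT"))
}

// Restore deletes the rules that Attack appended.
func (p PortPartition) Restore(run RunFunc) error {
	return runAll(run, p.rule("-D", "INPUT"), p.rule("-D", "OUTPUT"))
}

// runAll executes commands in order and stops at the first error.
func runAll(run RunFunc, cmds ...string) error {
	for _, c := range cmds {
		if err := run(c); err != nil {
			return fmt.Errorf("running %q: %w", c, err)
		}
	}
	return nil
}

// Inject drives a failure through Setup, Attack, and Restore, calling during
// while the failure is active. Restore is attempted even if during fails.
func Inject(f FailureMode, run RunFunc, during func() error) error {
	if err := f.Setup(run); err != nil {
		return err
	}
	if err := f.Attack(run); err != nil {
		return err
	}
	workErr := during()
	restoreErr := f.Restore(run)
	return errors.Join(workErr, restoreErr)
}
```

With this shape, a caller would write something like `Inject(PortPartition{Port: 26257}, run, workload)` and never touch `iptables` directly. Because the runner is passed in, a test can substitute one that records commands instead of executing them.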


The primary goal of such a library is to eventually be used in the aforementioned Failure Injection Test Framework. In the interim, it will also help consolidate and improve the existing failure injection techniques used in roachtest/DRT. Additionally, we will be able to test more rigorously that these failures work as expected, avoiding regressions like the one described above.

----

The list of supported failures can grow endlessly if we let it. Instead, an initial attempt should focus on consolidating existing failures and adding a few targeted FI techniques, e.g. [jepsen network FI](https://github.com/cockroachdb/jepsen/blob/4beba6c2d63587ab9ffaa24580f73b5674e65e55/jepsen/src/jepsen/net.clj#L57-L109).

Once initial support for an FI framework is complete, we can revisit adding more complex failure injection scenarios.


Jira issue: CRDB-46442

	netConfigCmd := fmt.Sprintf(`
	# ensure any failure fails the entire script.
	set -e;

	# Setting default filter policy
	sudo iptables -P INPUT ACCEPT;
	sudo iptables -P OUTPUT ACCEPT;

	# Drop any node-to-node crdb traffic.
	sudo iptables -A INPUT -p tcp --dport {pgport%s} -j DROP;
	sudo iptables -A OUTPUT -p tcp --dport {pgport%s} -j DROP;

	sudo iptables-save
	`,
