Description
The work described here supports the creation of a failure injection framework for roachprod; see the parent issue for details.
The current failure injection story in roachprod consists of ad hoc implementations scattered across roachtests and DRT. For example:
The network/authentication roachtest creates a network partition around a leaseholder node. It does so directly in the test itself using iptables.
cockroach/pkg/cmd/roachtest/tests/network.go
Lines 253 to 266 in 049c30a
This is not ideal as:
- It cannot be reused in other roachtests or DRT. DRT has its own very similar implementation found in cockroach/pkg/cmd/roachtest/operations/network_partition.go.
- It requires the test to understand implementation details surrounding iptables, even though it should only care that, at a high level, a network partition has been created.
Disk stall failure injection proves to be a better example of reusability, having been refactored as a roachtestutil helper found in cockroach/pkg/cmd/roachtest/roachtestutil/disk_stall.go. This allows it to be reused across roachtests as well as DRT. However, this is still short of ideal, as it cannot be used on roachprod clusters directly through the CLI.
Additionally, there is a lack of unit testing to verify that failures are actually injected. In the past, a papercut in the iptables command caused the wrong port to be partitioned, and the test was effectively not testing anything, which went unnoticed for a while.
Instead, roachprod should support a Failure Injection (FI) library. The FI library should supply a set of high-level failures that can be injected into a roachprod cluster. The implementation details should be hidden from the caller such that only the type of failure and a Run() function need to be specified.
An example of such an interface:
type FailureMode interface {
	Setup(func()) error
	Attack(Run func(Args ...[]interface{})) error
	Restore(Run func(Args ...[]interface{})) error
}
- Setup initializes any dependencies, could be a noop.
- Attack applies the requested failure.
- Restore reverts the requested failure.
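To make the three phases concrete, here is a minimal sketch of how a network failure might implement an interface of this shape. Everything in it is an assumption for illustration: `RunFunc` simplifies the proposed callback to a single command string, `PortPartition` is a hypothetical type, and the iptables rule is one possible way to drop inbound traffic, not the rule the existing roachtest uses.

```go
package main

import "fmt"

// RunFunc is a hypothetical stand-in for "run this shell command on the
// target nodes". The proposed interface passes a variadic Run callback;
// a single command string is used here only to keep the sketch short.
type RunFunc func(cmd string) error

// FailureMode follows the shape of the proposed interface, using the
// simplified callback type above.
type FailureMode interface {
	Setup(run RunFunc) error
	Attack(run RunFunc) error
	Restore(run RunFunc) error
}

// PortPartition is an illustrative failure mode that drops inbound TCP
// traffic to a single port.
type PortPartition struct {
	Port int
}

// rule returns the iptables rule specification shared by Attack and
// Restore, so the rule that is removed always matches the one added.
func (p PortPartition) rule() string {
	return fmt.Sprintf("INPUT -p tcp --dport %d -j DROP", p.Port)
}

// Setup is a noop here: this failure mode has no dependencies to install.
func (p PortPartition) Setup(run RunFunc) error { return nil }

// Attack appends the DROP rule.
func (p PortPartition) Attack(run RunFunc) error {
	return run("sudo iptables -A " + p.rule())
}

// Restore deletes the same rule.
func (p PortPartition) Restore(run RunFunc) error {
	return run("sudo iptables -D " + p.rule())
}

func main() {
	// Record commands instead of executing them, so the sketch can run
	// anywhere and the issued commands can be inspected.
	var issued []string
	record := func(cmd string) error {
		issued = append(issued, cmd)
		return nil
	}

	var f FailureMode = PortPartition{Port: 26257}
	for _, step := range []func(RunFunc) error{f.Setup, f.Attack, f.Restore} {
		if err := step(record); err != nil {
			panic(err)
		}
	}
	for _, cmd := range issued {
		fmt.Println(cmd)
	}
}
```

Because the caller supplies the run function, the same failure mode could be driven by a roachtest, a DRT operation, or the roachprod CLI, and a unit test can substitute a recorder to assert on the exact commands issued.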
The primary goal of such a library is to eventually be used in the aforementioned Failure Injection Test Framework. In the interim, it will also help consolidate and improve the existing failure injection techniques used in roachtest/DRT. Additionally, we will be able to test more rigorously that these failures work as expected, avoiding regressions like the wrong-port partition described above.
The list of supported failures can grow endlessly long if we let it. Instead, an initial attempt should focus on consolidating existing failures and adding a few target FI techniques, e.g. Jepsen network FI.
Once initial support for an FI framework is complete, we can revisit adding more complex failure injection scenarios.
Jira issue: CRDB-46442