Etcd robustness tests utilize new LazyFS feature to simulate power failure #16597
Comments
cc @mj-ramos |
I have read through the new LazyFS feature and found one problem. Robustness tests assume that they can control when the failure is injected. The reason is that we are testing the correctness of etcd's behavior, not only whether the data written to disk is consistent. Example scenario for KillFailpoint:
What we would need from LazyFS is a way to inject a "reorder" or "split_write" at an arbitrary time, similar to how clear cache is invoked via a unix socket. I also see that there is a new comment. @mj-ramos, would it be possible to allow injecting "reorder" or "split_write" via a unix socket? Also, could you provide an example of how to inject |
Hello! We are thrilled about the opportunity to contribute to etcd testing. In this discussion, I will explain why we have chosen to use a configuration file and present some solutions for integrating LazyFS into etcd testing. The three newly introduced fault types differ significantly from the
One might want to "split" the last write into two smaller ones and persist just one of them. In such a case, LazyFS will be configured to split the 3rd write issued to the file
With that said, we can incorporate support for injecting these two fault types through the FIFO. There are two possible scenarios:
If one wants to be sure that LazyFS accounts for all etcd write operations, the fault can be sent through the FIFO right before starting etcd. We can even create another FIFO where we write information indicating that the fault has been acknowledged by LazyFS. etcd can then read from this FIFO and initiate its execution after that.
|
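The acknowledgement handshake proposed above could be driven from a test harness roughly as follows. This is a minimal sketch, not LazyFS's actual interface: the second (acknowledgement) FIFO is only a proposal in this thread, and the function names, FIFO paths, and command string are all assumptions made for illustration.

```python
def send_fault(faults_fifo: str, command: str) -> None:
    """Write one command line to the LazyFS faults FIFO.

    Opening a FIFO for writing blocks until a reader (LazyFS) has it open.
    """
    with open(faults_fifo, "w") as fifo:
        fifo.write(command + "\n")


def wait_for_ack(ack_fifo: str) -> str:
    """Block until one line arrives on the acknowledgement FIFO.

    The acknowledgement FIFO is hypothetical: it is the mechanism proposed
    in the comment above, not an existing LazyFS feature.
    """
    with open(ack_fifo, "r") as fifo:
        return fifo.readline().strip()


def inject_and_wait(faults_fifo: str, ack_fifo: str, command: str) -> str:
    """Send a fault, then return only after it has been acknowledged."""
    send_fault(faults_fifo, command)
    return wait_for_ack(ack_fifo)
```

A harness would call `inject_and_wait(...)` and start etcd only after it returns, so that every etcd write is counted by LazyFS. The command string for "reorder" or "split_write" is not defined in this thread, so it is left as a parameter.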
I understand that using on-demand failure injection creates two problems, as you noted:
Are there any problems I missed? I don't think either is a problem for etcd robustness tests. See the https://github.com/etcd-io/gofail library used by etcd to inject failpoints on critical code paths. It allows us to set up failpoints in both modes: on initial start via environment variables, and on demand via HTTP request. In robustness tests we use the on-demand mode. I would recommend having similar parity for modes of triggering failpoints in LazyFS. To go over how we handle those problems: asynchronous is OK, as we inject an etcd panic and wait for the etcd process to crash within some expected time. A similar thing can be done for LazyFS: we could set up a crash and wait for the LazyFS process to exit. As for reproducibility, etcd robustness tests already don't have 100% reproducibility, as we are verifying parallel operations whose order of execution we cannot guarantee. It's more important for us to know and control the timing of the failure injection than to have it be repeatable but happen at an unknown time. In the report you sent, you injected the failure on the first write. It's a nice feature to inject a failure on exactly the first write, but not very practical, as most databases run for days or weeks without downtime, so testing the initialization is not the main concern. As for the |
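The on-demand pattern described above (arm a panic over HTTP, then wait for the process to exit within an expected time) can be sketched as follows. This is an illustration, not the robustness tests' actual code: it assumes the target was started with gofail's HTTP endpoint enabled, and the address, failpoint name, and helper names are hypothetical.

```python
import subprocess
import urllib.request


def failpoint_request(gofail_http: str, failpoint: str, term: str) -> urllib.request.Request:
    """Build the HTTP PUT that arms a gofail failpoint on demand.

    gofail_http is the host:port the process listens on (assumed here);
    term is a gofail term such as 'panic("injected")'.
    """
    url = f"http://{gofail_http}/{failpoint}"
    return urllib.request.Request(url, data=term.encode(), method="PUT")


def wait_for_crash(process: subprocess.Popen, timeout_s: float) -> int:
    """Wait for the process to exit and return its exit code.

    Raises subprocess.TimeoutExpired if it is still running after timeout_s,
    which a test would treat as "the injected failure never fired".
    """
    return process.wait(timeout=timeout_s)
```

A test would send the request with `urllib.request.urlopen(request)` and then call `wait_for_crash(...)`. The injection is asynchronous, but the bounded wait still tells the test when the failure actually happened.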
Hi, I apologize for the late reply. I've been quite busy lately. I get the idea, and it is doable. We will consider introducing such a possibility in LazyFS in the near future. When LazyFS crashes, the mount point becomes inaccessible, but the files are preserved in the root point. Attempting to execute system calls at this point will result in errors such as "Transport endpoint is not connected." In the case of etcd v3.4.25, for example, it stops executing upon encountering these errors as it cannot access its database. |
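A harness could detect the state described above by probing the mount point and checking for `ENOTCONN`, the errno that Linux reports as "Transport endpoint is not connected". This is a minimal sketch under that assumption; the function name and the choice of `os.stat` as the probe are illustrative and do not come from LazyFS or etcd.

```python
import errno
import os


def mount_is_disconnected(mount_point: str) -> bool:
    """Return True if accessing mount_point fails with ENOTCONN.

    Any other outcome (success, or a different error such as a missing
    path) returns False, so callers can tell a crashed FUSE mount apart
    from an ordinary failure.
    """
    try:
        os.stat(mount_point)
    except OSError as err:
        return err.errno == errno.ENOTCONN
    return False
```

Polling this after injecting a crash gives the test an explicit signal that the LazyFS process is gone, without having to infer it from etcd's own errors.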
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
What would you like to be added?
Followup from #16596
Why is this needed?
Improve etcd resiliency