
Use raft log to patch operation history in Robustness tests #15598

Closed
serathius opened this issue Mar 31, 2023 · 5 comments

What would you like to be added?

Robustness tests execute traffic against a cluster and record its history so it can be validated for linearizability. Unfortunately, checking linearizability of a history is an NP-hard problem, meaning it has exponential complexity in the general case. We work around the problem with specialized algorithms (https://github.com/anishathalye/porcupine, which implements https://arxiv.org/pdf/1504.00204.pdf) and history patching.

History patching is needed to remove operations that would otherwise make linearization exponentially expensive. Those operations are client requests that returned an error for any reason: a connection timeout, a request timeout, etc. In those cases we don't know whether the request was persisted, and if it was persisted, what revision it got. Without knowing this, we have to consider every possibility and every order of requests: for a sequence of n such requests we get 2^n states (persisted or not), multiplied by n! possible orderings. For example, just 10 failed requests already yield 2^10 · 10! ≈ 3.7 × 10^9 combinations to check.

History patching uses an independent history of events, collected via watches, to either remove failed requests from the operation history or assign them the revision they received. This is possible for PUT and TXN requests because robustness tests write a unique value in each request, making every write identifiable in the watch stream.
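
For illustration, here is a minimal sketch of that value-based matching. The Operation and Event types and the patch helper are hypothetical simplifications, not the actual robustness test code:

```go
// Hypothetical sketch of value-based history patching; not the actual
// robustness test code.
package main

import "fmt"

// Operation is a client request whose outcome is unknown because it
// returned an error. Robustness tests give each write a unique value.
type Operation struct {
	Key   string
	Value string
}

// Event is an observation collected independently via a watch.
type Event struct {
	Key      string
	Value    string
	Revision int64
}

// patch returns, for every failed operation observed in the watch stream,
// the revision it was persisted at. Operations absent from the stream are
// treated as never persisted and removed from the history.
func patch(failed []Operation, events []Event) map[Operation]int64 {
	revByWrite := make(map[Operation]int64, len(events))
	for _, e := range events {
		revByWrite[Operation{Key: e.Key, Value: e.Value}] = e.Revision
	}
	persisted := make(map[Operation]int64)
	for _, op := range failed {
		if rev, ok := revByWrite[op]; ok {
			persisted[op] = rev
		}
	}
	return persisted
}

func main() {
	failed := []Operation{{Key: "a", Value: "v1"}, {Key: "a", Value: "v2"}}
	events := []Event{{Key: "a", Value: "v2", Revision: 7}}
	// Only the v2 PUT is known to have persisted (at revision 7); v1 is dropped.
	fmt.Println(patch(failed, events))
}
```

Note that "absent from the stream" is treated here as "never persisted", which is exactly the assumption the Problem section below challenges.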

Problem

Watch events only include operations that succeeded. A DELETE request for a non-existent key will not appear in the watch history. The same goes for a TXN that failed its condition and had an empty set of onFailure operations. Those operations were still executed by etcd, and their conditions were checked; however, the current patching marks them as never having existed.

Using watches to patch the operation history means that etcd could have a linearization issue in its condition checking and we would never know about it. We need a way to collect not only the operations that succeeded, but all operations that were applied to the backend.
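
To make the problem concrete, here is a sketch of the two "invisible" operations using the public clientv3 API; the endpoint and keys are assumptions for illustration:

```go
// Sketch of the two cases invisible to watch, using the public clientv3 API.
// The endpoint and keys are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Case 1: DELETE of a non-existent key. etcd executes it, but since no
	// key is removed, no watch event is ever emitted.
	delResp, err := cli.Delete(ctx, "missing-key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("deleted:", delResp.Deleted) // 0

	// Case 2: TXN whose condition fails and whose else-branch is empty.
	// etcd evaluates the condition, but nothing is written, so again no
	// watch event is emitted.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("some-key"), "=", "expected")).
		Then(clientv3.OpPut("some-key", "new-value")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("txn succeeded:", txnResp.Succeeded) // false
}
```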

Proposed solution

  • Validate the best way to access the raft log in a cluster: either by reading the WAL files (see the sketch after this list) or by connecting to raft as a learner.
  • Use the raft log to decide whether an operation was executed or not.
  • Still use the watch history, but only to provide revisions for the operations that were persisted.
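
As a rough sketch of the first option, the snippet below replays a member's WAL and extracts every committed client request. It assumes the etcd 3.5 module layout (on recent main branches the wal package lives under server/v3/storage/wal) and a hypothetical WAL directory path; the actual integration into the robustness tests would differ:

```go
// Rough sketch: replay a member's WAL to list every request raft committed,
// regardless of whether a watch event was ever emitted for it.
// Import paths follow the etcd 3.5 layout.
package main

import (
	"fmt"
	"log"

	"go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/raft/v3/raftpb"
	"go.etcd.io/etcd/server/v3/wal"
	"go.etcd.io/etcd/server/v3/wal/walpb"
	"go.uber.org/zap"
)

// persistedRequests reads all entries from a member's WAL directory and
// decodes the normal raft entries into etcd's internal request type.
func persistedRequests(walDir string) ([]etcdserverpb.InternalRaftRequest, error) {
	w, err := wal.OpenForRead(zap.NewNop(), walDir, walpb.Snapshot{})
	if err != nil {
		return nil, err
	}
	defer w.Close()

	_, _, ents, err := w.ReadAll()
	if err != nil {
		return nil, err
	}
	var reqs []etcdserverpb.InternalRaftRequest
	for _, ent := range ents {
		// Skip conf changes and the empty entry appended after elections.
		if ent.Type != raftpb.EntryNormal || len(ent.Data) == 0 {
			continue
		}
		var req etcdserverpb.InternalRaftRequest
		if err := req.Unmarshal(ent.Data); err != nil {
			continue // e.g. legacy entries that are not InternalRaftRequest
		}
		reqs = append(reqs, req)
	}
	return reqs, nil
}

func main() {
	// Hypothetical data directory of one test cluster member.
	reqs, err := persistedRequests("/tmp/etcd-member/member/wal")
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range reqs {
		if r.Header != nil && (r.Put != nil || r.DeleteRange != nil || r.Txn != nil) {
			fmt.Printf("applied request id=%d\n", r.Header.ID)
		}
	}
}
```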

Why is this needed?

Patching the operation history is critical to make robustness tests performant; however, in its current state it limits our ability to validate correctness. We should be aware of this flaw and fix it when needed.

Not a priority for now, but an important thing to keep in mind.

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@serathius (Member Author)

I should consider prioritizing this in order to implement compaction for Kubernetes. With more aggressive compaction, watches will be interrupted, which would lead to gaps in our event history. With gaps we cannot patch the operation history, which could lead to linearization timeouts.
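
For illustration, here is a sketch of how such a gap surfaces to a watcher, using the public clientv3 API (endpoint and revision are assumptions):

```go
// Sketch: a watch resuming from a compacted revision is canceled, leaving
// an unrecoverable gap in the collected event history.
package main

import (
	"context"
	"fmt"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Suppose the watch disconnected and revision 100 was compacted away
	// before it could resume.
	for resp := range cli.Watch(ctx, "", clientv3.WithPrefix(), clientv3.WithRev(100)) {
		if resp.Canceled {
			// Events in (100, CompactRevision] are gone for good: a gap that
			// makes patching the operation history impossible.
			fmt.Println("watch canceled, compacted at:", resp.CompactRevision, resp.Err())
			return
		}
		for _, ev := range resp.Events {
			fmt.Println(ev.Type, string(ev.Kv.Key), ev.Kv.ModRevision)
		}
	}
}
```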

ahrtr (Member) commented Apr 9, 2024

Overall it seems not a good idea to verify etcd's correctness depending on WAL data. It's a little strange that you verify etcd's correctness, but you trust the WAL data generated by etcd.

We should regard the whole of etcd as a black box, and interact with & verify it using its public APIs. The WAL data should be transparent to the robustness test; it's supposed to be used by the etcdserver itself only.

A related topic: should we patch the history using the watch events? My thoughts:

  • I tend to keep them totally separated to simplify the robustness test's implementation: we verify linearizability using porcupine & the model, and we separately verify that watch doesn't break any invariant properties; they are two different & separate verifications. On the other hand, keeping them separate indeed doesn't help to reduce the exponential complexity caused by client requests that failed for any reason. So I don't have a very strong opinion on this for now, although I lean toward keeping them totally separated.
  • At the very least, we should only consider patching the history using watch events after we have verified that the watch events themselves do not break any invariant properties.

@serathius (Member Author)

> Overall it seems not a good idea to verify etcd's correctness depending on WAL data.

I don't depend on WAL data for checking correctness; that is done using client requests that treat etcd as a black box. I use the WAL to remove dead ends from linearization; it should not impact the main linearization path. If there were any error in etcd with regard to the WAL, the robustness tests would still catch it.

> We should regard the whole of etcd as a black box, and interact with & verify it using its public APIs. The WAL data should be transparent to the robustness test; it's supposed to be used by the etcdserver itself only.

If I had a quantum computer, then yes; but unfortunately we are still far from solving NP-complete problems efficiently. We cannot do this with a fully black-box approach, and we need some tricks to succeed.

> A related topic: should we patch the history using the watch events?

We already do that; that's why we could not implement compaction, as with compaction we can no longer guarantee that we will observe all events.

@serathius (Member Author)

Done
