[Discuss] Writes on NRT Replica with Remote Translog #3706
Comments
Well, I was thinking about this problem and realised that it would exist with segment uploads as well. Below is a use case -
|
The above scenario has been detailed and elaborated further in #3906 by @sachinpkale. |
Continuing further on the original point of discussion -
Avoiding divergent writes with segment replication & remote translog
1. Context
Along with segment replication, we plan to have only one source of truth for the translog: the remote store. Currently, in NRT segment replication mode, segments are replicated asynchronously, and the primary is responsible for forwarding write requests to the replicas, where the same request is replayed without queuing it in the indexing buffer, only keeping track of the translog. Having the translog on replicas keeps durability in check and also lets a shard validate whether it is still the primary. For segments, only the primary is responsible for creating them, and it notifies the replicas to sync the segments using NRT checkpoints.
With remote translog on top of NRT segment replication, the remote store becomes the single source of truth for the translog, and only the current primary uploads translog files. This obsoletes 1) generating the translog on replicas, as it can be replayed from the remote store in case of failovers/failures, and 2) storing the translog on local disk on the primary. However, this does not necessarily eliminate the need for replication, as the replication call also helps a shard that assumes it is the primary validate whether it really is. Now, if we use the remote store for translog and also stop replication requests, we get into a situation where, due to a network partition, the older primary can continue to accept write requests and cause divergent writes. The issue is further accentuated by the changes made to allow writes in #3621.
Invariants
As mentioned in the current issue description, there are 2 possible options that are being discussed in further detail in following sections -
More details on the above 2 possible options follow below -
2. No-op replication
Currently, if there are replicas configured, write requests are replicated synchronously across the in-sync replicas. This offers durability if the primary shard copy fails. When this happens, the master promotes one of the in-sync replicas to primary. If there is no in-sync replica, the index shows red status. Now, with a remote store for translog along with segment replication, we don't need the translog to be stored locally on the primary or on the replicas. So, technically, we can skip keeping the write request in the indexing buffer and also skip the sync/async persistence of the translog on the replica's data volume. This would require additional changes as mentioned in #3786. We can continue to use the primary term validation logic that helps a stale, partitioned primary realise that it is no longer the primary.
2.1 Concerns
3. Using remote store that guarantees read-after-write consistency
The primary term validation that we want to achieve in section 2 (No-op replication) can also be modelled using a remote store. The following could be a strategy that the primary can follow -
3.1 Concerns
4. Other areas of concern
5. Considerations
5.1 Evaluation of approaches simplified
Based on the evaluation, both approaches can be used to handle divergent writes for remote translog. However, considering the major invariants (durability and correctness), performance, and other considerations, we can modify the current replication to make it a no-op and achieve the overall goal. Please feel free to correct me if I have stated any facts incorrectly; feedback and suggestions are welcome.
6. References
|
cc: @mch2 @sachinpkale @Bukhtawar @kartg @andrross @reta @nknize. Looking for thoughts and suggestions. |
Caveat: I'm certainly not an expert on this part of OpenSearch, so please correct me wherever I'm wrong. One nitpick that might help clarify things is to replace the phrase "no op replication" with something like "primary term validation". If I understand correctly, you wouldn't need to send the document over the wire and instead would just be asking the replicas "is the primary term x?". One concern I have about that approach is that today, when a document is replicated via the translog, the replicas are in effect performing two operations transactionally: 1) validating that the primary term hasn't changed, and 2) persisting the document. With the proposed change, the replicas will confirm the primary term hasn't changed and the remote store will persist the document. Those operations will be split apart across different distributed components, which makes me a little nervous. Can you provide more details about how this will work? E.g. do you upload to the remote store before or after confirming the primary term with the replicas? What happens if the primary term changes in between those operations? Having said that, it does seem like a more natural fit to have the remote translog provide the same conceptual functionality provided by replicas today (persist this document only if the primary term is still x). It obviously puts a stronger requirement on the actual remote store because it has to be more than just dumb durable storage. |
A few comments from my side.
I think this is the right thing to do, even in the case of a primary with no replicas: when the connection with the master is lost (split brain etc.), it looks right for the primary to go into read-only mode and not accept writes anymore.
It looks clever to store the |
Thanks @andrross for the observation. Here is how we intend to solve the above, but first some context on how doc repl does it. Below is the sequence of steps.
What follows from above is,
So if we stay with the status quo, we could always write operations to the remote xlog. Once the write completes, but before we acknowledge it, we can check the primary term.
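The ordering described here (persist, then validate the term, then acknowledge) can be sketched roughly as follows. This is a minimal illustration, not OpenSearch code: `RemoteTranslog`, `StalePrimaryError`, `index_operation` and the `current_term_lookup` callable are hypothetical names, and the in-memory list merely stands in for a remote store.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteTranslog:
    """Hypothetical in-memory stand-in for the remote translog store."""
    entries: list = field(default_factory=list)

    def append(self, primary_term: int, seq_no: int, op: str) -> None:
        self.entries.append((primary_term, seq_no, op))


class StalePrimaryError(Exception):
    """Raised when the writer learns that a newer primary term exists."""


def index_operation(translog, current_term_lookup, my_term, seq_no, op):
    """Persist first, validate the primary term second, acknowledge last.

    current_term_lookup is a hypothetical callable standing in for whatever
    validates the term (the replicas, or a consistent metadata read).
    """
    translog.append(my_term, seq_no, op)   # may turn out to be a dirty write
    if current_term_lookup() != my_term:   # check after write, before ack
        raise StalePrimaryError(f"term {my_term} is stale")
    return "ack"
```

Note that the write from the stale primary still lands in the store; the sketch only guarantees that it is never acknowledged, which matches the dirty-write status quo discussed in this thread.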
This part is something we should discuss more. The implementation would probably require some leasing protocol
We can always go with the "Remote Store" based proposal, which still provides the limited guarantees we need. The part I like about this approach is that it better guards against isolated primaries as long as the cluster can guarantee the primary term invariant (no two primaries can have the same primary term at any point) at all times. The "No Op replication" approach inherently treats the remote xlog store as an extension of the "primary" translog storage instead of treating it as a separate distributed store, in a bid to ensure that writes only ever happen from the primary of the shard, instead of guaranteeing "a single writer" at all times. The caveat here is that if the change of writer version (in this case the primary term) isn't itself guaranteed to be serialised, the remote store wouldn't be able to guarantee correctness unless it implements some form of the leasing protocol mentioned above. One thing to note here is that the current snapshot mechanism (only primaries snapshot) also uses a remote store while keeping the remote store logic dumb. I think we should confirm how cases like isolated primaries are dealt with there and whether the remote store or remote xlog can leverage some of its properties. It is possible that snapshots make certain trade-offs which might not be acceptable for the current use case.
@reta Remote translog pruning would be discussed separately #3766 |
Thanks @ashking94 for the detailed approaches. Regarding: |
@sachinpkale Yes, we are thinking to use the same remote store that will be used to store translog. Read-after-write is not just a requirement for primary term validation, but has to be a mandate for storing remotely as we want subsequent reads (after writes) to give us the data written so far, else we can get into an inconsistent state in case of auto restore/failovers/recoveries and likewise. |
@Bukhtawar Thanks for calling out that OpenSearch today allows for dirty reads. I think it is probably okay to stay with the status quo behavior here. (side note: this is where something like a TLA+ model would be helpful, mentioned in #3866, because it could be a more formal definition of the various consistency properties offered by the system).
With the current system, an isolated primary may briefly write to its local translog and serve dirty reads, but those writes will not be acknowledged and will not be replicated/persisted. In the remote case, the isolated primary will still write to the remote xlog before discovering it is no longer the primary. Is there a mechanism to ensure those writes will not be picked up by the new primary? Is that a concern here? (Update: I see now this is discussed in the "remote store" option which I commented on below, but I think this is also an issue for the "no op replication" case too?) |
For this I was thinking of relying on transactional properties of the remote store. I think it's unlikely that any of the remote object stores used in OpenSearch today would be able to meet that contract. Given that we can accept uncommitted reads (the current behavior), we probably don't need such a strong guarantee. |
Good write up @ashking94! Just some comments on "Using remote store that guarantees read-after-write consistency":
For this to work properly I think you might need proper check-and-set semantics in the remote store (i.e. conditional write). The above describes two steps: 1) read and validate, then 2) write new value. Is there protection here to ensure that that
I don't think we want to use timestamps for logical ordering like this because clock skew can happen, clocks can go backwards, etc. Are sequence numbers serialized relative to primary term changes? If we know the sequence number at which a primary term changes, then any sequence numbers newer than that from the previous primary term could be ignored. |
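As a rough illustration of the sequence-number idea above: assuming the first sequence number of each primary term is known (the `term_start_seq_no` mapping below is a hypothetical input, as is the function name), operations written by an older term at or beyond the start of the next term could be filtered out during replay.

```python
def filter_stale_operations(operations, term_start_seq_no):
    """Drop operations a stale primary wrote after a newer term had started.

    operations: iterable of (primary_term, seq_no, payload) tuples.
    term_start_seq_no: dict mapping each primary term to the first sequence
    number assigned under that term (assumed to be known).
    """
    terms = sorted(term_start_seq_no)
    kept = []
    for term, seq_no, payload in operations:
        newer = [t for t in terms if t > term]
        # An operation is valid only if it precedes the start of the next term.
        if newer and seq_no >= term_start_seq_no[newer[0]]:
            continue
        kept.append((term, seq_no, payload))
    return kept
```

The comparison uses only terms and sequence numbers, so clock skew does not affect the outcome.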
@andrross Check-and-set semantics are probably not needed: as discussed, it's fine for the translog to have dirty writes, we just need to guarantee that we don't ack them. So essentially we need to write to the translog under a primaryTerm (which could be stale) and then read the primary term again after writing the translog but before acking (this could be another piece of metadata on the remote store that guarantees read-after-write consistency). One way to achieve that is for the writer in primaryTerm (p+1) to write a file (stale.manifest) into every previous primaryTerm path; its mere existence indicates that the path is no longer writeable, which replaces a common primaryTerm file. The protocol then changes to checking whether stale.manifest exists in the given primaryTerm path. What the cluster guarantees is that once a higher primaryTerm is known to exist, all older primaryTerm writers should go stale. Alternatively, if we write a new file, prefixed by primaryTerm, in a common directory before we start the shard in that primaryTerm, all writers should do a LIST and confirm that no writer in a higher term could be active before acknowledging writes.
Firstly, it's still fine to take or discard an unacknowledged write (since dirty writes are the status quo). I totally agree that timestamp resolution would be incorrect to start with, and we would be using sequence numbers for all practical purposes of resolving conflicts. |
Given there could be rough edges in a new "remote store" protocol that mixes cluster state primitives and distributed store locking, especially when file system stores lack optimistic concurrency control, it seems safer and optimal to use the No Op replication protocol, which has already been battle tested and relies on consensus among all writers per request before deciding whether to acknowledge the request. I would seek feedback on the cases that would probably not be covered by this protocol before we start ironing out rough edges for the "remote store" protocol. |
I agree that no-op replication is safer, but it introduces a dependency from primary to replica on the write path, which otherwise doesn't have to be there in the case of NRT with remote storage. I get the challenges involved in depending on the properties of the remote store. There are a couple of requirements here: read-after-write and a uniqueness guarantee for the primary term. Since primary term uniqueness is guaranteed by the cluster management system, all we need is a way for the older primary to know that a new primary has started writing. There are 2 options: 1) The old primary can read a common file (or do a list) after every write (before ack), which is expensive. 2) The new primary can write only after there is a guarantee that the old primary has stopped. This can also be expensive if it has to be done for every write. However, if it is done only for the first write after primary promotion, it can be optimal. For example, a primary, when it is promoted, can make the previous term's remote store locations non-writable. This way, in steady state, there is no tax; only during a primary switch is there one-time work to mark the old location non-writable. For example, in the case of S3, we can use object tags to have a condition that only objects with newPrimaryTerm as a tag can be written to that location of the shard. Hopefully most remote stores support that kind of ACL. However, if this ACL policy propagation takes time, it could delay new primary promotion. |
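The stale.manifest idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for an object store with read-after-write consistency, the `shard/term/...` key layout is made up for the example, and primary terms are assumed to start at 1.

```python
class InMemoryStore:
    """Hypothetical object store; assumes read-after-write consistency."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def exists(self, key):
        return key in self.objects


def start_primary(store, shard, new_term):
    """New primary marks every older term path as stale before writing."""
    for term in range(1, new_term):
        store.put(f"{shard}/{term}/stale.manifest")


def write_and_ack(store, shard, term, seq_no, op):
    """Upload the translog entry, then check the marker before acking."""
    store.put(f"{shard}/{term}/translog-{seq_no}", op)
    if store.exists(f"{shard}/{term}/stale.manifest"):
        return False   # dirty write stays in the store but is never acked
    return True
```

The price of this variant is one extra existence check per write on the old path, which is the cost raised later in the thread.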
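The second option (fence the old term's location once, at promotion) can be modelled as below. This is a toy in-memory model of the intended effect only; it does not use any real object-store ACL or tagging API, and `FencedStore`, `fence_below` and `put` are hypothetical names.

```python
class FencedStore:
    """Hypothetical store that rejects writes from fenced primary terms."""

    def __init__(self):
        self.objects = {}
        self.min_writable_term = 0

    def fence_below(self, term):
        # One-time work at promotion; never lowers an existing fence.
        self.min_writable_term = max(self.min_writable_term, term)

    def put(self, term, key, data):
        if term < self.min_writable_term:
            raise PermissionError(f"term {term} is fenced")
        self.objects[(term, key)] = data
```

In this model the steady-state write path carries no extra read; the stale primary finds out it has been fenced only when its next upload is rejected.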
Thanks @andrross & @Bukhtawar for ideating on this.
Adding on to @Bukhtawar's point, we can revise the flow when the replica promotion happens -> Initialising shard (in Primary mode) can fetch the common path ( The flow during write request becomes -> Primary receives the request -> It stores the request in the indexing buffer and uploads the translog to the remote store. Translog prefix for remote store:
Agreed to this point. |
Regarding the no-op replica write option: how do you handle the case of an 'only' copy? What guarantees that a primary recovery won't happen until the partitioned-out primary has stopped writing? |
Thanks @muralikpbhat, yes, that's a good point. While tagging helps with performance, it certainly makes correctness and failover or propagation delays depend on support for these features across various cloud providers and their supported consistency models. The "Noop Replication" also serves to ping replica copies and tries to fail stale copies proactively, which otherwise would stay in the cluster for longer and cause problems on failovers.
We would rely on the no master block to kick in after X seconds(this time can be relaxed to include GC pauses as well). Once this duration has expired we would begin with the restore. Essentially we are relying on an implicit leasing by the master which gets expired once there is no master and the write block sets in. Today too we wait for a failed node to join back for 60s defined by the delay unassigned primary time |
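The implicit lease described above can be pictured with the small model below. The class and method names are hypothetical, times are plain numbers in seconds supplied by the caller, and the pause margin is an assumed parameter standing in for the GC-pause allowance mentioned in the comment.

```python
class ImplicitLease:
    """Hypothetical model of the lease implied by the no-master block."""

    def __init__(self, block_after_s, pause_margin_s):
        self.block_after_s = block_after_s      # "X seconds" in the comment
        self.pause_margin_s = pause_margin_s    # allowance for GC pauses

    def primary_may_ack(self, now, last_master_contact):
        # The isolated primary stops acknowledging once the block kicks in.
        return now - last_master_contact < self.block_after_s

    def restore_may_start(self, now, primary_lost_at):
        # Restore waits out the block timeout plus the pause allowance.
        return now - primary_lost_at >= self.block_after_s + self.pause_margin_s
```

With a 30 second block and a 10 second margin, for example, an isolated primary stops acknowledging at 30 seconds and a restore would not begin before 40 seconds, so the two windows do not overlap in this model.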
I think this point is the crux of the issue. Keeping the dependency from primary to replicas means that slow or otherwise misbehaving replicas can impact indexing, as you can only be as fast as your slowest replica. "No-op replication" will improve things substantially compared to the status quo given that the replicas only have to do a trivial validation (hence the "no-op"), but completely breaking this dependency is potentially a big win. That being said, the choice here isn't a one-way door. I think it might be reasonable to start with "no-op replication" as it is a safer, less intrusive change with fewer requirements on the remote store. @muralikpbhat What do you think? Is it reasonable to start with "no-op replication" or should we try to eliminate the primary->replica dependency from the start? |
Great points. I am also supportive of beginning with no-op replication. However, it would be ideal to design it in such a way that a remote storage developer can implement this differently if desired. Is it possible to abstract this out and provide no-op replication as the base implementation, which can be overridden? |
Thanks @itiyama for the critical points. Addressing a couple of them -
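One way to picture the abstraction being asked for is an interface with no-op replication as the default implementation and a store-backed override. This is a sketch only: `PrimaryTermValidator` and both subclasses are hypothetical and are not existing OpenSearch interfaces, and the callables passed in stand for whatever mechanism reports the latest known term.

```python
from abc import ABC, abstractmethod


class PrimaryTermValidator(ABC):
    """Hypothetical extension point; not an existing OpenSearch interface."""

    @abstractmethod
    def is_current(self, primary_term: int) -> bool:
        ...


class NoOpReplicationValidator(PrimaryTermValidator):
    """Base behaviour: ask every in-sync replica to confirm the term."""

    def __init__(self, replica_terms):
        self.replica_terms = replica_terms   # callables returning a term

    def is_current(self, primary_term):
        return all(get_term() <= primary_term for get_term in self.replica_terms)


class RemoteStoreValidator(PrimaryTermValidator):
    """Override for stores offering a consistent read of the latest term."""

    def __init__(self, read_latest_term):
        self.read_latest_term = read_latest_term

    def is_current(self, primary_term):
        return self.read_latest_term() <= primary_term
```

The write path would call `is_current` after persisting the operation and before acknowledging it, regardless of which implementation is plugged in.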
Noted.
In line with @Bukhtawar's response here, the stale writer would fail and auto restore (enhancement to #3145 ) would kick in reviving the red index - thus adhering to durability tenet.
This would be taken care as part of #3766. This needs to be thought through, but definitely there would be a need for an optimised approach here. |
@muralikpbhat Appreciate your suggestions. So, here is the proposal, we can go with no-op replication for 2.3 and keep it till GA. This will ensure that existing remote stores support segment/translog remote storage. Since correctness is super crucial and core to OpenSearch, it would make sense to get no-op replication approach implemented with durability and correctness honoured with definiteness. After GA, we can have an interface that we can expose as plugin independent from the remote store for Segments and Translog. That interface would need stricter guarantees from plugin extensions like conditional writes or ACL path based writes. However, we will focus on implementing no-op replication (as the only mechanism for handling divergent writes) correctly and with durability as the main tenet for 2.3 and GA. |
Let us be clear on why we are choosing the No-op replication approach. We rely on that approach because it is an existing mechanism of handling divergent writes and we don't want to build a new protocol at this point as any new protocol introduces complexity. The approach relies on primary replica interaction for its correctness. When the only copy of the data is lost, it marks the shard as unassigned. Now, we are digressing from that algorithm where we will recover from data store by introducing leasing selectively on "only copy" failures. Since the data store is not used as part of the agreement(primary term check) during writes in the regular case, it is difficult to say that this approach would work. We are swapping out one part of a complex distributed systems algorithm by introducing a well defined technique. I am very nervous about doing it without building a TLA+ model. If you fail the primary and mark it as unassigned, the protocol does not diverge from existing one. We can then provide customers an option of auto recovering from store via a setting by accepting data loss. Similarly, if you include the remote store in the agreement(primary term check by verifying the term on remote store), the protocol does not diverge. Another question I have about this: If you can make leasing protocol work correctly for a single primary + store, what stops you from making it work correctly with primary + replica + store. In that case, can we get rid of no-op replication altogether? |
Any new change to the replication protocol will need a TLA+ model regardless, IMO. This can be implemented via a daemon thread that checks for leader reachability, so that lease ownership can subsequently be checked before acknowledging writes.
We would ideally target a solution that guarantees durability.
The problem with putting remote store in agreement is that the remote store is just a storage extension of their respective current primary or isolated primary copies, however with no-replica copies case this exactly becomes the second option of "Remote Store". The plan going forward is to allow cloud providers to leverage durability without constraints on their consistency models. While also providing extensions to specific remote stores in future which can guarantee strong consistency and offer concurrency controls at low latencies. |
We need to handle the case of single shard, where an isolated shard fails itself so that auto restore from remote store can happen. |
Closing this and opening #5672 for tracking failing single shards when they are isolated. |
@ashking94 to your original concern here
How are we going to stop writes on a stale primary that is partitioned from the master during the no-master block? Can you please share the PR for it? |
@shwetathareja This has been planned for 2.7 release. I will be working on this shortly. #5672 is the issue for the same that you can use to track. |
Thanks @ashking94. I am expecting #5672 to generically address the case where the last copy of a shard was unassigned and the index went red, as there could be cases where the primary and replica got partitioned away from the master together, or the replica was unassigned. |
Fail requests on isolated shards - #6737 |
Is your feature request related to a problem? Please describe.
For a remote translog in NRTReplicationEngine we ideally don't need to forward requests to the replica since the translog has already been durably persisted to a remote store. However, in the event of a network partition where the primary and replica get partitioned off, the primary might continue to ack writes and the replica (now promoted to a new primary) does the same. This might continue for longer, especially with #3045, even when there is no active master. As a result, we can run into divergent writes and data integrity issues. To avoid this, we need a proposal to mitigate such risk.
Also see #3237
Describe the solution you'd like