Change default no_master_block from write to metadata_write #3045
Comments
@Bukhtawar Network partitioning could still be an issue when multiple nodes are partitioned off together, e.g. the two nodes hosting the primary and the replica of an index with 1 shard. One option I am thinking of: a background monitor always maintains the count of nodes in the cluster. When the no_master_block is applied, data nodes exchange this count and accept document writes only while the count stays the same; if the count changes at any point (it can only go down), they stop taking writes. They can figure out this count going down using |
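The count-based gate proposed in the comment above could look roughly like the sketch below. This is only an illustration of the idea as described: `WriteGate`, `observe`, `try_write` and the 201/503 status codes are hypothetical names and values, not anything that exists in OpenSearch, and how nodes would actually exchange counts is left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WriteGate:
    """Hypothetical per-data-node gate; not an OpenSearch class."""

    baseline_count: int            # node count recorded when the block was applied
    accepting_writes: bool = True

    def observe(self, peer_counts: Iterable[int]) -> None:
        # Counts reported by reachable data nodes. Any count below the
        # baseline closes the gate, and it stays closed afterwards.
        if any(count < self.baseline_count for count in peer_counts):
            self.accepting_writes = False

    def try_write(self) -> int:
        # 201 while the observed count is unchanged, 503 once it has dropped.
        return 201 if self.accepting_writes else 503


gate = WriteGate(baseline_count=5)
gate.observe([5, 5, 5])
status_before = gate.try_write()   # count unchanged, write accepted

gate.observe([5, 4])               # one peer now sees fewer nodes
status_after = gate.try_write()

gate.observe([5, 5])               # later agreement does not reopen the gate
status_later = gate.try_write()
```

Keeping the gate closed once a drop has been seen is a design choice of this sketch; the comment only says nodes "would stop taking writes" and does not say when they would resume.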
Wouldn't that return a 5xx to the caller, which should be okay? Even in the current scenario of |
@tharejas I think reads would return 5xx, as @gbbafna also pointed out, if both primary and replica shards are partitioned off. However, in this case we won't have divergent writes to start with, which would otherwise impact data consistency. Do you still have concerns with the proposed change without additional monitoring in place? |
Listing out different combinations, we should call these out clearly in our documentation:
|
Thanks @shwetathareja good callouts
I don't see a major concern here since the worst case is the same as completely losing the node, which is already the case where replication isn't configured. So this is no worse than what we have today, except that the node would accept more write requests and then drop them, where today they wouldn't have been ack'ed in the first place. But the point is that a single node without replication isn't available or durable, and hence the guarantees do not change with this.
I think if the original primary gets promoted then operations should be replayed to the replica. However, if the replica gets promoted, those dirty writes can be discarded as well, since the global checkpoint on the original primary would still reflect the pre-partition state and operations between the local and global checkpoints may be rolled back. We can however confirm the same.
I guess you meant "inconsistent", right? This would also be the case today irrespective of this change, since the original primary might not even know it has been demoted and a replica promoted, and will continue to serve reads on dirty writes.
Super callout!! Stale cluster state is poisonous, and when there is no master in place such nodes can stay in the cluster for a very long time (when a master is present, the lag detector causes such nodes to be kicked out within 90s). However, a stale cluster state with no master should be rare. But definitely worth calling this out explicitly. |
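To make the checkpoint argument in the comment above concrete, here is a deliberately simplified model: operations on the partitioned primary whose sequence number is above the global checkpoint are the "dirty" ones that may be rolled back. The function name and the dict-based operation records are hypothetical, and the real resync logic is more involved (the comment itself says this still needs confirming).

```python
def rollback_to_global_checkpoint(local_ops, global_checkpoint):
    """Split ops into those at or below the global checkpoint (kept)
    and those above it (candidates for rollback). Illustrative only."""
    kept = [op for op in local_ops if op["seq_no"] <= global_checkpoint]
    discarded = [op for op in local_ops if op["seq_no"] > global_checkpoint]
    return kept, discarded


# The partitioned primary has indexed seq_no 0..5 locally, but only
# seq_no 0..3 were acknowledged by the replica before the partition.
ops = [{"seq_no": n, "doc": f"doc-{n}"} for n in range(6)]
kept, discarded = rollback_to_global_checkpoint(ops, global_checkpoint=3)
```

In this example the local checkpoint would be 5 and the global checkpoint 3, so the two operations in between are the ones dropped.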
Closing as fixed #3621 |
Is your feature request related to a problem? Please describe.
When there is no master in the cluster, or a node is partitioned off the cluster, all writes see a 5xx. This leads to an availability drop, especially when there is a master quorum loss (writes do not need a quorum to succeed).
OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/NoMasterBlockService.java
Lines 74 to 81 in 1193ec1
Describe the solution you'd like
We can switch the default to fail only metadata writes, such as those that are part of index creation or dynamic mapping updates, and let other writes succeed. Dynamic mapping updates are not the recommended way to handle mappings.
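As a rough illustration of what the three values of the setting (`cluster.no_master_block`, as I understand the name) reject while there is no master, the table below models each value as a set of blocked operation levels. The level names and the `is_rejected` helper are assumptions made for this sketch, not the actual `NoMasterBlockService` code.

```python
# Assumed mapping from setting value to the operation levels it blocks.
NO_MASTER_BLOCK_LEVELS = {
    "all": {"read", "write", "metadata_read", "metadata_write"},
    "write": {"write", "metadata_write"},
    "metadata_write": {"metadata_write"},
}


def is_rejected(setting, required_levels):
    """True if the request needs any level that the setting blocks."""
    return bool(NO_MASTER_BLOCK_LEVELS[setting] & set(required_levels))


plain_index = {"write"}                       # document write, mapping already exists
dynamic_mapping = {"write", "metadata_write"}  # write that also needs a mapping update

old_default_plain = is_rejected("write", plain_index)
new_default_plain = is_rejected("metadata_write", plain_index)
new_default_dynamic = is_rejected("metadata_write", dynamic_mapping)
```

Under the proposed default, a plain document write goes through, while a write that triggers a dynamic mapping update is still rejected because it needs a metadata change.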
The caveat with this change is that if the node is partitioned off and no replication was configured, the partitioned node can, in rare cases, keep taking traffic and acknowledging all writes; reads might not be able to see those writes for as long as the node remains partitioned.
This, however, is not a problem with replicas enabled: when the primary starts to replicate, the replica node (which automatically gets promoted to primary post partition) rejects the writes coming from the partitioned node, ultimately failing the write request.
So in essence this should help in cases where quorum is lost transiently because more than one dedicated master is lost at once.
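The replica-side rejection described above can be sketched with a primary-term comparison: a promoted replica carries a higher term and refuses operations stamped with an older one. Using the primary term as the mechanism is my reading of how this works, and the class, function and status codes below are hypothetical, not OpenSearch APIs.

```python
class ReplicaShard:
    """Toy replica copy; tracks the highest primary term it knows about."""

    def __init__(self, primary_term):
        self.primary_term = primary_term
        self.docs = []

    def promote(self):
        # Promotion to primary after the partition bumps the term.
        self.primary_term += 1

    def apply_replicated(self, op_primary_term, doc):
        # Operations from a stale primary carry an older term and are refused.
        if op_primary_term < self.primary_term:
            return False
        self.docs.append(doc)
        return True


def write_via_primary(primary_term, replica, doc):
    # The write is acknowledged only if the replica accepts the operation.
    return 201 if replica.apply_replicated(primary_term, doc) else 503


replica = ReplicaShard(primary_term=1)
before_partition = write_via_primary(1, replica, "doc-a")

replica.promote()                  # replica becomes primary, term is now 2
after_partition = write_via_primary(1, replica, "doc-b")  # stale primary, term 1
```

With no replica configured there is nothing to perform this check, which is why the single-copy case in the caveat can keep acknowledging writes.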