Change default no_master_block from write to metadata_write #3045

Closed
Bukhtawar opened this issue Apr 22, 2022 · 6 comments

@Bukhtawar
Collaborator

Is your feature request related to a problem? Please describe.
When there is no master in the cluster, or a node is partitioned off from the cluster, all writes see a 5xx. This leads to an availability drop, especially when master quorum is lost (writes do not need a quorum to succeed).

// Current default: "write" blocks both document writes and metadata writes when there is no master.
public static final Setting<ClusterBlock> NO_MASTER_BLOCK_SETTING = new Setting<>(
    "cluster.no_master_block",
    "write",
    NoMasterBlockService::parseNoMasterBlock,
    Property.Dynamic,
    Property.NodeScope,
    Property.Deprecated
);

Describe the solution you'd like
We can switch the default to fail only metadata writes (i.e., index creation and dynamic mapping updates) but let other writes succeed. Dynamic mapping updates are not the recommended way to handle mappings anyway.
The caveat with this change is that if a node is partitioned off and no replication is configured, the partitioned node can, in rare cases, keep taking traffic and acknowledging all writes, while reads might not be able to see those writes for as long as the node remains partitioned.
This, however, is not a problem with a replica enabled: when the primary starts to replicate, the replica node (which automatically gets promoted to primary post-partition) rejects the writes coming from the partitioned node, ultimately failing the write request.
So, in essence, this should help cases where quorum is transiently lost due to losing more than one dedicated master at once.
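
For illustration, a minimal sketch of the changed default; everything here mirrors the snippet above, and only the default value moves from "write" to "metadata_write":

// Proposed default: block only metadata writes (index creation, mapping updates)
// when there is no master, while document writes continue to succeed.
public static final Setting<ClusterBlock> NO_MASTER_BLOCK_SETTING = new Setting<>(
    "cluster.no_master_block",
    "metadata_write",
    NoMasterBlockService::parseNoMasterBlock,
    Property.Dynamic,
    Property.NodeScope,
    Property.Deprecated
);

Since the setting is Dynamic, operators who prefer the old behaviour can still set cluster.no_master_block back to write at runtime.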


Bukhtawar added the enhancement and untriaged labels on Apr 22, 2022
@shwetathareja
Member

This, however, is not a problem with a replica enabled: when the primary starts to replicate, the replica node (which automatically gets promoted to primary post-partition) rejects the writes coming from the partitioned node, ultimately failing the write request.

@Bukhtawar Network partitioning could still be an issue when multiple nodes are partitioned, e.g. the 2 nodes hosting the primary and the replica both get partitioned off (as with an index with 1 shard and 1 replica).

One option I am thinking of: say there is a background monitor which always maintains a count of the number of nodes in the cluster. When the no_master_block is applied, data nodes exchange this count and only accept document writes while the count stays the same; if at any point the count changes (it can only go down), they stop taking writes. They can detect the count going down via NodeConnectionsService.java, when the connection to any other data node drops. Thoughts? (A rough sketch of this idea follows below.)
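
A hypothetical sketch of such a guard, for illustration only; the class and method names below are made up, and a real implementation would have to be wired into NodeConnectionsService and the no-master block handling:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical: tracks the data-node count captured when the no-master block
// was applied, and disallows document writes once that count drops.
public final class NoMasterWriteGuard {
    private final int countAtBlockTime;        // count exchanged by data nodes at block time
    private final AtomicInteger currentCount;

    public NoMasterWriteGuard(int countAtBlockTime) {
        this.countAtBlockTime = countAtBlockTime;
        this.currentCount = new AtomicInteger(countAtBlockTime);
    }

    // Invoked when the connection to another data node drops
    // (e.g. from a NodeConnectionsService listener).
    public void onDataNodeDisconnected() {
        currentCount.decrementAndGet();        // with no master, the count can only go down
    }

    // Document writes stay allowed only while the node count is unchanged.
    public boolean writesAllowed() {
        return currentCount.get() == countAtBlockTime;
    }
}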

@gbbafna
Collaborator

gbbafna commented May 30, 2022

@Bukhtawar Network partitioning could still be an issue when multiple nodes are partitioned, e.g. the 2 nodes hosting the primary and the replica both get partitioned off (as with an index with 1 shard and 1 replica).

Wouldn't that return a 5XX to the caller, which should be okay? Even in the current no_master_block scenario, it would return a 5XX to reads. With metadata_write, the writes would go through, and so would reads when they land on the partitioned nodes.

@Bukhtawar
Collaborator Author

@Bukhtawar Network partitioning could still be an issue when multiple nodes are partitioned, e.g. the 2 nodes hosting the primary and the replica both get partitioned off (as with an index with 1 shard and 1 replica).

@tharejas I think reads would return a 5xx, as @gbbafna also pointed out, if both primary and replica shards are partitioned off; however, in this case we won't have divergent writes to start with, which would otherwise impact data consistency. Do you still have concerns with the proposed change without additional monitoring in place?

@shwetathareja
Member

shwetathareja commented Jun 23, 2022

@gbbafna @Bukhtawar :

Listing out the different combinations; we should call these out clearly in our documentation:

  1. Index with a single shard, no replica, and the node holding that shard is partitioned off: it can continue to take writes and ack them to the user. Reads would succeed or fail depending on which node the request lands on. Later, when the master comes back, either that node rejoins the cluster or the index turns red. (Discussed above already.)
  2. Index with a replica configured, and the node holding the primary of a shard is partitioned off: the primary will continue to take writes, fail to connect to the replica, and try to inform the master to fail that replica. So it will not ack the writes back to the user, but it would have dirty writes which can be read later. When the master comes back, either this primary rejoins the cluster and the dirty writes are replayed on the replica (need to confirm!) or the replica is promoted to primary. We can add an integ test for this (a rough sketch follows this list).
  3. Index with a replica configured, and both primary and replica are partitioned off: reads and writes both would be consistent if the request lands in the partitioned network, or fail.
  4. A node with a stale cluster state starts taking writes: depending on where the request lands, a node with stale state can take the write and ack it to the user, but a subsequent read might not return those documents. Later, when the master comes back, there could be silent data loss, as the writes that were ack-ed by stale nodes would be lost.
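
For item 2, a rough sketch of what such an integ test could look like. This assumes the disruption utilities in OpenSearch's internal-cluster test framework (OpenSearchIntegTestCase, NetworkDisruption); treat the exact class names and signatures as assumptions rather than a working test:

import java.util.Set;

import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.disruption.NetworkDisruption;

// Sketch only: checks that a partitioned primary does not ack writes while the
// replica is unreachable, and what happens to its dirty writes after healing.
public class NoMasterDirtyWriteIT extends OpenSearchIntegTestCase {

    public void testPartitionedPrimaryDoesNotAckWrites() throws Exception {
        String masterNode = internalCluster().startMasterOnlyNode();
        String primaryNode = internalCluster().startDataOnlyNode();
        String replicaNode = internalCluster().startDataOnlyNode();

        prepareCreate("test").setSettings(Settings.builder()
            .put("index.number_of_shards", 1)
            .put("index.number_of_replicas", 1)).get();
        ensureGreen("test");

        // Partition the node holding the primary away from the master and the replica.
        NetworkDisruption disruption = new NetworkDisruption(
            new NetworkDisruption.TwoPartitions(Set.of(primaryNode), Set.of(masterNode, replicaNode)),
            NetworkDisruption.DISCONNECT);
        internalCluster().setDisruptionScheme(disruption);
        disruption.startDisrupting();

        // Index against the partitioned primary: with metadata_write the request is
        // taken locally but must not be acked, since the replica cannot be reached.
        // (Assert on the expected failure/timeout of the index request here.)

        disruption.stopDisrupting();
        ensureGreen("test");
        // Then assert whether the dirty writes were replayed to the replica or rolled back.
    }
}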

@Bukhtawar
Collaborator Author

Bukhtawar commented Jun 23, 2022

Thanks @shwetathareja, good callouts.

Index with a single shard, no replica, and the node holding that shard is partitioned off: it can continue to take writes and ack them to the user. Reads would succeed or fail depending on which node the request lands on. Later, when the master comes back, either that node rejoins the cluster or the index turns red. (Discussed above already.)

I don't see a major concern here, since the worst case is the same as completely losing the node, which is already the situation when replication isn't configured. So this case is no worse than what we can have today, except that the node would take more write requests and then drop them, where today they wouldn't have been ack'ed in the first place. But the point is that a single node without replication isn't available or durable anyway, hence the guarantees do not change with this.

Index with a replica configured, and the node holding the primary of a shard is partitioned off: the primary will continue to take writes, fail to connect to the replica, and try to inform the master to fail that replica. So it will not ack the writes back to the user, but it would have dirty writes which can be read later. When the master comes back, either this primary rejoins the cluster and the dirty writes are replayed on the replica (need to confirm!) or the replica is promoted to primary. We can add an integ test for this.

I think if the original primary gets promoted back, the operations should be replayed to the replica; however, if the replica gets promoted, those dirty writes can be discarded as well, since the global checkpoint on the original primary would still reflect the pre-partition state, and operations between the local and global checkpoint may be rolled back. For example, if the global checkpoint is at sequence number 100 and the partitioned primary locally accepted operations up to 120, operations 101-120 can be rolled back once the promoted replica's history wins. We can, however, confirm the same.

Index with a replica configured, and both primary and replica are partitioned off: reads and writes both would be consistent if the request lands in the partitioned network, or fail.

I guess you meant "inconsistent", right? This would also be the case today, irrespective of this change, since the original primary might not even know it has been demoted and a replica promoted, and will continue to serve reads on dirty writes.

A node with a stale cluster state starts taking writes: depending on where the request lands, a node with stale state can take the write and ack it to the user, but a subsequent read might not return those documents. Later, when the master comes back, there could be silent data loss, as the writes that were ack-ed by stale nodes would be lost.

Super callout!! A stale cluster state is poisonous, and when there is no master in place, such nodes can continue to stay in the cluster for a very long time (when a master is present, the lag detector causes such nodes to be kicked out within 90s). However, the likelihood of a stale cluster state with no master should be rare. Definitely worth calling this out explicitly, though.

@Bukhtawar
Collaborator Author

Closing as fixed by #3621
