Scale down #223
From today's sync-up:
I didn't know what 'scaling down doesn't work' meant in this issue, so I dug into it; there are three or four distinct failure modes when scaling down:
For reference, the Bitnami Helm chart does allow scaling down, but from my testing it does this by recreating the cluster rather than removing single nodes.
@harshac and I played about with the Helm chart today. It seems it has a persistent disk, so even when you scale the cluster and it recreates all of the nodes, it still persists any durable queues or persistent messages between iterations of the cluster.
Removing individual nodes would be much less disruptive with the introduction of maintenance mode.
RabbitMQ server does support permanent node removal from an existing cluster. Raft-based features such as quorum queues and the MQTT client ID tracker require such a node to also be explicitly removed from the Raft member list; see `rabbitmq-queues help delete_member` and `rabbitmqctl help decommission_mqtt_node`. The problematic part of downsizing is that the reduced number of nodes may or may not handle the same load on the system gracefully. A five-node cluster where all nodes run with the default open file handle limit of 1024 can sustain about 5000 connections, but a three-node cluster would not be able to do that.
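For reference, a minimal sketch of what that permanent removal looks like from the CLI, assuming a node named `rabbit@server-2` is being removed and a quorum queue `qq.orders` in vhost `/` has a replica on it (both names are placeholders):

```shell
# Remove the node from the Raft member list of each quorum queue that has a replica on it
rabbitmq-queues delete_member --vhost / qq.orders rabbit@server-2

# If the MQTT plugin is in use, remove the node from the client ID tracker as well
rabbitmqctl decommission_mqtt_node rabbit@server-2

# Stop the node being removed, then forget it from any remaining cluster node
rabbitmqctl -n rabbit@server-2 stop_app
rabbitmqctl forget_cluster_node rabbit@server-2
```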
It doesn't look like there is any native StatefulSet hook for specifically scaling down. The closest I've found is this issue, which covers our case exactly. It points out that there is no way to gracefully decommission a node in Kubernetes StatefulSets. We will likely have to implement custom logic in the controller for this. Perhaps we can detect in the controller where the …
Indeed, it'd be great to see that as a native feature, as described in that issue. For the time being, we could implement what you said, which is more or less what we already have when deleting the cluster: https://github.com/rabbitmq/cluster-operator/blob/main/controllers/rabbitmqcluster_controller.go#L471. We can keep the replicas count untouched until the pod/node is successfully deleted/forgotten. I'm not sure about this part:
Do you have a solution to scale down the pods at this moment? I edited the configuration on my RabbitMQ instance to go from 3 replicas to 1, but nothing happened.
Scale down is not supported by the cluster operator and it's not a planned feature for us at the moment. Reducing the number of replicas is ignored by the cluster operator, and if you check the operator logs and the published events for your RabbitmqCluster, there should be a line saying "Cluster Scale down not supported".
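A quick way to find that message, assuming a default install where the operator runs as the `rabbitmq-cluster-operator` Deployment in the `rabbitmq-system` namespace and the cluster is named `my-rabbit` (adjust both names to your setup):

```shell
# Events published for the RabbitmqCluster resource
kubectl describe rabbitmqcluster my-rabbit

# Operator logs
kubectl -n rabbitmq-system logs deploy/rabbitmq-cluster-operator
```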
Just ran across this. I have a development, single-node cluster set up that has 20-ish devs, each with their own username and vhost. While doing some testing, I accidentally scaled it up to a 3-node cluster, and now I have no way to bring it back down to one except to destroy and re-create it. Not a big deal, except I'll have to reconfigure all the users, vhosts and their permissions all over again. Given how easy it is to change replicas from 1 to 3, it would have been nice to have a similar experience going from 3 to 1 (at least in a non-production setting).
Unfortunately, making scale-down easy is exactly what we don't want to do until it's well supported (because it would make it easy to lose data). I understand that in your case it's a non-production system, but I don't see how we could make it easy only in that case. Having said that, if you are only concerned with the definitions (vhosts, users, permissions, etc.), you can easily take care of them by exporting the definitions and importing them into a new cluster. Something like this:
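A minimal sketch of that export/import flow, assuming a cluster named `my-rabbit` whose pods follow the operator's `<name>-server-N` naming (all names, file paths and the replacement manifest are placeholders):

```shell
# Export definitions (users, vhosts, permissions, queues, policies) from the existing cluster
kubectl exec my-rabbit-server-0 -- rabbitmqctl export_definitions /tmp/defs.json
kubectl cp my-rabbit-server-0:/tmp/defs.json ./defs.json

# Delete the cluster and recreate it with the desired (smaller) replica count
kubectl delete rabbitmqcluster my-rabbit
kubectl apply -f my-rabbit-single-node.yaml   # assumed manifest with replicas: 1

# Import the definitions into the new cluster once it is up
kubectl cp ./defs.json my-rabbit-server-0:/tmp/defs.json
kubectl exec my-rabbit-server-0 -- rabbitmqctl import_definitions /tmp/defs.json
```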
It's a bit mundane but will do the trick. You can replace the last few steps with an import on node startup (see the example); a sketch of that variant is below. Keep in mind two caveats: …
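The import-on-startup variant boils down to pointing `load_definitions` at a definitions file. A sketch using the cluster-operator's `additionalConfig` field; the definitions file still has to be made available inside the container (e.g. via a mounted ConfigMap), which is omitted here, and all names are placeholders:

```shell
kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit
spec:
  replicas: 1
  rabbitmq:
    additionalConfig: |
      load_definitions = /path/to/defs.json
EOF
```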
Hi @mkuratczyk, I just want to know: is there any update on this feature?
It's not. Please provide the details of your situation and what you'd expect to happen, and we can try to provide a workaround or take such a use case into account when working on this.
Why do you want to scale down to zero? RabbitMQ is a stateful, clustered application. It can't start in a fraction of a second when an application tries to connect. Is this some test environment that you want to turn off for the weekend or something like that? This question is what I meant by your use case. If it is a test env, there are a few things you can try:
But again - please let us know what you are trying to achieve. RabbitMQ is more like a database. I'm not sure why you would run nginx in a StatefulSet, but they are fundamentally different applications, so you can't just expect to manage them the same way.
For a situation where queue messages are not important, can we do this?
Please read through this thread and try if you want. I'd expect you to still need to run … You didn't share your use case either. Why do you want to go from 3 nodes to 2? Two-node RabbitMQ clusters are hardly supported in the first place. Quorum queues and streams require 3 nodes (or just 1 for test envs). If you don't care about your messages, there are other options (running a single node in the first place, or running a cluster with no persistent storage).
I am working with @yuzsun. Our use case: this is a test environment and we want to stop the Kubernetes cluster running RabbitMQ during the weekend. I will look into the "storage: 0" option.
I tested that and it worked well without even calling …
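For reference, a sketch of that "storage: 0" idea as a RabbitmqCluster manifest; the cluster name is a placeholder, and since no persistent volume is requested, all data is disposable by design:

```shell
kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-test-rabbit
spec:
  replicas: 3
  persistence:
    storage: "0"
EOF
```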
@AminSojoudi There's a workaround to scale down, assuming you are OK with potential data loss. If the 4th instance never moved from …
For your use case @ksooner, you can skip steps 3-5, since you simply want to scale to 0 over the weekend. The PVC will remain if you scale down to 0, so previous durable exchanges/queues will still be there when you scale back to 3. Also, you may get Pods in … I'd like to re-state that scale down is not supported by the Operator; this workaround risks data loss and effectively bypasses most, if not all, data safety measures we've incorporated in the Operator.
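As a rough sketch, the scale-to-zero-for-the-weekend part could look like the following, assuming the cluster is named `my-rabbit` (so the operator's StatefulSet is `my-rabbit-server`). Whether the Operator reverts the change on its next reconciliation depends on the Operator version and configuration, so treat this purely as an illustration:

```shell
# Friday evening: scale the underlying StatefulSet down to zero
kubectl scale statefulset my-rabbit-server --replicas=0

# Monday morning: scale back up; the PVCs were kept, so durable state survives
kubectl scale statefulset my-rabbit-server --replicas=3
```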
This was really helpful, thank you for sharing.
@mkuratczyk I'm using K8s on bare metal and I'm not using shared storage operators. I have two cases where scaling down looks like it can help:
As I can't just move PVs between K8s nodes, I have to scale the cluster up and down for each node, one by one. For case #1 I can first scale the cluster up by one new node, then scale it down by one old node. Are these valid cases for you? Or can I do something different?
Just to be clear, I know there are cases where scale-down may be helpful. It's just that it's a hard problem to solve in a generic way (that will work for different cases, with different queue types and so on).
A good case for why scale-down is useful is cost saving. For example, scaling all pods in a development namespace down to 0 during inactive hours, thus reducing the required node count. If the operator does not support it, then we have a dangling pod left in the namespace.
We realize there are use cases; they just don't make the problem any simpler. :) Development environments are the easiest case unless you have some special requirements. What's the benefit of scaling down to zero compared to just deleting and redeploying the cluster the following day? You can keep your definitions in a JSON file, or also as Kubernetes resources, so that the cluster has all the necessary users, queues and so on when it starts.
The other case for cluster scaling down: … Looping in the experts @mkuratczyk @Zerpet for awareness.
Should we first label the RabbitmqCluster with pauseReconcile=true to make the operator stop watching the cluster?
@xixiangzouyibian RabbitMQ cluster membership and quorum queue membership are separate concerns. Scaling down the RabbitMQ cluster cleanly would require scaling down quorum queues as well. However, for quorum queue membership changes, a quorum of nodes needs to be available. For situations like the one you described, an API was recently added to force a single quorum queue (a Ra/Raft cluster) member to assume it's the only member now: rabbitmq/ra#306. This is a very dangerous operation and will likely lead to data loss. You should not lose 3 out of 5 members in the first place.
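For the normal path, where a quorum of members is still available, quorum queue membership can be adjusted per node with the standard CLI commands; a sketch, with `rabbit@server-2` as a placeholder node name:

```shell
# Add quorum queue members on a node that is joining or taking over load
rabbitmq-queues grow rabbit@server-2 all

# Remove a node from the member list of every quorum queue it participates in
rabbitmq-queues shrink rabbit@server-2
```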
Yes, I'm aware I can successfully declare a quorum queue on a single RabbitMQ node now. Thanks for this PR.
What I mean is that forgetting a RabbitMQ cluster node doesn't automatically remove it from the Raft cluster of the quorum queue(s). That would need to be handled separately, and the PR I linked to allows you to do that (it basically tells one quorum queue / Raft node "you are the only survivor, forget all other nodes and become the leader", which means that if that node wasn't up to date with the old leader, the delta will be lost). Declaring quorum queues on a single-node RabbitMQ has always been possible; that's unrelated to this PR.
Let me tidy it up:
If you lose quorum, the queue is not available without human intervention (since RabbitMQ doesn't know that the nodes were lost forever). The PR provides a way to intervene, by selecting a single queue member as the new leader and only member. The only difference when you have two survivors is that you can pick one of them to be the only one / new leader, so if their state differs, you can pick the more up-to-date one. Normally this is handled by leader election, but the normal election process can't take place since we lost quorum.
Got it, thanks a lot!
Seems like 3.12 should have it. I haven't tried. And once again: this is a last-resort kind of solution. Do not use it unless you are sure you want to.
The current version of the RabbitMQ controller doesn't support scaling down; directly reducing the replicas of the stateful set can lead to quorum loss and even data loss. More details in [this issue](rabbitmq/cluster-operator#223). Thus, we decided to disallow downscale for the stateful set in our implementation. Using the [validation rule](https://github.com/vmware-research/verifiable-controllers/blob/f5236647bf4fb26daa1359fde3c61a282a886735/src/controller_examples/rabbitmq_controller/spec/rabbitmqcluster.rs#L108) can guarantee that updating the deployment won't decrease the replicas.

But here is a corner case: a workaround for the downscale operation is to delete the current deployment and create a new one with fewer replicas, which doesn't violate the validation rule. If the execution happens in this way, chances are that the old stateful set may not have been deleted by the garbage collector yet when the controller tries to update the stateful set from the new cr, which has a smaller `replicas` field. Thus, the controller implementation still needs to compare the old and new replicas before updating the stateful set to make sure scaling down doesn't happen. This makes the proof a lot more difficult, because we have to show that the replicas of the old stateful set are no larger than those of the current cr, whose proof requires us to show that if a stateful set has an owner reference pointing to some cr, its replicas are no larger.

Therefore, we decided to let the controller wait for the garbage collector to delete the old stateful set, which avoids the corner case and does not introduce too much complexity to the proof, since there is no need to compare the stateful sets anymore. In this case, if the old stateful set doesn't have an owner reference pointing to the current cr, the reconcile will simply go to the error state and wait for the next round of reconcile. To make the liveness proof work, I change `the_object_in_reconcile_has_spec_as` into `the_object_in_reconcile_has_spec_and_uid_as` so that the owner_references can also be derived from the desired custom resource.

The remaining work is as follows:
1. Add an eventual safety property showing that `spec /\ []desired_state_is(cr) |= true ~> (sts_key_exists => sts.owner_references.contains(cr.owner_ref()))`.
2. Add `[](sts_key_exists => sts.owner_references.contains(cr.owner_ref()))` to the assumption of the spec.
3. Add reasoning about the steps after creating/updating the server config map; the stateful set part should be similar to zookeeper as long as 1 and 2 are done.

Signed-off-by: Wenjie Ma <rebekah1368@gmail.com>
Is your feature request related to a problem? Please describe.
Scaling down a RabbitMQ cluster does not work at the moment. We should look into why it fails and what we need to do to support scaling down.
Additional context
- `rabbitmqctl forget_cluster_node` might be needed before we delete pods
- `preStop` hooks (a sketch follows)
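One way the `preStop` idea could look, purely as a sketch: the container name `rabbitmq`, the StatefulSet name `my-rabbit-server` and the use of `rabbitmq-upgrade drain` (which puts the node into maintenance mode) are all assumptions, the Operator manages this StatefulSet and may overwrite or already define its own hook, and forgetting the node would still have to happen from a surviving node afterwards:

```shell
# Add a preStop hook that drains the node before the kubelet stops the container
# (illustrative only; the default strategic merge patch merges the container entry by name)
kubectl patch statefulset my-rabbit-server -p '
spec:
  template:
    spec:
      containers:
      - name: rabbitmq
        lifecycle:
          preStop:
            exec:
              command: ["rabbitmq-upgrade", "drain"]
'
```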