|
| 1 | += Introduce an un-manage cluster mechanism in tendrl |
| 2 | + |
| 3 | +The intent of this change is to introduce un-manage cluster functionality in |
| 4 | +tendrl. This keeps the cluster known to tendrl but no longer managed, meaning |
| 5 | +that monitoring, alerting and management of the cluster are no longer possible |
| 6 | +from tendrl. At a later stage (if required) the admin can re-import the cluster |
| 7 | +to start managing it again. |
| 8 | + |
| 9 | +The un-manage functionality is helpful when the admin wants to bring down the |
| 10 | +cluster for critical maintenance activities and doesn't want monitoring etc. |
| 11 | +to be performed during that period. |
| 12 | + |
| 13 | +== Problem description |
| 14 | + |
| 15 | +There are situations when the admin needs to perform critical maintenance on |
| 16 | +the cluster and doesn't want any monitoring etc. taking place during this |
| 17 | +period. Also, if the admin decides to dismantle the cluster at some stage, we |
| 18 | +need a mechanism to mark the cluster as un-managed from the tendrl side. |
| 19 | + |
| 20 | +Tendrl should also provide a way to re-import the cluster at a later stage if |
| 21 | +the admin wants to. The process should be seamless and require little or no |
| 22 | +manual intervention. |
| 23 | + |
| 24 | + |
| 25 | +== Use Cases |
| 26 | + |
| 27 | +This addresses un-managing a cluster and re-importing it at a later stage. |
| 28 | +The un-manage functionality in tendrl needs to take care of the following: |
| 29 | + |
| 30 | +* Un-install any components which got installed as part of tendrl managing the |
| 31 | +storage nodes and disable the services |
| 32 | +* Set the cluster state properly so that the cluster is marked and listed as |
| 33 | +un-managed in UI dashboards. No operations should be allowed on the un-managed |
| 34 | +cluster and there should not be any monitoring, alerting or entity management |
| 35 | +supported on this cluster anymore |
| 36 | +* The user should have an option to re-import the cluster later if needed, and |
| 37 | +it should work seamlessly as usual |
| 38 | + |
| 39 | + |
| 40 | +== Proposed change |
| 41 | + |
| 42 | +* On un-manage cluster, start a flow in the tendrl server node's node-agent |
| 43 | +which creates child jobs on storage nodes to stop tendrl specific services like |
| 44 | +collectd and tendrl-gluster-integration |
| 45 | + |
| 46 | +* Mark the cluster flag `is_managed` as `False` so that the cluster can be |
| 47 | +listed as un-managed in UI dashboards and all the possible actions can be |
| 48 | +disabled for it |
| 49 | + |
| 50 | +* Archive the graphite (monitoring) data for the cluster in an archive location |
| 51 | +so the grafana dashboards don't list the cluster and its entities anymore |
| 52 | + |
| 53 | +* Delete the grafana alert dashboards for the cluster and its dependent entities |
| 54 | + |
| 55 | +The logic is as follows: |
| 56 | + |
| 57 | +** Start a flow in node-agent on the tendrl server node for un-manage cluster |
| 58 | + |
| 59 | +** The first atom of the above flow invokes child jobs on the storage nodes' |
| 60 | +node-agents to stop tendrl specific services and mark them disabled |
| 61 | + |
| 62 | +** The main atom of the un-manage cluster flow removes any etcd details for |
| 63 | +the cluster and then marks the cluster `is_managed` flag as `False` |
| 64 | + |
| 65 | +** Another atom of the un-manage cluster flow invokes a flow in |
| 66 | +monitoring-integration to archive the graphite data for the cluster |
| 67 | + |
| 68 | +** Finally another atom invokes a flow in monitoring-integration to remove the |
| 69 | +grafana alert dashboards for the cluster and its dependent entities |
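
The ordering of the steps above can be sketched in plain Python. This is only an illustration of the sequence, not the tendrl implementation: the `Cluster` class, the four atom functions and `run_unmanage_flow` are hypothetical names, and each atom merely records what the real atom would do.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for the cluster record kept by tendrl."""
    integration_id: str
    is_managed: bool = True
    log: list = field(default_factory=list)


def stop_services_on_storage_nodes(cluster):
    # The spec describes child jobs on each storage node's node-agent;
    # here we only record the step.
    cluster.log.append("stopped and disabled tendrl services")
    return True


def delete_cluster_details(cluster):
    # Remove any etcd details for the cluster, then flip the flag.
    cluster.log.append("removed etcd details")
    cluster.is_managed = False
    return True


def archive_graphite_data(cluster):
    # The spec delegates this to a flow in monitoring-integration.
    cluster.log.append("archived graphite data")
    return True


def remove_alert_dashboards(cluster):
    # The spec delegates this to a flow in monitoring-integration.
    cluster.log.append("removed grafana alert dashboards")
    return True


# Order follows the description above (assumed to be strictly sequential).
UNMANAGE_ATOMS = [
    stop_services_on_storage_nodes,
    delete_cluster_details,
    archive_graphite_data,
    remove_alert_dashboards,
]


def run_unmanage_flow(cluster):
    """Run each atom in order and stop at the first one that fails."""
    for atom in UNMANAGE_ATOMS:
        if not atom(cluster):
            return False
    return True
```

Stopping at the first failing atom is an assumption made here for simplicity; the spec does not say how partial failures should be handled.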
| 70 | + |
| 71 | +The structure of the un-manage cluster flow would look something like this: |
| 72 | + |
| 73 | +``` |
| 74 | +UnmanageCluster: |
| 75 | +  tags: |
| 76 | +    - "tendrl/monitor" |
| 77 | +  atoms: |
| 78 | +    - tendrl.objects.Cluster.atoms.StopMonitoringServices |
| 79 | +    - tendrl.objects.Cluster.atoms.StopIntegrationServices |
| 80 | +    - tendrl.objects.Cluster.atoms.DeleteClusterDetails |
| 81 | +    - tendrl.objects.Cluster.atoms.DeleteMonitoringDetails |
| 82 | +  help: "Unmanage a Gluster Cluster" |
| 83 | +  enabled: true |
| 84 | +  inputs: |
| 85 | +    mandatory: |
| 86 | +      - TendrlContext.integration_id |
| 87 | +  run: tendrl.flows.UnmanageCluster |
| 88 | +  type: Update |
| 89 | +  uuid: 2f94a48a-05d7-408c-b400-e27827f4efed |
| 90 | +  version: 1 |
| 91 | +``` |
| 92 | + |
| 93 | +=== Alternatives |
| 94 | + |
| 95 | +None |
| 96 | + |
| 97 | +=== Data model impact |
| 98 | + |
| 99 | +None |
| 100 | + |
| 101 | +=== Impacted Modules: |
| 102 | + |
| 103 | +==== Tendrl API impact: |
| 104 | + |
| 105 | +* Introduce an API `clusters/{int-id}/unmanage` for triggering an un-manage |
| 106 | +cluster flow |
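
A minimal sketch of how a client might build a request for this end point, using the `clusters/{int-id}/unmanage` form given in the Testing section. The HTTP method (`POST`), the bearer-token header and the `base_url` value are assumptions for illustration; the spec does not state them. `build_unmanage_request` is a hypothetical helper.

```python
import urllib.request


def build_unmanage_request(base_url, integration_id, token):
    """Build (but do not send) a request to un-manage a cluster.

    Assumptions: POST method, bearer-token auth, and that base_url
    points at the tendrl API root.
    """
    url = "{}/clusters/{}/unmanage".format(
        base_url.rstrip("/"), integration_id
    )
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": "Bearer " + token},
    )
```

The request can then be sent with `urllib.request.urlopen(req)`. The response structure is not defined in this spec, so it is not parsed here.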
| 107 | + |
| 108 | +==== Notifications/Monitoring impact: |
| 109 | + |
| 110 | +* A flow to archive the cluster specific graphite data |
| 111 | + |
| 112 | +* A flow to remove the grafana alerts dashboards for the cluster and its |
| 113 | +dependent entities |
| 114 | + |
| 115 | +* Raise an alert once the cluster is un-managed, with details like where to |
| 116 | +look for old graphite data etc. |
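
The archive step could be as simple as moving the cluster's metrics directory to an archive location and returning the new path, so the alert can say where the old data lives. The sketch below assumes a `clusters/<integration_id>` layout under the data root and a timestamped destination name; both are hypothetical, as is `archive_cluster_metrics`.

```python
import shutil
import time
from pathlib import Path


def archive_cluster_metrics(data_root, archive_root, integration_id):
    """Move a cluster's metrics directory into the archive location.

    Returns the archive path, or None if there is nothing to archive.
    The directory layout used here is an assumption.
    """
    src = Path(data_root) / "clusters" / integration_id
    if not src.is_dir():
        return None
    # Timestamp the destination so repeated un-manage cycles don't collide.
    stamp = time.strftime("%Y%m%d%H%M%S")
    dest = Path(archive_root) / "clusters" / "{}_{}".format(
        integration_id, stamp
    )
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest
```

Moving rather than copying means the dashboards stop listing the cluster as soon as the source directory is gone, while the data remains available for later inspection.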
| 117 | + |
| 118 | +==== Tendrl/common impact: |
| 119 | + |
| 120 | +* An un-manage cluster flow to be targeted at the tendrl server node |
| 121 | + |
| 122 | +==== Tendrl/node_agent impact: |
| 123 | + |
| 124 | +None |
| 125 | + |
| 126 | +==== Sds integration impact: |
| 127 | + |
| 128 | +None |
| 129 | + |
| 130 | +==== Tendrl Dashboard impact: |
| 131 | + |
| 132 | +* UX requirements for invoking an un-manage cluster flow for an existing cluster |
| 133 | +are captured at https://redhat.invisionapp.com/share/8QCOEVEY9 |
| 134 | + |
| 135 | +=== Security impact: |
| 136 | + |
| 137 | +None |
| 138 | + |
| 139 | +=== Other end user impact: |
| 140 | + |
| 141 | +The user gets an option to un-manage an existing cluster and can re-import it |
| 142 | +at a later stage |
| 143 | + |
| 144 | +=== Performance impact: |
| 145 | + |
| 146 | +None |
| 147 | + |
| 148 | +=== Other deployer impact: |
| 149 | + |
| 150 | +The tendrl-ansible module needs to provide a mechanism to set up tendrl |
| 151 | +components and dependencies on an additional new node in the cluster. |
| 152 | + |
| 153 | +<TBD> details of the playbooks etc. to be added here. |
| 154 | + |
| 155 | +=== Developer impact: |
| 156 | + |
| 157 | +None |
| 158 | + |
| 159 | + |
| 160 | +== Implementation: |
| 161 | + |
| 162 | +* https://github.com/Tendrl/commons/issues/797 |
| 163 | + |
| 164 | + |
| 165 | +=== Assignee(s): |
| 166 | + |
| 167 | +Primary assignee: |
| 168 | + shtripat |
| 169 | + mbukatov |
| 170 | + |
| 171 | +=== Work Items: |
| 172 | + |
| 173 | +* https://github.com/Tendrl/specifications/issues/252 |
| 174 | + |
| 175 | + |
| 176 | +== Dependencies: |
| 177 | + |
| 178 | +None |
| 179 | + |
| 180 | +== Testing: |
| 181 | + |
| 182 | +* Check if the UI dashboard has an option to trigger the un-manage cluster flow |
| 183 | + |
| 184 | +* Check if the flow gets completed successfully and verify that the grafana |
| 185 | +dashboards no longer show cluster details for the selected cluster |
| 186 | + |
| 187 | +* Verify that no grafana alert dashboards are available now for the un-managed |
| 188 | +cluster |
| 189 | + |
| 190 | +* Verify that the clusters list reports the cluster as un-managed and the |
| 191 | +import option is enabled now |
| 192 | + |
| 193 | +* Try to import the cluster back and it should be successful. All grafana |
| 194 | +dashboards, grafana alert dashboards and the UI should show the cluster again |
| 195 | + |
| 196 | +* Invoke the REST end point `clusters/{int-id}/unmanage` and the cluster should |
| 197 | +be un-managed successfully |
| 198 | + |
| 199 | + |
| 200 | +== Documentation impact: |
| 201 | + |
| 202 | +* The new un-manage cluster feature should be documented with details of what |
| 203 | +gets disabled / removed when a cluster is un-managed |
| 204 | + |
| 205 | +* New API end point should be documented with sample input / output structures |
| 206 | + |
| 207 | +== References: |
| 208 | + |
| 209 | +* https://redhat.invisionapp.com/share/8QCOEVEY9 |
| 210 | + |
| 211 | +* https://github.com/Tendrl/commons/pull/798 |
| 212 | + |
| 213 | +* https://github.com/Tendrl/monitoring-integration/pull/317 |