Added spec specs/unmanage_cluster.adoc #255
base: master
Changes from all commits
c21d852
eafcc91
53fdc5d
08cad58
1552da3
0887eb1
b25a4b1
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,373 @@ | ||
| = Introduce an un-manage cluster mechanism in tendrl | ||
|
|
||
| The intent of this change is to introduce an un-manage cluster functionality in | ||
| tendrl. This makes the cluster known to tendrl but not managed anymore, meaning | ||
| the monitoring, alerting and management of the cluster are no longer possible from | ||
| tendrl. At a later stage (if required) the admin can decide to re-import the cluster | ||
| to start managing it again. | ||
|
|
||
| The un-manage functionality is helpful for scenarios where the admin wants to bring | ||
| down the cluster for some critical maintenance activities and doesn't want | ||
| monitoring etc. to be performed for that period. | ||
|
|
||
| Also, when a cluster import fails, the user might need to resolve the issues | ||
| reported by the failed import and then re-import the cluster. This flow needs | ||
| an un-manage of the cluster first and then a fresh import of the cluster. | ||
|
|
||
| == Problem description | ||
|
|
||
| There are situations when the admin needs to perform critical maintenance of the | ||
| cluster and does not want any monitoring etc. taking place during this period. Also, | ||
| if the admin decides to dismantle the cluster at some stage, we should have a mechanism | ||
| by which the cluster can be marked as un-managed from the tendrl side. | ||
|
|
||
| Tendrl should also provide a provision to re-import the cluster at a later stage | ||
| if the admin wants to. The process should be seamless and require little or no | ||
| manual intervention. | ||
|
|
||
| In case there is a failure in import cluster, tendrl needs to provide an option | ||
| to un-manage and import the cluster again. | ||
|
|
||
|
|
||
| == Use Cases | ||
|
|
||
| This addresses un-managing a cluster and re-importing an un-managed cluster at a later | ||
| stage. The un-manage functionality in tendrl needs to take care of the things below: | ||
|
|
||
| * Stop any services which got started as part of tendrl managing the storage | ||
| nodes and disable the services | ||
|
|
||
| * Set the cluster state properly so that the same is marked and listed as | ||
| un-managed in UI dashboards. No operations should be allowed on the un-managed | ||
| cluster and there should not be any monitoring, alerting or entities management | ||
| supported on this cluster anymore | ||
|
|
||
| * User should have an option to re-import the cluster if needed later and it | ||
| should seamlessly work as usual | ||
|
|
||
| * User should have an option to un-manage a cluster whose import failed and import it | ||
| again in tendrl | ||
|
|
||
|
|
||
| == Proposed change | ||
|
|
||
| * On un-manage cluster, start a flow in the tendrl server node's node-agent which | ||
| creates child jobs on storage nodes to stop tendrl specific services like | ||
| collectd and tendrl-gluster-integration | ||
|
|
||
| * Mark the cluster flag `is_managed` as `False` so that the cluster could be | ||
| listed as un-managed in UI dashboards and all the possible actions could be | ||
| disabled for it | ||
|
|
||
| * Delete cluster entity details from tendrl central store | ||
|
|
||
| * Archive the graphite (monitoring) data for the cluster in an archive location so | ||
| the grafana dashboards don't list the cluster and its entities anymore | ||
|
|
||
| * Delete the grafana alert dashboards for the cluster and its dependent entities | ||
|
|
||
| The logic here goes as follows: | ||
|
|
||
| ** Start a flow in node-agent on tendrl server node for un-manage cluster | ||
|
|
||
| ** The first atom of the above flow invokes child jobs on the storage nodes' | ||
| node-agents to stop tendrl specific services and mark them disabled | ||
|
|
||
| ** In the main atom of the un-manage cluster flow, remove any etcd details for | ||
| the cluster, if present, and then mark the cluster `is_managed` flag as `False` | ||
|
|
||
| ** One of the atoms of the un-manage cluster flow now invokes a flow in | ||
| monitoring-integration to archive the graphite data for the cluster | ||
|
|
||
| ** Finally another atom invokes a flow in monitoring-integration to remove the | ||
| grafana alert dashboards for the cluster and its dependent entities | ||
|
|
||
| So the structure of the un-manage cluster flow would look something like the below | ||
|
|
||
| ``` | ||
| UnmanageCluster: | ||
| tags: | ||
| - "tendrl/monitor" | ||
| atoms: | ||
| - tendrl.objects.Cluster.atoms.StopMonitoringServices | ||
| - tendrl.objects.Cluster.atoms.StopIntegrationServices | ||
| - tendrl.objects.Cluster.atoms.DeleteClusterDetails | ||
| - tendrl.objects.Cluster.atoms.DeleteMonitoringDetails | ||
| help: "Unmanage a Gluster Cluster" | ||
| enabled: true | ||
| inputs: | ||
| mandatory: | ||
| - TendrlContext.integration_id | ||
| run: tendrl.flows.UnmanageCluster | ||
| type: Update | ||
| uuid: 2f94a48a-05d7-408c-b400-e27827f4efed | ||
| version: 1 | ||
| ``` | ||
|
|
||
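The atom ordering in the flow definition above can be illustrated with a minimal Python sketch. This is not Tendrl's flow engine: the `run_flow` helper and the stand-in atom functions are hypothetical, and only the atom names are taken from the definition. The sketch also assumes that the flow stops at the first atom that fails.

```python
from typing import Callable, Dict, List

# Atom names as listed in the UnmanageCluster flow definition.
ATOMS: List[str] = [
    "StopMonitoringServices",
    "StopIntegrationServices",
    "DeleteClusterDetails",
    "DeleteMonitoringDetails",
]


def run_flow(atoms: List[str], impl: Dict[str, Callable[[str], bool]],
             integration_id: str) -> List[str]:
    """Run atoms in order; stop at the first one that reports failure.

    Returns the names of the atoms that completed successfully.
    (Hypothetical helper, not part of tendrl.)
    """
    completed: List[str] = []
    for name in atoms:
        if not impl[name](integration_id):
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins: each atom just reports success.
stub_impl = {name: (lambda _id: True) for name in ATOMS}
completed = run_flow(ATOMS, stub_impl, "example-integration-id")
```

Returning the list of completed atoms makes it easy to see how far a failed un-manage got, which matters for the retry case raised in the review.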
| * While the import flow is in progress, the values of `current_job` and `status` | ||
| should be set to `{'job_id': 'import job id', 'job_name': 'ImportCluster', | ||
| 'status': 'in_progress'}` and `Importing` respectively | ||
|
|
||
| * Once import flow is successful the value of `status` would be set as `done` | ||
|
|
||
| * If import flow fails the value of `status` would be set as `failed` | ||
|
|
||
| * While the un-manage flow is in progress, the values of `current_job` and `status` | ||
| should be set to `{'job_id': 'unmanage job id', 'job_name': 'UnmanageCluster', | ||
| 'status': 'in_progress'}` and `Unmanaging` respectively | ||
|
|
||
| * Once un-manage flow is successful the value of `status` would be set as `done` | ||
|
|
||
| * If un-manage flow fails the value of `status` would be set as `failed` | ||
|
|
||
| * If an import cluster fails, the tendrl UI needs to keep the import cluster option | ||
| available. If the user selects it, the UI should show a dialog describing the | ||
| previous import failure. If the user confirms to go ahead with un-manage and | ||
| then import, the UI should submit an un-manage cluster task first. If the | ||
| un-manage cluster task succeeds, the UI should then submit an import for the same | ||
| cluster | ||
|
I have a question here. If user chooses to click on import button again how will the
Member
@ltrilety If an unmanage ran successfully, then import ran successfully/unsuccessfully, it will show the import. If an unmanage did not run successfully (then import does not run), it will show the unmanage. |
||
|
|
||
| * The UI needs a client side storage option to retain the previous un-manage | ||
| cluster task-id for reference and for showing the details of the tasks in the UI | ||
|
|
||
| * If there is an import failure for a cluster and the user tries import again, | ||
| then after user confirmation the UI submits two tasks one after the other: first | ||
| un-manage cluster and, after it succeeds, import cluster. The UI should retain the | ||
| details of both tasks for display | ||
|
|
||
|
|
||
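The un-manage-then-import sequence described for a previously failed import can be sketched as follows. This is a minimal illustration in Python rather than UI code. The `reimport_after_failure` function and the two submitter callables are hypothetical, and each submitter is assumed to return a task id together with a success flag.

```python
from typing import Callable, Dict, Tuple

# A submitter takes an integration id and returns (task_id, succeeded).
Submit = Callable[[str], Tuple[str, bool]]


def reimport_after_failure(integration_id: str, unmanage: Submit,
                           import_cluster: Submit) -> Dict[str, str]:
    """Submit un-manage, then import only if un-manage succeeded.

    Returns the ids of the tasks that were submitted so that the
    details of both can be shown later. (Hypothetical helper.)
    """
    tasks: Dict[str, str] = {}
    task_id, ok = unmanage(integration_id)
    tasks["unmanage"] = task_id
    if ok:
        task_id, _ = import_cluster(integration_id)
        tasks["import"] = task_id
    return tasks
```

Keeping both task ids in one mapping mirrors the requirement that the UI retains the un-manage task id alongside the import task.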
| === Alternatives | ||
|
|
||
| None | ||
|
|
||
| === Data model impact | ||
|
|
||
| * Rename the fields `import_job_id` and `import_status` to `current_job` and | ||
| `status` respectively for the cluster entity | ||
|
|
||
| * The same fields would be updated with appropriate details during import and | ||
| un-manage flows on the cluster | ||
|
|
||
| * The field `current_job` would maintain a dict containing `status`, `job_name` | ||
| and `job_id` for currently running job on cluster | ||
|
|
||
| * The field `status` would hold one of the values `importing`, `unmanaging`, | ||
| `syncing` or `unknown` at a time. It reflects the status of any flow running on the | ||
| cluster | ||
|
|
||
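A minimal sketch of how the `current_job` and `status` fields might be updated is shown below. It is illustrative only: the cluster entity is modelled as a plain dict, the `start_job` and `finish_job` helpers are hypothetical, and the job name `UnmanageCluster` is assumed from the flow definition. The sketch also assumes that `done` and `failed` are written to the `status` key inside `current_job`, since the cluster-level `status` is listed with a different set of values.

```python
from typing import Any, Dict

# Cluster-level `status` values used while a flow is running.
FLOW_STATUS = {"ImportCluster": "importing", "UnmanageCluster": "unmanaging"}


def start_job(cluster: Dict[str, Any], job_id: str, job_name: str) -> None:
    """Record a newly started flow on the cluster entity."""
    cluster["current_job"] = {
        "job_id": job_id,
        "job_name": job_name,
        "status": "in_progress",
    }
    cluster["status"] = FLOW_STATUS[job_name]


def finish_job(cluster: Dict[str, Any], succeeded: bool) -> None:
    """Mark the current job as done or failed.

    Assumption: `done`/`failed` live in current_job["status"]; the
    cluster-level `status` is left for the caller to update.
    """
    cluster["current_job"]["status"] = "done" if succeeded else "failed"
```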
| === Impacted Modules: | ||
|
|
||
| ==== Tendrl API impact: | ||
|
|
||
| * Introduce an API `clusters/{int-id}/unmanage` for triggering an un-manage | ||
| cluster flow | ||
|
|
||
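As a rough illustration of calling the new endpoint, the Python sketch below builds the request without sending it. It uses the `clusters/{int-id}/unmanage` form of the path that appears in the Testing section. The HTTP method (POST), the absence of a request body, the base URL and the `build_unmanage_request` helper are all assumptions, because the spec names only the path. Authentication is not covered.

```python
import urllib.request


def build_unmanage_request(base_url: str,
                           integration_id: str) -> urllib.request.Request:
    """Build (but do not send) the request that triggers un-manage.

    Assumptions: the method is POST and no body is required.
    """
    url = f"{base_url.rstrip('/')}/clusters/{integration_id}/unmanage"
    return urllib.request.Request(url, method="POST")


# Hypothetical base URL and integration id, for illustration only.
req = build_unmanage_request("http://tendrl.example/api/", "example-int-id")
```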
| ==== Notifications/Monitoring impact: | ||
|
|
||
| * A flow to archive the cluster specific graphite data | ||
|
|
||
| * A flow to remove the grafana alerts dashboards for the cluster and its | ||
| dependent entities | ||
|
|
||
| * Raise an alert once the cluster has been un-managed, with details like where to look | ||
| for old graphite data etc | ||
|
|
||
| ==== Tendrl/common impact: | ||
|
|
||
| * A flow un-manage cluster to be targeted at tendrl server node | ||
|
|
||
| ==== Tendrl/node_agent impact: | ||
|
|
||
| None | ||
|
|
||
| ==== Sds integration impact: | ||
|
|
||
| None | ||
|
|
||
| ==== Tendrl Dashboard impact: | ||
|
|
||
| * The following changes are required in UI dashboards based on UX designs mentioned at | ||
| https://redhat.invisionapp.com/share/8QCOEVEY9 | ||
|
|
||
| ** Add an option namely `Unmanage` under kebab menu for each successfully | ||
| imported and managed cluster | ||
|
|
||
| ** Add a dialog box which opens up on click event of `Unmanage` option from | ||
| kebab menu of the cluster. This dialog box is for confirmation from user to | ||
| start un-manage flow for the cluster | ||
|
|
||
| ===== Workflow | ||
|
|
||
| * User clicks the `Unmanage` option from the kebab menu for a managed cluster | ||
|
|
||
| * The click event triggers a dialog box with appropriate message. A sample | ||
| message is available at | ||
| https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239640 | ||
|
|
||
| * There are 3 possible actions on this dialog | ||
|
|
||
| ** `Close` icon to close the dialog and no action performed for un-managing the | ||
| cluster. User would be directed back to clusters list page | ||
|
|
||
| ** `Cancel` button to close the dialog and no action performed for un-managing the | ||
| cluster. User would be directed back to clusters list page | ||
|
|
||
| ** `Unmanage` button to start the un-manage cluster task in backend. A message | ||
| with task details gets displayed on dialog box. Sample message available at | ||
| https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239844 | ||
|
|
||
| ** This final message after submission of the task for un-managing the cluster would | ||
| also provide a button to view the task details. A button `View Task Progress` is | ||
| available for the same. The user can opt to close this dialog and later use context | ||
| menus to check the task updates | ||
|
|
||
| ** Once a cluster is being moved to un-managed state, the changes in properties | ||
|
Contributor
Is this after cluster moved to un-manage state or just after the task is created and in-progress?
Member
Author
Its while un-manage in progress. @a2batic can you ack this?
Member
@shtripat @nthomas-redhat, its when the task for unmanage has been submitted from UI and API has acknowledged the submission of task. |
||
| listed for cluster are as below | ||
|
|
||
| *** `Import Status` changed to `Unmanaging` | ||
|
|
||
| *** `Is Managed` changed to `no` | ||
|
|
||
| *** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden | ||
|
|
||
| *** `View Details` link would be available to check the task details | ||
|
Member
@a2batic @a2batic @shtripat Will the cluster list API response provides unmanage task_id ? What will be the value of is_managed property after triggering Unmanage action and before the unmanage task gets completed? @a2batic Please add the API details too for eg. - API url, request data, response.
Member
View Details is used to show error list, it is also used to show import task details (as it's importing). See https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628. For the unmanaged task details, it will be similar to the import task details view. Reason is there is no active cluster anymore during an unmanage cluster. |
||
|
|
||
| *** `Dashboard` button would be disabled | ||
|
|
||
|
Member
@julienlim @mcarrano, we added to disable 'Dashboard' button when unmanage is in progress, but design[1] has 'import' button disabled. [1] https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628
Member
After Unmanage action is clicked, you will see get the modal (https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239640) followed by https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239844. Once you close out of the modals, the cluster page now show for the cluster being unmanaged what's shown in https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628 (the row pointed to by annotation #13) or https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738625 (last row in the cluster list). There is no "Dashboard" button as when you unmanage, there should no longer be Dashboard access. |
||
| *** Kebab menu for the un-managed cluster would be hidden | ||
|
|
||
| ** Once the un-manage cluster task completes, a global notification is | ||
| received | ||
|
|
||
| ** If task was successful, the state of the cluster would be changed to ready to | ||
|
Contributor
I assume that cluster data is removed completely from etcd and this cluster detection logic detects it as a fresh cluster
Member
Author
A fresh cluster detected is actually a ready to import cluster |
||
| import | ||
|
|
||
| If the task failed due to some issues, the cluster details would be listed as below | ||
|
|
||
| *** `Import Status` changed to `Unmanage failed` | ||
|
Contributor
Suppose it failed half-way through, what is way forward?
Member
Author
Ideally user would be allowed to execute un-manage again for the cluster (as discussed in 13th Feb 2018 arch call)
Member
@shtripat @nthomas-redhat @julienlim @mcarrano , user will be able to execute unmanage again, but before the user confirms unmanage, there should be a popup saying "The cluster unamange has been failed once with the <job_id>, it is recommended to resolve the error and then unmanage" or something similar ? |
||
|
|
||
| *** `Is managed` changed to `no` | ||
|
|
||
| *** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden | ||
|
|
||
| *** `View Details` link would be available to check the errors | ||
|
|
||
| *** `Dashboard` button would be disabled | ||
|
|
||
| *** Kebab menu for the un-managed cluster would be hidden | ||
|
|
||
| * If a previous import failed or the cluster is in a mis-configured state after import | ||
| (import failed with the errors field not populated for the cluster), both the import and | ||
| un-manage options would be enabled in the UI. If the user selects the import | ||
| option now, it lands in the import cluster view/page. If a previous import | ||
| failed, a modal dialog shows up with a message like `Import | ||
| cluster previously failed with <job_id>. Before import, you need to correct the | ||
| issues and then un-manage the cluster`. This dialog has `Ok` and `Cancel` | ||
| buttons. | ||
|
|
||
| * If un-manage fails, it would provide a tooltip/info with failure message `If | ||
| un-manage fails, resolve the issue and then try un-manage cluster again`. It | ||
| would show a message to say `Unmanage Cluster` failed having a `View Details` | ||
| hyperlink in the cluster list view. | ||
|
|
||
|
|
||
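The cluster list row properties described in the workflow above, for a cluster whose un-manage task is in progress or has failed, can be summarised in a small Python sketch. The `unmanage_row_state` function and the key names of the returned dict are hypothetical. The values follow the spec text as written and do not try to settle the open review discussion about the `Dashboard` button.

```python
from typing import Dict, List, Union

HIDDEN_COLUMNS: List[str] = ["Volume Profiling", "Volumes", "Alerts"]


def unmanage_row_state(
        job_status: str) -> Dict[str, Union[str, bool, List[str]]]:
    """Return the row properties for a cluster with an un-manage task.

    `job_status` is "in_progress" or "failed"; other values are
    rejected because only these two rows are described.
    """
    labels = {"in_progress": "Unmanaging", "failed": "Unmanage failed"}
    if job_status not in labels:
        raise ValueError(f"unsupported job status: {job_status}")
    return {
        "import_status": labels[job_status],
        "is_managed": "no",
        "hidden_columns": HIDDEN_COLUMNS,
        "view_details_link": True,
        "dashboard_enabled": False,
        "kebab_menu_visible": False,
    }
```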
| === Security impact: | ||
|
|
||
| None | ||
|
|
||
| === Other end user impact: | ||
|
|
||
| User gets an option to un-manage an existing cluster and can re-import at later | ||
| stage | ||
|
|
||
| === Performance impact: | ||
|
|
||
| None | ||
|
|
||
| === Other deployer impact: | ||
|
|
||
| The tendrl-ansible module needs to provide a mechanism to set up tendrl components | ||
| and dependencies on an additional new node in the cluster. | ||
|
|
||
| <TBD> details of the playbooks etc. to be added here. | ||
|
|
||
| === Developer impact: | ||
|
|
||
| None | ||
|
|
||
|
|
||
| == Implementation: | ||
|
|
||
| * https://github.com/Tendrl/commons/issues/797 | ||
|
|
||
|
|
||
| === Assignee(s): | ||
|
|
||
| Primary assignee: | ||
| shtripat | ||
| mbukatov | ||
| a2batic | ||
|
|
||
| === Work Items: | ||
|
|
||
| * https://github.com/Tendrl/specifications/issues/252 | ||
|
|
||
|
|
||
| == Dependencies: | ||
|
|
||
| * https://github.com/Tendrl/api/issues/349 | ||
|
|
||
| == Testing: | ||
|
|
||
| * Check if UI dashboard has an option to trigger un-manage cluster flow | ||
|
|
||
| * Check if the flow gets completed successfully and verify that the grafana | ||
| dashboard no longer shows details for the selected cluster | ||
|
|
||
| * Verify that no grafana alert dashboards are available now for the un-managed | ||
| cluster | ||
|
|
||
| * Verify that the clusters list reports the cluster as un-managed and the import | ||
| option is enabled now | ||
|
|
||
| * Try to import the cluster back and it should be successful. All grafana | ||
| dashboards, grafana alert dashboards and UI reflect the cluster details back | ||
|
|
||
| * Invoke the REST end point `clusters/{int-id}/unmanage` and the cluster should | ||
| be un-managed successfully | ||
|
|
||
| * On un-manage cluster completion, the alert dashboards in grafana would vanish | ||
| for the entities of the cluster like volume, bricks etc. Verify to make sure the | ||
| same happens as expected | ||
|
|
||
| * Once cluster is un-managed the details of the cluster would vanish from | ||
| dashboards in grafana. Verify the same happens as expected | ||
|
|
||
| * Verify that the final alert post un-manage flow, tells about removal of | ||
| details from grafana dashboards and grafana alert dashboards | ||
|
|
||
| * Verify the scenario when a cluster import fails and the user is able to start | ||
| an un-manage + reimport cluster option from the UI. The UI should be able to list details | ||
| of both the tasks in this scenario | ||
|
|
||
|
There's one other scenario which should be tested and that's when un-manage fails.
There should be also tested if services |
||
|
|
||
| == Documentation impact: | ||
|
|
||
| * New un-manage cluster feature should be documented with details like what all | ||
| gets disabled / removed in case a cluster is un-managed | ||
|
|
||
| * New API end point should be documented with sample input / output structures | ||
|
|
||
| * The expected behavior post un-manage call in grafana dashboards should be | ||
| clearly mentioned in documents | ||
|
|
||
| == References: | ||
|
|
||
| * https://redhat.invisionapp.com/share/8QCOEVEY9 | ||
|
|
||
| * https://github.com/Tendrl/commons/pull/798 | ||
|
|
||
| * https://github.com/Tendrl/monitoring-integration/pull/317 | ||
|
|
||
| * https://github.com/Tendrl/ui/issues/801 | ||
We are deleting the cluster data also from the central store, please mention that as well
Will add a line talking about the same