373 changes: 373 additions & 0 deletions specs/unmanage_cluster.adoc
@@ -0,0 +1,373 @@
= Introduce an un-manage cluster mechanism in tendrl

The intent of this change is to introduce un-manage cluster functionality in
tendrl. This leaves the cluster known to tendrl but no longer managed, meaning
that monitoring, alerting and management of the cluster are no longer possible
from tendrl. At a later stage (if required) the admin can decide to re-import
the cluster to start managing it again.

The un-manage functionality is helpful in scenarios where the admin wants to
bring down the cluster for critical maintenance activities and doesn't want
monitoring etc. to be performed during that period.

Also, when a cluster import fails, the user might need to resolve the issues
reported by the failed import and then re-import the cluster. This flow needs
an un-manage of the cluster first and then a fresh import of the cluster.

== Problem description

There are situations when the admin needs to perform critical maintenance on
the cluster and doesn't want any monitoring etc. taking place during this
period. Also, if the admin decides to dismantle the cluster at some stage, there
should be a mechanism to mark the cluster as un-managed from the tendrl side.

Tendrl should also provide a way to re-import the cluster at a later stage if
the admin wants. The process should be seamless and require little or no manual
intervention.

In case there is a failure in import cluster, tendrl needs to provide an option
to un-manage and import the cluster again.


== Use Cases

This addresses un-managing a cluster and re-importing an un-managed cluster at a
later stage. The un-manage functionality in tendrl needs to take care of the
following:

* Stop any services which got started as part of tendrl managing the storage
nodes and disable the services

* Set the cluster state properly so that the same is marked and listed as
un-managed in UI dashboards. No operations should be allowed on the un-managed
cluster and there should not be any monitoring, alerting or entities management
supported on this cluster anymore

* User should have an option to re-import the cluster if needed later and it
should seamlessly work as usual

* User should have an option to un-manage a cluster whose import failed and
import it again in tendrl


== Proposed change

* On un-manage cluster, start a flow in the tendrl server node's node-agent
which creates child jobs on the storage nodes to stop tendrl-specific services
like collectd and tendrl-gluster-integration

* Mark the cluster flag `is_managed` as `False` so that the cluster could be
listed as un-managed in UI dashboards and all the possible actions could be
disabled for it

Contributor: We are deleting the cluster data also from the central store, please mention that as well

Member Author: Will add a line talking about the same

* Delete cluster entity details from tendrl central store

* Archive the graphite (monitoring) data for the cluster in an archive location
so the grafana dashboards don't list the cluster and its entities anymore

* Delete the grafana alert dashboards for the cluster and its dependent entities

The logic goes as follows:

** Start a flow in node-agent on tendrl server node for un-manage cluster

** The first atom of the above flow invokes child jobs on the storage nodes'
node-agents to stop tendrl-specific services and mark them disabled

** The main atom of the un-manage cluster flow removes any etcd details for
the cluster and then marks the cluster `is_managed` flag as `False`

** One of the atoms of the un-manage cluster flow then invokes a flow in
monitoring-integration to archive the graphite data for the cluster

** Finally another atom invokes a flow in monitoring-integration to remove the
grafana alert dashboards for the cluster and its dependent entities

So the structure of the un-manage cluster flow would look something like this:

```
UnmanageCluster:
  tags:
    - "tendrl/monitor"
  atoms:
    - tendrl.objects.Cluster.atoms.StopMonitoringServices
    - tendrl.objects.Cluster.atoms.StopIntegrationServices
    - tendrl.objects.Cluster.atoms.DeleteClusterDetails
    - tendrl.objects.Cluster.atoms.DeleteMonitoringDetails
  help: "Unmanage a Gluster Cluster"
  enabled: true
  inputs:
    mandatory:
      - TendrlContext.integration_id
  run: tendrl.flows.UnmanageCluster
  type: Update
  uuid: 2f94a48a-05d7-408c-b400-e27827f4efed
  version: 1
```
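
The flow definition above can be read as an ordered pipeline of atoms. The following is a minimal Python sketch of that ordering, not tendrl's actual flow engine. The `Atom` signature, the `run_flow` helper, the no-op atoms and the stop-on-first-failure behaviour are all assumptions made for illustration; only the atom names and the mandatory `TendrlContext.integration_id` input come from the definition.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical atom signature: receives the flow parameters, returns True on success.
Atom = Callable[[Dict[str, str]], bool]


def run_flow(atoms: List[Tuple[str, Atom]],
             params: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Run atoms in the listed order and report which ones were executed.

    Stopping at the first failing atom is an assumption of this sketch.
    """
    if "TendrlContext.integration_id" not in params:
        raise ValueError("mandatory input TendrlContext.integration_id is missing")
    executed: List[str] = []
    for name, atom in atoms:
        executed.append(name)
        if not atom(params):
            return False, executed
    return True, executed


def _noop(params: Dict[str, str]) -> bool:
    """Placeholder atom body; real atoms would stop services, delete data, etc."""
    return True


# Atom order taken from the flow definition; the bodies are placeholders.
UNMANAGE_ATOMS: List[Tuple[str, Atom]] = [
    ("StopMonitoringServices", _noop),
    ("StopIntegrationServices", _noop),
    ("DeleteClusterDetails", _noop),
    ("DeleteMonitoringDetails", _noop),
]
```

Calling `run_flow(UNMANAGE_ATOMS, {"TendrlContext.integration_id": "<integration id>"})` returns a success flag and the list of atom names in execution order.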

* While the import flow is in progress, the values of `current_job` and `status`
should be set to `{'job_id': 'import job id', 'job_name': 'ImportCluster',
'status': 'in_progress'}` and `Importing` respectively

* Once import flow is successful the value of `status` would be set as `done`

* If import flow fails the value of `status` would be set as `failed`

* While the un-manage flow is in progress, the values of `current_job` and
`status` should be set to `{'job_id': 'unmanage job id', 'job_name':
'UnmanageCluster', 'status': 'in_progress'}` and `Unmanaging` respectively

* Once un-manage flow is successful the value of `status` would be set as `done`

* If un-manage flow fails the value of `status` would be set as `failed`
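
The status bookkeeping described in the bullets above can be sketched as a small context manager. This is an illustrative sketch only: `track_job` and the plain-dict cluster record are hypothetical, and it assumes that `done` and `failed` refer to the `status` key inside `current_job`, which the text leaves open.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator


@contextmanager
def track_job(cluster: Dict[str, Any], job_id: str,
              job_name: str, status: str) -> Iterator[None]:
    """Mark a flow as in progress, then record `done` or `failed` when it ends."""
    cluster["current_job"] = {
        "job_id": job_id,
        "job_name": job_name,
        "status": "in_progress",
    }
    cluster["status"] = status
    try:
        yield
    except Exception:
        # Assumption: a failing flow surfaces as an exception.
        cluster["current_job"]["status"] = "failed"
        raise
    else:
        cluster["current_job"]["status"] = "done"


# Usage with the import flow values given in the text.
cluster: Dict[str, Any] = {}
with track_job(cluster, "import job id", "ImportCluster", "Importing"):
    pass  # the import work would run here
```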

* If a cluster import fails, the tendrl UI needs to keep the import cluster
option open. If the user selects the option, the UI should show a dialog
describing the previous import failure. If the user confirms to go ahead with
un-manage and then import of the cluster, the UI should submit an un-manage
cluster task first. If the un-manage cluster task succeeds, the UI should then
submit an import for the same cluster

Reviewer: I have a question here. If the user chooses to click on the import button again, how will the View Details link work? Will it link first to the Task Details of the un-manage task and later to the details of the import task?

Member: @ltrilety
If a user chooses to click on the Import button, View Details will show the link to the last task run.

If an unmanage ran successfully, then import ran successfully/unsuccessfully, it will show the import.

If an unmanage did not run successfully (then import does not run), it will show the unmanage.


* The UI needs a client-side storage option to retain the previous un-manage
cluster task-id for reference and for showing the task details in the UI

* So if an import has failed for a cluster and the user tries to import the
cluster again, then after user confirmation the UI submits two tasks one after
the other: first un-manage cluster and, after it succeeds, import cluster. The
UI should retain the details of both tasks for display
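
The two-task sequence for re-importing after a failed import can be sketched as below. The real logic would live in the UI; this Python sketch only illustrates the ordering. The function `reimport_after_failure` and the injected `submit_unmanage`, `submit_import` and `wait_for` callables are hypothetical stand-ins for the API calls and task polling.

```python
from typing import Callable, Dict, Optional

SubmitFn = Callable[[str], str]   # hypothetical: submits a task, returns its task id
WaitFn = Callable[[str], bool]    # hypothetical: waits for a task, True on success


def reimport_after_failure(cluster_id: str,
                           submit_unmanage: SubmitFn,
                           submit_import: SubmitFn,
                           wait_for: WaitFn) -> Dict[str, Optional[str]]:
    """Submit un-manage first and import only if un-manage succeeded.

    Both task ids are kept so that the details of each task can be shown.
    """
    tasks: Dict[str, Optional[str]] = {
        "unmanage_task_id": None,
        "import_task_id": None,
    }
    tasks["unmanage_task_id"] = submit_unmanage(cluster_id)
    if not wait_for(tasks["unmanage_task_id"]):
        # Un-manage failed, so the import is not submitted.
        return tasks
    tasks["import_task_id"] = submit_import(cluster_id)
    return tasks
```

Returning both ids from one call mirrors the requirement that the UI retains the un-manage task id alongside the import task id.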


=== Alternatives

None

=== Data model impact

* Rename the fields `import_job_id` and `import_status` to `current_job` and
`status` respectively for the cluster entity

* The same fields would be updated with appropriate details during import and
un-manage flows on the cluster

* The field `current_job` would hold a dict containing `status`, `job_name`
and `job_id` for the job currently running on the cluster

* The field `status` would hold one value at a time, such as `importing`,
`unmanaging`, `syncing` or `unknown`. It reflects the status of any flow running
on the cluster
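
As an illustration of the two fields described above, the following sketch models them as a Python dataclass with basic validation. The class name `ClusterJobState` is hypothetical, and treating the four listed `status` values as the complete set is an assumption (the text only says "values like").

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumption: the four values named in the spec are treated as the full set.
ALLOWED_STATUS = {"importing", "unmanaging", "syncing", "unknown"}
JOB_KEYS = {"status", "job_name", "job_id"}


@dataclass
class ClusterJobState:
    """Hypothetical container for the `status` and `current_job` fields."""
    status: str = "unknown"
    current_job: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"unexpected status: {self.status!r}")
        # An empty dict means no job is currently running on the cluster.
        if self.current_job and set(self.current_job) != JOB_KEYS:
            raise ValueError("current_job must contain status, job_name and job_id")


state = ClusterJobState(
    status="importing",
    current_job={"job_id": "import job id",
                 "job_name": "ImportCluster",
                 "status": "in_progress"},
)
```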

=== Impacted Modules:

==== Tendrl API impact:

* Introduce an API `clusters/{int-id}/unmanage` for triggering an un-manage
cluster flow
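
A client call to this endpoint might be built as in the sketch below. The spec does not state the HTTP method, the base URL or the authentication scheme, so `POST`, the example base URL and the bearer-token header are all assumptions, and `build_unmanage_request` is a hypothetical helper. The path follows the `clusters/{int-id}/unmanage` form used in the Testing section.

```python
import urllib.request


def build_unmanage_request(base_url: str, integration_id: str,
                           token: str) -> urllib.request.Request:
    """Build (but do not send) a request that would trigger the un-manage flow."""
    url = f"{base_url.rstrip('/')}/clusters/{integration_id}/unmanage"
    return urllib.request.Request(
        url,
        data=b"",                                      # assumption: empty request body
        method="POST",                                 # assumption: not stated in the spec
        headers={"Authorization": f"Bearer {token}"},  # assumption: bearer-token auth
    )


# Hypothetical server address and token, for illustration only.
req = build_unmanage_request("http://tendrl-server.example.com/api/1.0",
                             "2f94a48a", "example-token")
```

The request is only constructed here; passing it to `urllib.request.urlopen` would actually send it.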

==== Notifications/Monitoring impact:

* A flow to archive the cluster specific graphite data

* A flow to remove the grafana alerts dashboards for the cluster and its
dependent entities

* Raise an alert once the cluster is un-managed, with details such as where to
look for the old graphite data

==== Tendrl/common impact:

* An un-manage cluster flow, targeted at the tendrl server node

==== Tendrl/node_agent impact:

None

==== Sds integration impact:

None

==== Tendrl Dashboard impact:

* The following changes are required in UI dashboards, based on the UX designs
at https://redhat.invisionapp.com/share/8QCOEVEY9

** Add an option namely `Unmanage` under kebab menu for each successfully
imported and managed cluster

** Add a dialog box which opens up on click event of `Unmanage` option from
kebab menu of the cluster. This dialog box is for confirmation from user to
start un-manage flow for the cluster

===== Workflow

* User clicks the `Unmanage` option from the kebab menu for a managed cluster

* The click event triggers a dialog box with appropriate message. A sample
message is available at
https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239640

* There are 3 possible actions on this dialog

** `Close` icon to close the dialog and no action performed for un-managing the
cluster. User would be directed back to clusters list page

** `Cancel` button to close the dialog and no action performed for un-managing the
cluster. User would be directed back to clusters list page

** `Unmanage` button to start the un-manage cluster task in backend. A message
with task details gets displayed on dialog box. Sample message available at
https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239844

** This final message after submission of the task for un-managing the cluster
also provides a `View Task Progress` button to view the task details. The user
can opt to close this dialog and later use context menus to check the task
updates

** Once a cluster is being moved to the un-managed state, the changes in the
properties listed for the cluster are as below

Contributor: Is this after the cluster has moved to the un-managed state, or just after the task is created and in progress?

Member Author: It's while un-manage is in progress. @a2batic can you ack this?

Member: @shtripat @nthomas-redhat, it's when the task for unmanage has been submitted from the UI and the API has acknowledged the submission of the task.

*** `Import Status` changed to `Unmanaging`

*** `Is Managed` changed to `no`

*** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden

*** `View Details` link would be available to check the task details

Member: @a2batic The View Details link is used for showing the error list. The link should be named Track Progress, similar to the Import task detail page, and should take the user to the unmanage task detail page. @julienlim @mcarrano Where will the unmanage task detail page be shown in the UI? Will it be similar to the import task detail view, or do we need to create a global task list view to display all the global tasks?

@a2batic @shtripat Will the cluster list API response provide the unmanage task_id? What will be the value of the is_managed property after triggering the Unmanage action and before the unmanage task completes? @a2batic Please add the API details too, e.g. API URL, request data, response.

Member (a2batic, Feb 7, 2018): @gnehapk, yes, it should be Track Progress.
@gnehapk, the value for is_managed should be 'no' after triggering the unmanage action and before the unmanage task completes. I will add the API details.

Member: @gnehapk @a2batic @mcarrano

View Details is used to show the error list; it is also used to show import task details (as it's importing). See https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628.

For the unmanage task details, it will be similar to the import task details view. The reason is that there is no active cluster anymore during an unmanage cluster.

*** `Dashboard` button would be disabled

Member: @julienlim @mcarrano, we added disabling of the 'Dashboard' button when unmanage is in progress, but the design[1] has the 'import' button disabled.

[1] https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628

Member: @a2batic @mcarrano

After the Unmanage action is clicked, you will get the modal (https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239640) followed by https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239844.

Once you close out of the modals, the cluster page now shows, for the cluster being unmanaged, what's shown in https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738628 (the row pointed to by annotation #13) or https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244738625 (last row in the cluster list).

There is no "Dashboard" button since, when you unmanage, there should no longer be Dashboard access.

*** Kebab menu for the un-managed cluster would be hidden

** Once the un-manage cluster task completes, a global notification is
received

** If the task was successful, the state of the cluster would be changed to
ready to import

Contributor: I assume that cluster data is removed completely from etcd and the cluster detection logic detects it as a fresh cluster

Member Author: A fresh cluster detected is actually a ready to import cluster

** If the task failed due to some issues, the cluster details would be listed as
below

*** `Import Status` changed to `Unmanage failed`

Contributor: Suppose it failed half-way through, what is the way forward?

Member Author: Ideally the user would be allowed to execute un-manage again for the cluster (as discussed in the 13th Feb 2018 arch call)

Member: @shtripat @nthomas-redhat @julienlim @mcarrano, the user will be able to execute unmanage again, but before the user confirms unmanage, should there be a popup saying "The cluster unmanage has failed once with the <job_id>, it is recommended to resolve the error and then unmanage" or something similar?
What are your thoughts?


*** `Is managed` changed to `no`

*** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden

*** `View Details` link would be available to check the errors

*** `Dashboard` button would be disabled

*** Kebab menu for the un-managed cluster would be hidden

* If a previous import failed or the cluster is in a mis-configured state after
import (import failed with the errors field not populated for the cluster), both
the import and un-manage options would be enabled in the UI. If the user now
selects the import option, it lands on the import cluster view/page. If a
previous import failed, a modal dialog shows up with a message something like
`Import cluster previously failed with <job_id>. Before import, you need to
correct the issues and then un-manage the cluster`. This dialog has `Ok` and
`Cancel` buttons.

* If un-manage fails, the UI would provide a tooltip/info with the failure
message `If un-manage fails, resolve the issue and then try un-manage cluster
again`. It would show a message saying `Unmanage Cluster` failed, with a
`View Details` hyperlink, in the cluster list view.
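
The cluster-list row rules listed in this workflow, for a cluster whose un-manage task is in progress or has failed, can be summarised as a pure function. The UI itself is not written in Python; this sketch only captures the rules as stated in the spec text. The function name `unmanage_row_view`, the `task_state` values and the dictionary keys are hypothetical.

```python
from typing import Dict

# Columns the spec says are hidden while un-managing or after un-manage fails.
HIDDEN_COLUMNS = ("Volume Profiling", "Volumes", "Alerts")

# Hypothetical task-state keys mapped to the `Import Status` text from the spec.
STATUS_LABELS = {"in_progress": "Unmanaging", "failed": "Unmanage failed"}


def unmanage_row_view(task_state: str) -> Dict[str, object]:
    """Return the row properties for a cluster during or after a failed un-manage."""
    if task_state not in STATUS_LABELS:
        raise ValueError(f"no row rules defined for task state {task_state!r}")
    return {
        "Import Status": STATUS_LABELS[task_state],
        "Is Managed": "no",
        "hidden_columns": HIDDEN_COLUMNS,
        "view_details_link": True,          # task details, or errors on failure
        "dashboard_button_enabled": False,
        "kebab_menu_visible": False,
    }
```

The successful case is left out on purpose: the spec only says the cluster becomes ready to import and does not list its row properties.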


=== Security impact:

None

=== Other end user impact:

User gets an option to un-manage an existing cluster and can re-import it at a
later stage

=== Performance impact:

None

=== Other deployer impact:

The tendrl-ansible module needs to provide a mechanism to set up tendrl
components and dependencies on an additional new node in the cluster.

<TBD> details of the playbooks etc. to be added here.

=== Developer impact:

None


== Implementation:

* https://github.com/Tendrl/commons/issues/797


=== Assignee(s):

Primary assignee:
shtripat
mbukatov
a2batic

=== Work Items:

* https://github.com/Tendrl/specifications/issues/252


== Dependencies:

* https://github.com/Tendrl/api/issues/349

== Testing:

* Check if UI dashboard has an option to trigger un-manage cluster flow

* Check that the flow completes successfully and verify that the grafana
dashboard no longer shows details for the selected cluster

* Verify that no grafana alert dashboards are available now for the un-managed
cluster

* Verify that the clusters list reports the cluster as un-managed and the import
option is enabled now

* Try to import the cluster back and it should be successful. All grafana
dashboards, grafana alert dashboards and UI reflect the cluster details back

* Invoke the REST end point `clusters/{int-id}/unmanage` and the cluster should
be un-managed successfully

* On un-manage cluster completion, the alert dashboards in grafana would vanish
for the entities of the cluster like volume, bricks etc. Verify to make sure the
same happens as expected

* Once cluster is un-managed the details of the cluster would vanish from
dashboards in grafana. Verify the same happens as expected

* Verify that the final alert post un-manage flow, tells about removal of
details from grafana dashboards and grafana alert dashboards

* Verify the scenario where a cluster import fails and the user is able to start
an un-manage + re-import of the cluster from the UI. The UI should be able to
list the details of both tasks in this scenario

Reviewer: There's one other scenario which should be tested, and that's when un-manage fails.
Moreover, a check of the View Details link should be here too: it should be present during the un-manage task run and if it fails. It should provide almost the same as for import.

Reviewer: It should also be tested that the services collectd and tendrl-gluster-integration are stopped and disabled on the hosts of the unmanaged cluster.
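
The service check suggested above could be sketched as follows, to be run on each storage node. This is a minimal sketch that assumes the nodes use systemd and that the unit names match the service names `collectd` and `tendrl-gluster-integration`; the helper names are hypothetical. `systemctl is-active` and `systemctl is-enabled` print the unit state on standard output.

```python
import subprocess
from typing import Callable, Dict, Sequence

# Assumption: systemd unit names match the service names used in the spec.
SERVICES = ("collectd", "tendrl-gluster-integration")


def _systemctl(args: Sequence[str]) -> str:
    """Run `systemctl <args>` and return its trimmed standard output."""
    result = subprocess.run(["systemctl", *args],
                            capture_output=True, text=True, check=False)
    return result.stdout.strip()


def services_stopped_and_disabled(
        run: Callable[[Sequence[str]], str] = _systemctl) -> Dict[str, bool]:
    """Map each service to True if it is neither active nor enabled."""
    report: Dict[str, bool] = {}
    for service in SERVICES:
        active = run(["is-active", service])
        enabled = run(["is-enabled", service])
        report[service] = active != "active" and not enabled.startswith("enabled")
    return report
```

The `run` parameter is injectable so the check can be exercised with a fake runner in unit tests without touching real services.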


== Documentation impact:

* New un-manage cluster feature should be documented with details like what all
gets disabled / removed in case a cluster is un-managed

* New API end point should be documented with sample input / output structures

* The expected behavior post un-manage call in grafana dashboards should be
clearly mentioned in documents

== References:

* https://redhat.invisionapp.com/share/8QCOEVEY9

* https://github.com/Tendrl/commons/pull/798

* https://github.com/Tendrl/monitoring-integration/pull/317

* https://github.com/Tendrl/ui/issues/801