Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dashboard supports removing of unhealth machines #168

Merged
merged 1 commit into from
Mar 4, 2019

Conversation

jasonjoo2010
Copy link
Collaborator

@jasonjoo2010 jasonjoo2010 commented Oct 10, 2018

New Features

This PR includes:

  • Remove unhealthy machine manually
  • Auto removal support for dead machine
  • Auto hidden for app if one have no healthy machine after some time(configurable)
  • Auto removal for app if hidden status lasts(configurable).

Four New Configurations Introduced

name type default minimum value description
sentinel.dashboard.app.hideAppNoMachineMillis Integer 0 60000 Whether should hide apps having no healthy machine. The milliseconds from the last time heard from machine(s). Default to no(0).
sentinel.dashboard.removeAppNoMachineMillis Integer 0 120000 Whether should remove apps having no healthy machine. The milliseconds from the last time heard from machine(s). Default to no(0).
sentinel.dashboard.unhealthyMachineMillis Integer 60000 30000 Whether a machine is healthy or not. The milliseconds from the last time heard from the machine. Default to 300 seconds and should not be less than 60 seconds.
sentinel.dashboard.autoRemoveMachineMillis Integer 0 300000 Whether should remove unhealthy machines. The milliseconds from the last time heard from the machine. Default to no(0).

Configuration Sources

Configurations support System.getProperty() and System.getenv(). ENV has the priority.
Because for env variables .(dot) is not allowed in name it should be replaced into _.

Examples:

Shell

java -Dsentinel.dashboard.app.hideAppNoMachineMillis=60000

Java

System.setProperty("sentinel.dashboard.app.hideAppNoMachineMillis", "60000");

ENV

sentinel_dashboard_app_hideAppNoMachineMillis=60000

@sczyh30 sczyh30 added the to-review To review label Oct 10, 2018
@codecov-io
Copy link

codecov-io commented Oct 10, 2018

Codecov Report

Merging #168 into master will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #168      +/-   ##
============================================
+ Coverage     37.69%   37.72%   +0.03%     
- Complexity     1102     1103       +1     
============================================
  Files           259      259              
  Lines          8174     8174              
  Branches       1113     1113              
============================================
+ Hits           3081     3084       +3     
+ Misses         4695     4693       -2     
+ Partials        398      397       -1
Impacted Files Coverage Δ Complexity Δ
...a/csp/sentinel/slots/statistic/base/LeapArray.java 69.62% <0%> (+3.79%) 25% <0%> (+1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b54461b...3ead8f7. Read the comment docs.

@sczyh30
Copy link
Member

sczyh30 commented Oct 12, 2018

Hi, we've updated the dashboard (#176) which brings some changes. You'll need to resolve the conflicts and verify it.

@CLAassistant
Copy link

CLAassistant commented Oct 12, 2018

CLA assistant check
All committers have signed the CLA.

@jasonjoo2010
Copy link
Collaborator Author

Hi, we've updated the dashboard (#176) which brings some changes. You'll need to resolve the conflicts and verify it.

merged

@jasonjoo2010
Copy link
Collaborator Author

Hi, we've updated the dashboard (#176) which brings some changes. You'll need to resolve the conflicts and verify it.

And what about issue #179 ? Adding more API is better?

@sczyh30
Copy link
Member

sczyh30 commented Jan 28, 2019

@jasonjoo2010 Some more considerations should be taken. Would you like to work on it? See here for more information: #446 (comment)

@jasonjoo2010
Copy link
Collaborator Author

@jasonjoo2010 Some more considerations should be taken. Would you like to work on it? See here for more information: #446 (comment)

Sorry for the latency.
Should i reopen it?

This kind of PR conflicts easily each other.

@sczyh30
Copy link
Member

sczyh30 commented Feb 25, 2019

@jasonjoo2010 Some more considerations should be taken. Would you like to work on it? See here for more information: #446 (comment)

Sorry for the latency.
Should i reopen it?

This kind of PR conflicts easily each other.

Feel free to reopen it :)
Some more improvements are listed in #446 (comment) and you can refer to that.

@jasonjoo2010 jasonjoo2010 reopened this Feb 25, 2019
@jasonjoo2010
Copy link
Collaborator Author

@sczyh30

  • When an application has no connected machines in machine list, this application might not be displayed in the app list (sidebar). This should also be configurable.

Is it auto hidden or auto removal? Which one is better?

Auto removal will take advantage in resource cost.
Or we can take hidden first then removal after specific time(eg. 2 hours later)

@jasonjoo2010
Copy link
Collaborator Author

@sczyh30

And i have one more question:

A timestamp from node end is submitted to dashboard when heart beating. Dashboard saves it to MachineInfo and converts it into Date.

My questions are:

  1. Why converting to Date? Millisecond is simple enough.
  2. Timestamp from Node is quite different from Dashboard's local. And some logics like healthy or others compare them together. I think it's not a good practice.

@sczyh30
Copy link
Member

sczyh30 commented Feb 26, 2019

@sczyh30

  • When an application has no connected machines in machine list, this application might not be displayed in the app list (sidebar). This should also be configurable.

Is it auto hidden or auto removal? Which one is better?

Auto removal will take advantage in resource cost.
Or we can take hidden first then removal after specific time(eg. 2 hours later)

Both are okay. The later solution might be better. As some timeout passed the old AppInfo can be evicted.

@sczyh30
Copy link
Member

sczyh30 commented Feb 26, 2019

And i have one more question:

A timestamp from node end is submitted to dashboard when heart beating. Dashboard saves it to MachineInfo and converts it into Date.

My questions are:

  1. Why converting to Date? Millisecond is simple enough.
  2. Timestamp from Node is quite different from Dashboard's local. And some logics like healthy or others compare them together. I think it's not a good practice.

For 1, it might be a legacy design. I've reviewed the code where used the timestamp and found that using ms directly is enough.
For 2, in current design, the system time of the dashboard should match the time of the service machine, or metrics cannot be retrieved normally. Could you please give some suggestions?

@jasonjoo2010
Copy link
Collaborator Author

And i have one more question:
A timestamp from node end is submitted to dashboard when heart beating. Dashboard saves it to MachineInfo and converts it into Date.
My questions are:

  1. Why converting to Date? Millisecond is simple enough.
  2. Timestamp from Node is quite different from Dashboard's local. And some logics like healthy or others compare them together. I think it's not a good practice.

For 1, it might be a legacy design. I've reviewed the code where used the timestamp and found that using ms directly is enough.
For 2, in current design, the system time of the dashboard should match the time of the service machine, or metrics cannot be retrieved normally. Could you please give some suggestions?

For consistent design we'd better compare objects having exactly the same reference if there is no specific reason. And after some reading i found the timestamp is not just timestamp nor date but indicating a heartbeat version. So i rename it into heartbeatVersion and add a new property lastHeartbeat indicating just the last heartbeat packet's arrival time.

And add new methods to MachineInfo and AppInfo instead of duplicated in current system doing dead machine judgement everywhere(angularjs, java).

I add four new configuration as the description said. You can check it before the PR updated.

@jasonjoo2010 jasonjoo2010 force-pushed the machineRemove branch 3 times, most recently from dcc75bf to cfd36b9 Compare February 26, 2019 10:48
@jasonjoo2010
Copy link
Collaborator Author

@sczyh30 All changes have been submitted

@sczyh30 sczyh30 added the to-review To review label Feb 26, 2019
@jasonjoo2010 jasonjoo2010 force-pushed the machineRemove branch 2 times, most recently from 3ce0813 to b7867ef Compare March 4, 2019 01:43
@sczyh30 sczyh30 added area/dashboard Issues or PRs about Sentinel Dashboard size/XL Indicate a PR that changes 500-999 lines. labels Mar 4, 2019
Copy link
Member

@sczyh30 sczyh30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sczyh30 sczyh30 merged commit 737747a into alibaba:master Mar 4, 2019
@sczyh30 sczyh30 removed the to-review To review label Mar 4, 2019
@sczyh30
Copy link
Member

sczyh30 commented Mar 4, 2019

Awesome, thanks for contributing!

@sczyh30 sczyh30 added this to the 1.5.0 milestone Mar 4, 2019
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
…ception

Author: lindzh <linsony0@163.com>

Closes alibaba#168 from lindzh/fix_print.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dashboard Issues or PRs about Sentinel Dashboard size/XL Indicate a PR that changes 500-999 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants