-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding core changes for API to check per awareness attribute health #5231
Adding core changes for API to check per awareness attribute health #5231
Conversation
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #5231 +/- ##
============================================
- Coverage 71.09% 70.94% -0.15%
+ Complexity 58499 58476 -23
============================================
Files 4748 4751 +3
Lines 278818 279104 +286
Branches 40296 40341 +45
============================================
- Hits 198219 198015 -204
- Misses 64477 64918 +441
- Partials 16122 16171 +49
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes in rest-api-spec is missing
|
||
@Override | ||
public List<Route> routes() { | ||
return List.of(new Route(GET, "/_cluster/ha_health")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HA sounds like too specific and many users might not relate with it. Shall we call it verbose_health
or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did finalize this during the discussions however ha sounds better to me . I will let bukhtawar and gaurav comment on this if required I can change that,.
|
||
@Override | ||
protected void clusterManagerOperation( | ||
final Task task, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this task used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have kept that if we need for future use and on debug level we can log the taskID if needed
* | ||
* @opensearch.internal | ||
*/ | ||
public class ClusterHaHealthRequest extends ClusterManagerNodeReadRequest<ClusterHaHealthRequest> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rather extend from ClusterHealthRequest
? That might help to get the capabilities of generic health requests like specifying indices, wait for status, etc. Although those extensions can be taken up in a subsequent PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so but than segregating it seems better because if tomorrow we need to add new capabilities to this API itself would be better and more easy. So thinking in terms of single responsibility model per class this seems better to me.
|
||
private static final String CLUSTER_NAME = "clusterName"; | ||
private static final String STATUS = "domainHealth"; | ||
private static final String NUMBER_OF_NODES = "totalNodes"; | ||
private static final String NUMBER_OF_DATA_NODES = "totalDataNodes"; | ||
private static final String ACTIVE_SHARDS = "totalShards"; | ||
private static final String UNASSIGNED_SHARDS = "totalUnassignedShards"; | ||
private static final String INITIALIZING_SHARDS = "totalInitializingShards"; | ||
private static final String AWARENESS_ATTRIBUTE = "AwarenessAttribute"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these variables also present in ClusterHealthResponse
with the only change is awareness attribute health. If it is just about adding one map in regular health check call, shall we just rather add query param in the health check call which would return more verbose health like here rather than build a new API altogether? Something like ?verbose_health
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. As cluster health is aggregation for cluster and not awareness attribute specific would be better to segregate this.
builder.field(ACTIVE_SHARDS, getActiveShards()); | ||
builder.field(UNASSIGNED_SHARDS, getUnassignedShards()); | ||
builder.field(INITIALIZING_SHARDS, getInitializingShards()); | ||
builder.field(AWARENESS_ATTRIBUTE, getAwarenessAttributeHealth()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field is the only difference between regular cluster health and this new API
builder.field(AWARENESS_ATTRIBUTE, getAwarenessAttributeHealth());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
Gradle Check (Jenkins) Run Completed with:
|
CHANGELOG.md
Outdated
@@ -68,6 +68,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
|
|||
## [Unreleased 2.x] | |||
### Added | |||
- Added feature to check per awareness attribute value health([#5231](https://github.com/opensearch-project/OpenSearch/pull/5231)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're adding a new API here, so it's not a "feature to check", no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense will make that change
* | ||
* @opensearch.internal | ||
*/ | ||
public class ClusterHaHealthAction extends ActionType<ClusterHaHealthResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to ClusterAwarenessHealthAction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will make the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
public class ClusterHaHealthAction extends ActionType<ClusterHaHealthResponse> { | ||
|
||
public static final ClusterHaHealthAction INSTANCE = new ClusterHaHealthAction(); | ||
public static final String NAME = "cluster:monitor/ha_health"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: awareness_health
* @param clusterState The current cluster state. Must not be null. | ||
* @param clusterSettings the current cluster settings. | ||
*/ | ||
public ClusterHaHealth(ClusterState clusterState, ClusterSettings clusterSettings) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor is just too big, would prefer smaller methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will make this as small methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added smaller methods
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
@@ -205,6 +207,15 @@ public interface ClusterAdminClient extends OpenSearchClient { | |||
*/ | |||
void state(ClusterStateRequest request, ActionListener<ClusterStateResponse> listener); | |||
|
|||
/** | |||
* The ha_health of the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ha_health/awareness_health/g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
// Getting the awareness attribute list | ||
List<String> awarenessAttributeList = clusterSettings.get( | ||
AwarenessAllocationDecider.CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING | ||
); | ||
|
||
// Getting the replicaEnforcement settings as both are necessary for replica enforcement. | ||
boolean replicaEnforcement = clusterSettings.get(AwarenessReplicaBalance.CLUSTER_ROUTING_ALLOCATION_AWARENESS_BALANCE_SETTING); | ||
Settings forcedAwarenessSettings = clusterSettings.get( | ||
AwarenessAllocationDecider.CLUSTER_ROUTING_ALLOCATION_AWARENESS_FORCE_GROUP_SETTING | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add a validator which validates for following things
- CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING
- CLUSTER_ROUTING_ALLOCATION_AWARENESS_FORCE_GROUP_SETTING
- replicaEnforcement
- Single attribute only (two attributes like zone and rack are not supported)
If validation fails, we can do either of these
- [Preferred] Throw an exception
- Not populate ZoneHealth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed we will through empty map and not exception so not moving the same to validator
); | ||
} | ||
|
||
private ClusterHealthStatus generateIndexAndStateStats(ClusterState clusterState) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we reuse TransportClusterHealthAction
to do most of the heavy lifting here ? I did a quick POC earlier : https://github.com/gbbafna/admin-plugin/blob/main/src/main/java/org/opensearch/path/to/plugin/TransportClusterHAHealthAction.java#L113 . The benefits are two fold
- Any evolution of cluster health can be reused .
- Cluster Health is much more complex and provides option to wait for unassigned shards etc . We can reuse that later if needed .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done on consistency standpoint so that the state on full call is not changed. Will check if we can reuse the class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets populate status
as well in here and rename to setClusterHealth
.
private int unassignedShards; | ||
private int initializingShards; | ||
private ClusterHealthStatus status; | ||
private final Map<String, Object> collatedAttributeHealth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of Object
, lets created AttributeHealth
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed we are having multiple attribute value and per attribute value we are returning a list of Attribute values so instead of object class AttributeHealth class cannot be used
private int activeShards; | ||
private int unassignedShards; | ||
private int initializingShards; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add relocating shards as well .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
return shardAllocationPerNode; | ||
} | ||
|
||
private Map<String, Object> generateAwarenessAttributeHealthMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to break this function for better code readability?
1ae9406
to
9b5101f
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
9b5101f
to
9b3d294
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
} | ||
|
||
// Constructor use by Unit test case. | ||
public ClusterAwarenessAttributeValueHealth( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make all places have package private usages if used solely for tests
* | ||
* @param name name of awareness attribute | ||
*/ | ||
public ClusterAwarenessAttributeValueHealth(String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass the list of RoutingNode
with the same attribute value?
String nodeId; | ||
|
||
// computing active, relocating and initializing shards info | ||
for (ShardRouting shard : clusterState.routingTable().allShards()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to iterate over all shards for all distinct attribute value, its a wasted computation. Instead we could filter the RoutingNode
for this attribute value and pass it in the the ctor and iterate over only those set of nodes. This would also avoid the check for nodeInCurrentAZ
attributeValue = nodeAttributes.get(awarenessAttributeName); | ||
|
||
if (!attributesHealth.containsKey(attributeValue)) { | ||
clusterAwarenessAttributeValueHealth = new ClusterAwarenessAttributeValueHealth(attributeValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets pass the RoutingNode here to avoid unnecessary computation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
computation will still exists in the parent class . However will make the required change
|
||
// This is the map that would store all the stats per attribute ie | ||
// health stats for rack-1, rack-2 etc. | ||
awarenessAttributeValueHealthMap = new HashMap<>(Collections.emptyMap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove Collections.emptyMap()
?
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Gradle Check (Jenkins) Run Completed with:
|
double attributeWeight = 1.0; | ||
if (weightedRoutingMetadata != null) { | ||
WeightedRouting weightedRouting = weightedRoutingMetadata.getWeightedRouting(); | ||
attributeWeight = weightedRouting.weights().getOrDefault(name, 1.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should default to 1 if no weights specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then this would mean that all values are active and no specific parameter is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nishchay21, on a high level the changes look good to me. Can you also run this through @gbbafna once more to see if this looks good.
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Two nit comments regarding naming .
@@ -54,20 +56,20 @@ public ClusterAwarenessAttributesHealth( | |||
|
|||
// This is the Map which is storing the per attribute value health stats. | |||
// The key would be the attribute value like rack-1 and the stats would be stored as value. | |||
Map<String, ClusterAwarenessAttributeValueHealth> attributesHealth = new HashMap<>(); | |||
Map<String, List<String>> attributesHealth = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets rename this as this doesn't consists of health , but list of nodes node. Edit the comment above as well.
@@ -94,7 +93,7 @@ public ClusterAwarenessAttributesHealth( | |||
} | |||
|
|||
private void setClusterAwarenessAttributeValue( | |||
Map<String, ClusterAwarenessAttributeValueHealth> perAttributeValueHealthMap, | |||
Map<String, List<String>> perAttributeValueHealthMap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit : rename here as well .
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Gradle Check (Jenkins) Run Completed with:
|
…5231) (#5776) * Adding core changes for API to check per awareness attribute health. Signed-off-by: Nishchay Malhotra <nishcha@amazon.com> (cherry picked from commit 7578cd8) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: Nishchay Malhotra <nishcha@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Nishchay Malhotra nishcha@amazon.com
Description
The PR solves issue #5230. It adds the capability of new API where one can view per awareness attribute health status. As of today opensearch only shows health for full cluster however health per awareness attribute is still not seen. Added an API today monitor the cluster availability status per awareness attribute as that would give use more information on how many shards are assigned per awareness attribute value, how many nodes are distributes per awareness attribute value etc. Follwoing things can be shown per awareness attribute value [eg rack as attribute]:
Issues Resolved
#5230.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.