Commit 088fbe0

add wlm feature overview (#8632) (#8740)
1 parent c897a3c commit 088fbe0

1 file changed

+194
-0
lines changed

@@ -0,0 +1,194 @@
---
layout: default
title: Workload management
nav_order: 70
has_children: true
parent: Availability and recovery
---
8+
9+
Introduced 2.18
{: .label .label-purple }
11+
12+
# Workload management
13+
14+
Workload management allows you to group search traffic and isolate network resources, preventing specific requests from overusing those resources. It offers the following benefits:
15+
16+
- Tenant-level admission control and reactive query management. When resource usage exceeds configured limits, it automatically identifies and cancels demanding queries, ensuring fair resource distribution.
17+
18+
- Tenant-level isolation within the cluster for search workloads, operating at the node level.
19+
20+
## Installing workload management
21+
22+
To install workload management, use the following command:
23+
24+
```bash
./bin/opensearch-plugin install workload-management
```
27+
{% include copy-curl.html %}
28+
29+
## Query groups
30+
31+
A _query group_ is a logical grouping of tasks with defined resource limits. System administrators can dynamically manage query groups using the Workload Management APIs. These query groups can be used to create search requests with resource limits.
32+
33+
### Permissions
34+
35+
Only users with administrator-level permissions can create and update query groups using the Workload Management APIs.
36+
37+
### Operating modes
38+
39+
The following operating modes determine the operating level for a query group:
40+
41+
- **Disabled mode**: Workload management is disabled.
42+
43+
- **Enabled mode**: Workload management is enabled and will cancel and reject queries once the query group's configured thresholds are reached.
44+
45+
- **Monitor_only mode** (Default): Workload management will monitor tasks but will not cancel or reject any queries.
46+
47+
### Example request
48+
49+
The following example request adds a query group named `analytics`:
50+
51+
```json
PUT _wlm/query_group
{
  "name": "analytics",
  "resiliency_mode": "enforced",
  "resource_limits": {
    "cpu": 0.4,
    "memory": 0.2
  }
}
```
62+
{% include copy-curl.html %}
63+
64+
When creating a query group, make sure that the sum of the limits for a single resource, such as `cpu` or `memory`, across all query groups does not exceed `1`.
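
The limit check described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the Workload Management APIs: the helper names `total_limits` and `over_allocated` and the `reporting` query group are hypothetical, and the check assumes that you already have the `resource_limits` of the existing query groups available locally.

```python
def total_limits(query_groups):
    """Sum each resource's limit across all of the given query groups."""
    totals = {}
    for group in query_groups:
        for resource, limit in group["resource_limits"].items():
            totals[resource] = totals.get(resource, 0.0) + limit
    return totals


def over_allocated(query_groups, cap=1.0):
    """Return the names of resources whose combined limit exceeds the cap."""
    totals = total_limits(query_groups)
    return sorted(name for name, total in totals.items() if total > cap)


# The `analytics` group from the example request above.
existing = [{"name": "analytics", "resource_limits": {"cpu": 0.4, "memory": 0.2}}]

# A hypothetical second group used only to illustrate the check.
candidate = {"name": "reporting", "resource_limits": {"cpu": 0.7, "memory": 0.3}}

print(over_allocated(existing + [candidate]))  # prints ['cpu']
```

Here the combined `cpu` limit is greater than `1`, so the hypothetical `reporting` group would need a smaller `cpu` limit, while the combined `memory` limit of `0.5` is within bounds.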
65+
66+
### Example response
67+
68+
OpenSearch responds with the set resource limits and the `_id` for the query group:
69+
70+
```json
{
  "_id": "preXpc67RbKKeCyka72_Gw",
  "name": "analytics",
  "resiliency_mode": "enforced",
  "resource_limits": {
    "cpu": 0.4,
    "memory": 0.2
  },
  "updated_at": 1726270184642
}
```
82+
83+
## Using `queryGroupId`
84+
85+
You can associate a query request with a `queryGroupId` to manage and allocate resources within the limits defined by the query group. When you use this ID, request routing and tracking are associated with the query group, ensuring that resource quotas and task limits are maintained.
86+
87+
The following example query uses the `queryGroupId` to ensure that the query does not exceed that query group's resource limits:
88+
89+
```json
GET testindex/_search
Host: localhost:9200
Content-Type: application/json
queryGroupId: preXpc67RbKKeCyka72_Gw
{
  "query": {
    "match": {
      "field_name": "value"
    }
  }
}
```
102+
{% include copy-curl.html %}
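
As a rough illustration of how a client might attach the header, the following Python sketch builds the same request using only the standard library. The helper name `build_search_request` is hypothetical, and the `http` scheme and the absence of authentication are assumptions about a local test cluster. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, query_group_id, query):
    """Build a search request that carries the query group ID as an HTTP header."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={
            "Content-Type": "application/json",
            "queryGroupId": query_group_id,  # header name taken from the example above
        },
        method="GET",
    )


request = build_search_request(
    "http://localhost:9200",  # assumed scheme and host for a local cluster
    "testindex",
    "preXpc67RbKKeCyka72_Gw",
    {"match": {"field_name": "value"}},
)

# Passing `request` to urllib.request.urlopen() would send it to the cluster;
# that call is left out so the sketch runs without a live node.
print(request.get_method(), request.full_url)
```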
103+
104+
## Workload management settings
105+
106+
The following settings can be used to customize workload management using the `_cluster/settings` API.
107+
108+
| **Setting name** | **Description** |
| :--- | :--- |
| `wlm.query_group.duress_streak` | Determines the node duress threshold. Once the threshold is reached, the node is marked as `in duress`. |
| `wlm.query_group.enforcement_interval` | Defines the monitoring interval. |
| `wlm.query_group.mode` | Defines the [operating mode](#operating-modes). |
| `wlm.query_group.node.memory_rejection_threshold` | Defines the query group level `memory` threshold. When the threshold is reached, the request is rejected. |
| `wlm.query_group.node.cpu_rejection_threshold` | Defines the query group level `cpu` threshold. When the threshold is reached, the request is rejected. |
| `wlm.query_group.node.memory_cancellation_threshold` | Controls whether the node is considered to be in duress when the `memory` threshold is reached. Requests routed to nodes in duress are canceled. |
| `wlm.query_group.node.cpu_cancellation_threshold` | Controls whether the node is considered to be in duress when the `cpu` threshold is reached. Requests routed to nodes in duress are canceled. |
117+
118+
When setting rejection and cancellation thresholds, remember that the rejection threshold for a resource should always be lower than the cancellation threshold.
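
The following Python sketch shows one way to enforce this ordering before sending a `_cluster/settings` request. It is an illustration only: the helper name `threshold_settings` is hypothetical, the threshold values `0.8` and `0.9` are arbitrary examples rather than recommended values, and the choice of the `persistent` settings block is an assumption.

```python
import json


def threshold_settings(resource, rejection, cancellation):
    """Build a cluster settings body for one resource's node-level thresholds.

    Raises ValueError unless the rejection threshold is lower than the
    cancellation threshold.
    """
    if resource not in ("cpu", "memory"):
        raise ValueError(f"unsupported resource: {resource}")
    if rejection >= cancellation:
        raise ValueError("rejection threshold must be lower than cancellation threshold")
    prefix = "wlm.query_group.node"
    return {
        "persistent": {  # assumption: the settings should survive a cluster restart
            f"{prefix}.{resource}_rejection_threshold": rejection,
            f"{prefix}.{resource}_cancellation_threshold": cancellation,
        }
    }


# Illustrative values only.
body = threshold_settings("cpu", rejection=0.8, cancellation=0.9)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of a `PUT _cluster/settings` request.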
119+
120+
## Workload Management Stats API
121+
122+
The Workload Management Stats API returns workload management metrics for a query group, using the following method:
123+
124+
```json
GET _wlm/stats
```
127+
{% include copy-curl.html %}
128+
129+
### Example response
130+
131+
```json
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "XXXXXXYYYYYYYY",
  "A3L9EfBIQf2anrrUhh_goA": {
    "query_groups": {
      "16YGxFlPRdqIO7K4EACJlw": {
        "total_completions": 33570,
        "total_rejections": 0,
        "total_cancellations": 0,
        "cpu": {
          "current_usage": 0.03319935314357281,
          "cancellations": 0,
          "rejections": 0
        },
        "memory": {
          "current_usage": 0.002306486276211217,
          "cancellations": 0,
          "rejections": 0
        }
      },
      "DEFAULT_QUERY_GROUP": {
        "total_completions": 42572,
        "total_rejections": 0,
        "total_cancellations": 0,
        "cpu": {
          "current_usage": 0,
          "cancellations": 0,
          "rejections": 0
        },
        "memory": {
          "current_usage": 0,
          "cancellations": 0,
          "rejections": 0
        }
      }
    }
  }
}
```
175+
176+
177+
### Response body fields
178+
179+
| Field name | Description |
| :--- | :--- |
| `total_completions` | The total number of request completions in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `total_rejections` | The total number of request rejections in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `total_cancellations` | The total number of cancellations in the `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `cpu` | The `cpu` resource type statistics for the `query_group`. |
| `memory` | The `memory` resource type statistics for the `query_group`. |
186+
187+
### Resource type statistics
188+
189+
| Field name | Description |
| :--- | :--- |
| `current_usage` | The resource usage for the `query_group` at the given node based on the last run of the monitoring thread. This value is updated based on the `wlm.query_group.enforcement_interval`. |
| `cancellations` | The number of cancellations resulting from the cancellation threshold being reached. |
| `rejections` | The number of rejections resulting from the rejection threshold being reached. |
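
To show how these fields fit together, the following Python sketch aggregates the per-node counters of a stats response into per-query-group totals. The helper name `summarize_stats` is hypothetical, and the sample data is a trimmed copy of the example response above. The sketch assumes that every top-level entry containing a `query_groups` object represents a node.

```python
def summarize_stats(stats):
    """Sum completions, rejections, and cancellations per query group across nodes."""
    summary = {}
    for node in stats.values():
        # Skip top-level entries such as `_nodes` and `cluster_name`.
        if not isinstance(node, dict) or "query_groups" not in node:
            continue
        for group_id, group in node["query_groups"].items():
            totals = summary.setdefault(
                group_id, {"completions": 0, "rejections": 0, "cancellations": 0}
            )
            totals["completions"] += group["total_completions"]
            totals["rejections"] += group["total_rejections"]
            totals["cancellations"] += group["total_cancellations"]
    return summary


# Trimmed copy of the example response: one node, two query groups.
sample = {
    "_nodes": {"total": 1, "successful": 1, "failed": 0},
    "cluster_name": "XXXXXXYYYYYYYY",
    "A3L9EfBIQf2anrrUhh_goA": {
        "query_groups": {
            "16YGxFlPRdqIO7K4EACJlw": {
                "total_completions": 33570,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
            "DEFAULT_QUERY_GROUP": {
                "total_completions": 42572,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
        }
    },
}

for group_id, totals in summarize_stats(sample).items():
    print(group_id, totals)
```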
194+
