docs: describe omitted spread behavior and perf impact (hashicorp#2…

…3184)

Update the documentation for the `spread` block:

* Make it clear that the default behavior within a given job when the `spread` block is omitted is to spread out allocs among feasible nodes.
* Describe the difference between the `spread` block and `spread` scheduler algorithm.
* Add warnings about the performance impact of using `spread` and how to mitigate it.
lattwood · Jun 5, 2024 · 17093d6
1 parent abc6fe3
Showing 1 changed file with 51 additions and 13 deletions.
diff --git a/website/content/docs/job-specification/spread.mdx b/website/content/docs/job-specification/spread.mdx
@@ -23,8 +23,11 @@ description: >-
 The `spread` block allows operators to increase the failure tolerance of their
 applications by specifying a node attribute that allocations should be spread
 over. This allows operators to spread allocations over attributes such as
-datacenter, availability zone, or even rack in a physical datacenter. By
-default, when using spread the scheduler will attempt to place allocations
+datacenter, availability zone, or even rack in a physical datacenter.
+
+By default, when `spread` is omitted, the scheduler will attempt to place
+allocations from the same job on different nodes (and binpacked between
+jobs). When using `spread` the scheduler will attempt to place allocations
 equally among the available values of the given target.
 
 ```hcl
@@ -49,20 +52,23 @@ job "docs" {
 }
 ```
 
-Nodes are scored according to how closely they match the desired target percentage defined in the
-spread block. Spread scores are combined with other scoring factors such as bin packing.
+Nodes are scored according to how closely they match the desired target
+percentage defined in the spread block. Spread scores are combined with other
+scoring factors such as bin packing.
 
-A job or task group can have more than one spread criteria, with weights to express relative preference.
+A job or task group can have more than one spread criteria, with weights to
+express relative preference.
 
-Spread criteria are treated as a soft preference by the Nomad
-scheduler. If no nodes match a given spread criteria, placement is
-still successful. To avoid scoring every node for every placement,
-allocations may not be perfectly spread. Spread works best on
-attributes with similar number of nodes: identically configured racks
-or similarly configured datacenters.
+Spread criteria are treated as a soft preference by the Nomad scheduler. If no
+nodes match a given spread criteria, placement is still successful. To avoid
+scoring every node for every placement, allocations may not be perfectly
+spread. Spread works best on attributes with similar number of nodes:
+identically configured racks or similarly configured datacenters.
 
-Spread may be expressed on [attributes][interpolation] or [client metadata][client-meta].
-Additionally, spread may be specified at the [job][job] and [group][group] levels for ultimate flexibility. Job level spread criteria are inherited by all task groups in the job.
+Spread may be expressed on [attributes][interpolation] or [client
+metadata][client-meta].  Additionally, spread may be specified at the [job][job]
+and [group][group] levels for ultimate flexibility. Job level spread criteria
+are inherited by all task groups in the job.
 
 ## `spread` Parameters
 
@@ -84,6 +90,36 @@ Additionally, spread may be specified at the [job][job] and [group][group] level
 
 - `percent` `(integer:0)` - Specifies the percentage associated with the target value.
 
+## Comparison to `spread` Scheduling Algorithm
+
+The `spread` block is not the same concept as setting the [scheduler
+algorithm][] to `"spread"` instead of `"binpack"`. Setting the scheduler
+algorithm impacts all jobs on a cluster (or node pool), and adjusts the tendency
+of the scheduler to place workloads from different jobs on the same set of nodes
+or not. The `spread` block impacts how the scheduler places allocations for a
+given job.
+
+## Scheduling Performance
+
+Using the `spread` block can have a significant impact on scheduling
+performance. For each allocation in a `service` or `batch` job, the scheduler
+iterates over nodes until it finds a small number of feasible nodes. Those
+feasible nodes are then scored to find the best placement.
+
+When `spread` is omitted, this limit is 2 for batch jobs and the log<sub>2</sub>
+of the total number of nodes in the datacenter and node pool (with a minimum of
+2) for service jobs. When the `spread` block is present, the scheduler instead
+scores a number of nodes in the datacenter and node pool equal to the task group
+count (with a maximum of 100) per allocation. This can result in
+order-of-magnitude increases in scheduling times.
+
+To monitor scheduling times potentially impacted by `spread` blocks, examine the
+`nomad.nomad.worker.invoke_scheduler.*` metrics found in the [Key Metrics][] table. You
+can reduce scheduling times by avoiding `spread` and instead relying on the
+default distribution of a job across multiple nodes. If this is not possible,
+you may consider reducing the size of the node pool or datacenter to reduce the
+number of nodes available for the scheduler to consider.
+
 ## `spread` Examples
 
 The following examples show different ways to use the `spread` block.
@@ -165,3 +201,5 @@ spread {
 [interpolation]: /nomad/docs/runtime/interpolation 'Nomad interpolation'
 [node-variables]: /nomad/docs/runtime/interpolation#node-variables- 'Nomad interpolation-Node variables'
 [constraint]: /nomad/docs/job-specification/constraint 'Nomad Constraint job Specification'
+[Key Metrics]: /nomad/docs/operations/metrics-reference#key-metrics
+[scheduler algorithm]: /nomad/docs/commands/operator/scheduler/set-config#scheduler-algorithm
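
The distinction drawn in the new "Comparison to `spread` Scheduling Algorithm" section can be illustrated with a minimal sketch. It is not part of this commit. The file names, the `example` group, the `server` task, and the container image are hypothetical, and the two fragments are assumed to live in separate files: one in a server agent configuration, one in a job specification. The first fragment sets the cluster-wide scheduler algorithm through the server `default_scheduler_config` block; the second uses the job-level `spread` block to distribute one job's allocations across datacenters.

```hcl
# --- Fragment 1: server agent configuration (hypothetical file: server.hcl) ---
# Cluster-wide setting: changes how workloads from different jobs are packed
# onto nodes. This is the scheduler algorithm, not the `spread` block.
server {
  enabled = true

  default_scheduler_config {
    scheduler_algorithm = "spread" # alternative: "binpack"
  }
}

# --- Fragment 2: job specification (hypothetical file: docs.nomad.hcl) ---
# Per-job setting: asks the scheduler to distribute this job's allocations
# evenly across the values of the target attribute.
job "docs" {
  datacenters = ["dc1", "dc2"]

  group "example" {
    count = 4

    spread {
      attribute = "${node.datacenter}"
      weight    = 100
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine" # hypothetical image, for illustration only
      }
    }
  }
}
```

Removing the `spread` block from the second fragment leaves the job on the default behavior described in the diff, where allocations of the same job are still placed on different nodes where possible and the scheduler scores far fewer nodes per allocation.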