
Commit 4408030

CA-420968: avoid large performance hit on small NUMA nodes
NUMA-optimized placement can cause a large performance hit on machines with small NUMA nodes and VMs with a large number of vCPUs. Take, for example, a machine with 2 sockets that can run at most 32 vCPUs in a single socket (NUMA node), and a VM with 32 vCPUs. Usually Xen would spread the load across physical cores and avoid the hyperthread siblings, e.g. using CPUs 0, 2, 4, etc. But when NUMA placement is used, all the vCPUs must be in the same NUMA node. If that NUMA node doesn't have enough cores, Xen has no choice but to use CPUs 0, 1, 2, 3, etc. Hyperthread siblings share resources, and using both at the same time incurs a big performance hit, depending on the workload.

Avoid this by "requesting" cores = vcpus for each VM, which makes the placement algorithm choose the next size up in terms of NUMA nodes (i.e. instead of 1 NUMA node, use 2 or 3 as needed, falling back to all nodes if necessary). The potential gain from reducing memory latency with a NUMA-optimized placement (~20% on Intel Memory Latency Checker: Idle latency) is outweighed by the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and OpenVINO), so this is the correct trade-off. If the NUMA node is large enough, or the VM has a small number of vCPUs, we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by running `openssl speed -multi 32 rsa4096` on a 32-vCPU VM on a host with 2 NUMA nodes, 32 PCPUs each, and 2 threads per core.

This introduces a policy that controls whether we filter out NUMA nodes with too few cores. Although we want to enable this filter by default, we still want an "escape hatch" to turn it off if we find problems with it. That is why the "compat" setting (numa_placement=true) in xenopsd.conf reverts to the old policy, which is now named explicitly Prio_mem_only.
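The sizing argument above can be illustrated with a small, self-contained OCaml sketch (the helper `nodes_needed` is hypothetical, not part of xenopsd): with 32 vCPUs and nodes of 16 physical cores (32 PCPUs at 2 threads per core), requesting cores = vcpus grows the placement from 1 node to 2.

```ocaml
(* Hypothetical helper, not xenopsd code: how many NUMA nodes are
   needed so that every vCPU can get its own physical core, falling
   back to all nodes when even that is not enough. *)
let nodes_needed ~vcpus ~cores_per_node ~total_nodes =
  if cores_per_node <= 0 then total_nodes
  else min total_nodes ((vcpus + cores_per_node - 1) / cores_per_node)

let () =
  (* Host from the commit message: 2 nodes, 32 PCPUs each, 2 threads
     per core, i.e. 16 physical cores per node. A 32-vCPU VM now spans
     2 nodes instead of the hyperthread siblings of 1 node. *)
  assert (nodes_needed ~vcpus:32 ~cores_per_node:16 ~total_nodes:2 = 2) ;
  (* A small VM still fits on a single node, as before. *)
  assert (nodes_needed ~vcpus:8 ~cores_per_node:16 ~total_nodes:2 = 1) ;
  (* An oversized VM falls back to using all nodes. *)
  assert (nodes_needed ~vcpus:64 ~cores_per_node:16 ~total_nodes:2 = 2)
```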
There could still be workloads where optimizing for memory bandwidth makes more sense (although that is a property of the NUMA node, not of individual VMs). So although it might be desirable for this to be a per-VM policy, it cannot be, because it affects other VMs too.

TODO: when sched-gran=core this should be turned off. That configuration always has the performance hit, so we might as well use smaller NUMA nodes if available.

For now this isn't exposed as a XAPI-level policy, because that requires more changes (sorting by free cores on a node, and sorting at the pool level by free CPUs on a host). Once we have those changes we can introduce a new policy `prio_core_mem` that sorts by free cores first, then by free memory, and requires cores >= vcpus (i.e. cpus >= vcpus * threads_per_core) when choosing a node.

This changes the default to the new setting, which should be equal to or an improvement on the old one in the general case. The "escape hatch" to revert to the previous behaviour is to set `numa-placement=true` in xenopsd.conf, and the XAPI host-level policy to 'default_policy'.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
1 parent 73a5f40

File tree: 2 files changed (+5, −3 lines)


ocaml/xapi-idl/xen/xenops_interface.ml

Lines changed: 3 additions & 1 deletion

@@ -492,7 +492,9 @@ module Host = struct
     | Best_effort
         (** Best-effort placement. Assigns the memory of the VM to a single
             node, and soft-pins its VCPUs to the node, if possible. Otherwise
-            behaves like Any. *)
+            behaves like Any.
+            The node(s) need to have enough cores to run all the vCPUs of the VM
+        *)
     | Best_effort_hard (** Like Best_effort, but hard-pins the VCPUs *)
     | Prio_mem_only
         (** Prioritizes reducing memory bandwidth, ignores CPU overload *)

ocaml/xenopsd/lib/xenops_server.ml

Lines changed: 2 additions & 2 deletions

@@ -3630,9 +3630,9 @@ let affinity_of_numa_affinity_policy =
   function
   | Any | Best_effort | Prio_mem_only -> Soft | Best_effort_hard -> Hard

-let cores_of_numa_affinity_policy policy ~vcpus:_ =
+let cores_of_numa_affinity_policy policy ~vcpus =
   let open Xenops_interface.Host in
-  match policy with _ -> 0
+  match policy with Any | Prio_mem_only -> 0 | _ -> vcpus

 module HOST = struct
   let stat _ dbg =
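The future `prio_core_mem` ordering described in the commit message could look roughly like this OCaml sketch (the `node` record and `rank_nodes` are hypothetical simplifications; the real xenopsd planner uses richer types): filter out nodes that cannot give each vCPU its own physical core, then sort by free cores, breaking ties on free memory.

```ocaml
(* Hypothetical node record; the real planner tracks much more state. *)
type node = {id: int; free_cores: int; free_mem: int64}

(* prio_core_mem sketch: require cores >= vcpus, then prefer more free
   cores, then more free memory. *)
let rank_nodes ~vcpus nodes =
  nodes
  |> List.filter (fun n -> n.free_cores >= vcpus)
  |> List.sort (fun a b ->
         match compare b.free_cores a.free_cores with
         | 0 -> compare b.free_mem a.free_mem
         | c -> c )

let example_nodes =
  [ {id= 0; free_cores= 16; free_mem= 4L}   (* too small for 24 vCPUs *)
  ; {id= 1; free_cores= 32; free_mem= 8L}
  ; {id= 2; free_cores= 32; free_mem= 16L} ]

let () =
  (* Node 0 is filtered out; nodes 2 and 1 tie on cores, so node 2
     wins on free memory. *)
  assert (List.map (fun n -> n.id) (rank_nodes ~vcpus:24 example_nodes) = [2; 1])
```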
