Commit 4408030
CA-420968: avoid large performance hit on small NUMA nodes
NUMA optimized placement can have a large performance hit on machines with
small NUMA nodes and VMs with a large number of vCPUs.
For example, consider a machine with 2 sockets, where a single socket (NUMA
node) can run at most 32 vCPUs, and a VM with 32 vCPUs.
Usually Xen would try to spread the load across actual cores, and avoid the
hyperthread siblings, e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA node.
If that NUMA node doesn't have enough cores, then Xen will have no choice but
to use CPUs 0,1,2,3,etc.
Hyperthread siblings share execution resources, so using both at the same
time incurs a large, workload-dependent performance hit.
Avoid this by "requesting" cores=vcpus for each VM, which makes the
placement algorithm choose the next size up in NUMA nodes (i.e. instead of
1 NUMA node, use 2 or 3 as needed, falling back to all nodes).
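The widening described above can be sketched as follows. This is a minimal illustration, not the actual xenopsd code: the types and the `pick_nodes` helper are hypothetical, and it assumes a node is only usable on its own if it has at least as many physical cores as the VM has vCPUs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int
    cores: int  # physical cores, not hyperthreads


def pick_nodes(vcpus: int, nodes: list[Node]) -> list[Node]:
    """Smallest set of nodes (largest first) whose cores cover the vCPUs."""
    chosen, cores = [], 0
    for node in sorted(nodes, key=lambda n: n.cores, reverse=True):
        chosen.append(node)
        cores += node.cores
        if cores >= vcpus:  # enough dedicated cores: stop widening
            return chosen
    return chosen  # not enough cores anywhere: fall back to all nodes
```

With two 16-core nodes, an 8-vCPU VM gets one node, a 32-vCPU VM gets both, and anything larger falls back to all nodes.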
The potential gain from reducing memory latency with a NUMA optimized placement
(~20% on Intel Memory Latency Checker: Idle latency) is outweighed by
the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and
OpenVINO), so this is the correct trade-off.
If the NUMA node is large enough, or if the VM has a small number of vCPUs,
then we still try to use a single NUMA node, as we did previously.
The performance difference can be reproduced and verified easily by running
`openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that has 2 NUMA
nodes, with 32 PCPUs each, and 2 threads per core.
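The arithmetic behind that repro setup, written out (plain topology math, nothing xenopsd-specific):

```python
# 2 threads per core means each 32-PCPU NUMA node has only 16 physical
# cores, so a 32-vCPU VM confined to one node must double up on
# hyperthread siblings; covering it with dedicated cores needs 2 nodes.
pcpus_per_node = 32
threads_per_core = 2
vcpus = 32

cores_per_node = pcpus_per_node // threads_per_core  # 16
fits_on_one_node = cores_per_node >= vcpus           # False
nodes_needed = -(-vcpus // cores_per_node)           # ceiling division: 2
```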
This introduces a policy that can control whether we want to filter out
NUMA nodes with too few cores.
Although we want to enable this filter by default, we still want
an "escape hatch" to turn it off if we find problems with it.
That is why the "compat" setting (numa-placement=true) in xenopsd.conf
reverts to the old policy, which is now named explicitly as Prio_mem_only.
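For reference, the escape hatch could look like this in xenopsd.conf (key spelling as used in this message; verify against the defaults shipped with your xenopsd build):

```
# Compat escape hatch: revert NUMA placement to the old behaviour,
# now named Prio_mem_only (prefer a single node by free memory only).
numa-placement=true
```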
There could still be workloads where optimizing for memory bandwidth makes more
sense (although that is a property of the NUMA node, not of individual VMs),
so although it might be desirable to make this a per-VM policy, it cannot be,
because it affects other VMs too.
TODO: when sched-gran=core is used, this filter should be turned off: core
scheduling always incurs the hyperthread-sharing hit, so we might as well use
smaller NUMA nodes if available.
For now this isn't exposed yet as a XAPI-level policy, because that requires
more changes (to also sort by free cores on a node, and to also sort at the
pool level by free cpus on a host).
Once we have those changes we can introduce a new policy `prio_core_mem`
to sort by free cores first, then by free memory, which requires cores>=vcpus
(i.e. cpus>=vcpus*threads_per_core) when choosing a node.
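A sketch of what that future `prio_core_mem` ranking could look like. This policy is not yet implemented; the names and types here are illustrative assumptions only:

```python
from dataclasses import dataclass


@dataclass
class Node:
    free_cores: int
    free_mem: int


def prio_core_mem(vcpus: int, nodes: list[Node]) -> list[Node]:
    # Require cores >= vcpus (equivalently cpus >= vcpus * threads_per_core).
    eligible = [n for n in nodes if n.free_cores >= vcpus]
    # Sort by free cores first, then by free memory, largest first.
    return sorted(eligible,
                  key=lambda n: (n.free_cores, n.free_mem),
                  reverse=True)
```

A node with too few cores is dropped entirely; among the rest, free cores dominate and free memory only breaks ties.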
This changes the default to the new setting, which should be equal to or an
improvement on the previous behaviour in the general case.
An "escape hatch" to revert to the previous behaviour is to set
`numa-placement=true` in xenopsd.conf, and the XAPI host-level policy to
'default_policy'.
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
2 files changed (+5, -3) under ocaml/xapi-idl/xen and ocaml/xenopsd/lib.