@edwintorok
Contributor

@edwintorok edwintorok commented Nov 21, 2025

NUMA optimized placement can have a large performance hit on machines with small NUMA nodes and VMs with a large number of vCPUs. For example, take a machine with 2 sockets that can run at most 32 vCPUs in a single socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid the hyperthread siblings (when the machine is sufficiently idle, or the workload is bursty), e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA node. If that NUMA node doesn't have enough cores, then Xen will have no choice but to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the same time you get a big performance hit, depending on the workload. We've also seen this previously with Xen's core-scheduling support (which is off by default).

Avoid this by "requesting" threads_per_core times more vCPUs for each VM, which will make the placement algorithm choose the next size up in terms of NUMA nodes (i.e. instead of a single NUMA node use 2,3 as needed, falling back to using all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized placement (~20% on Intel Memory Latency Checker: Idle latency) is outweighed by the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and OpenVINO), so this is the correct tradeoff.

If the NUMA node is large enough, or if the VMs have a small number of vCPUs then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by running openssl speed -multi 32 rsa4096 on a 32 vCPU VM on a host that has 2 NUMA nodes, with 32 PCPUs each, and 2 threads per core.
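
To make the arithmetic concrete, here is a rough sketch of how the inflated request changes the number of NUMA nodes picked. It is not the xenopsd code: the function name `nodes_needed` and the assumption that all nodes are the same size are mine, purely for illustration.

```python
import math


def nodes_needed(vcpus: int, threads_per_core: int,
                 pcpus_per_node: int, total_nodes: int) -> int:
    """Smallest number of (equally sized) NUMA nodes that leaves one
    full core available per vCPU. Hypothetical helper, not xenopsd code."""
    # "Request" threads_per_core times more CPUs than the VM has vCPUs.
    requested_cpus = vcpus * threads_per_core
    needed = math.ceil(requested_cpus / pcpus_per_node)
    # Use at least one node, and fall back to all nodes if nothing smaller fits.
    return min(max(needed, 1), total_nodes)


# Host from the description: 2 NUMA nodes, 32 pCPUs each, 2 threads per core.
print(nodes_needed(32, 2, 32, 2))  # 2: the 32 vCPU VM no longer fits in one node
print(nodes_needed(16, 2, 32, 2))  # 1: a 16 vCPU VM still gets a single node
```

With the old behaviour (no multiplication) the 32 vCPU VM would fit into a single 32 pCPU node, which is exactly the case that forces vCPUs onto hyperthread siblings.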

@edwintorok
Contributor Author

Draft PR because this requires more testing, and perhaps also introducing another NUMA policy enum that preserves the original behaviour, just in case we need it.

Eventually it might also be useful to introduce a NUMA policy for VMs, so that some VMs could be NUMA optimized, and not others. Now that we have proper memory reservation in the new version of Xen, and we know exactly where each VM will go we could do that (in the initial implementation for the old version of Xen we couldn't, because unless all VMs got balanced we couldn't predict how much memory would be left when booting a VM, unless it was all part of a known, small number of NUMA nodes). But that is a larger change that may come in a future PR.

@psafont
Member

psafont commented Nov 21, 2025

Does the slowdown happen for any number of vCPUs? For example, if a NUMA node has 4 SMT cores and the VM requests 4 vCPUs?

@edwintorok
Contributor Author

edwintorok commented Nov 21, 2025

> Does the slowdown happen for any number of vcpus? For example, if a NUMA node has 4 SMT cores and the VM requests 4 vCPUs

We don't have any CPUs with threads_per_core = 4 in our lab, only 1 or 2. I think architectures other than x86-64 would have SMT4 cores. I used physinfo.threads_per_core to be future proof though (Xeon Phi had 4 threads per core, and maybe they'll try again with other microarchitectures).

I'd expect that the slowdown would be even worse with SMT4, although it really depends which resources are shared between the threads, and which resources are duplicated. For example here https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#Scheduler_Ports_.26_Execution_Units there are 7 execution ports shared between all threads on a core. But out of order execution can also use those for instruction-level-parallelism.

If you have good code that takes full advantage of ILP (e.g. most low-level benchmarks, numeric code, etc.), then you might be able to saturate the core with just 1 hyperthread sibling. In that case attempting to use the other hyperthread sibling(s) for anything else would slow down both. This is true regardless of whether NUMA optimization is in use or not (but NUMA optimization makes it more likely that you hit this problem, because you have fewer cores available).
OTOH if you have code that doesn't take full advantage of ILP (e.g. it branches a lot in unpredictable ways, or has a lot of data dependencies) then you'd have free execution ports, and running something on the hyperthread sibling wouldn't affect the first one much (and overall it would result in a performance increase, because you are now also using the hyperthread siblings, so you have up to 2x the processing power).

OTOH my fix only works if the other sibling is idle. If it gets used by a different VM then the situation could be even worse, e.g. the DTLB may not be duplicated for each thread on all microarchitectures. But that can also happen if NUMA optimization is completely turned off.

@mg12 suggested another optimization: try to still run the VM on a single NUMA node, and only when it is too busy spread (the CPUs) out. Although this may be too late because the memory is already bound to a single NUMA node, and Xen doesn't have runtime rebalancing like Linux would.
Anyway the optimization could be tried out if we change the soft affinity mask slightly to use only every threads_per_core-th CPU in the set, i.e. instead of soft pinning to 32-63, we soft pin to 32,34,...,62. Then if the VM uses at most 16 of its vCPUs at once, it'd still benefit from the NUMA speedup, and once it starts using more, it'll start running some of its vCPUs elsewhere. TBC whether this actually results in any better performance than just not pinning at all and spreading the memory (so that when vCPUs run elsewhere they have some chance to use local memory).
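
A minimal sketch of that reduced soft affinity mask, assuming (as in the examples above) that hyperthread siblings have consecutive CPU numbers. The helper name `sibling_free_soft_affinity` is made up for illustration and is not an existing xenopsd function.

```python
def sibling_free_soft_affinity(first_cpu: int, last_cpu: int,
                               threads_per_core: int) -> list[int]:
    """Pick one CPU per core from the inclusive range [first_cpu, last_cpu].

    Assumes siblings are numbered consecutively, so stepping by
    threads_per_core skips the sibling threads of every core."""
    return list(range(first_cpu, last_cpu + 1, threads_per_core))


# Second NUMA node of the example host: CPUs 32-63, 2 threads per core.
mask = sibling_free_soft_affinity(32, 63, 2)
print(mask[:3], "...", mask[-1], len(mask))  # [32, 34, 36] ... 62 16
```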

I can introduce a few more enums in the NUMA policy, and change best_effort to be an alias for whichever one testing shows to perform better across a wider set of configurations. The user can then override the policy for different workloads (although this really needs per-VM policy support too).

@last-genius
Contributor

last-genius commented Nov 21, 2025

> We don't have any CPUs with threads_per_core = 4 in our lab, only 1 or 2. I think architectures other than x86-64 would have SMT4 cores. I used physinfo.threads_per_core to be future proof though (Xeon Phi had 4 threads per core, and maybe they'll try again with other microarchitectures).

I think Pau was asking about a NUMA node with 4 threads/2 cores; not an SMT with 4 threads per core - i.e. how do the performance numbers scale in case of smaller NUMA nodes than ones with 32 threads

@edwintorok
Contributor Author

> > We don't have any CPUs with threads_per_core = 4 in our lab, only 1 or 2. I think architectures other than x86-64 would have SMT4 cores. I used physinfo.threads_per_core to be future proof though (Xeon Phi had 4 threads per core, and maybe they'll try again with other microarchitectures).
>
> I think Pau was asking about a NUMA node with 4 threads/2 cores; not an SMT with 4 threads per core - i.e. how do the performance numbers scale in case of smaller NUMA nodes than ones with 32 threads

I'd expect the performance regression to be about the same when looking at relative numbers (and assuming that you now use the -multi 4 argument to the OpenSSL benchmark), but that'd be interesting to confirm.

@edwintorok
Contributor Author

edwintorok commented Nov 21, 2025

There is also potentially a problem with the way we balance VMs across NUMA nodes: we currently only take memory into account. But CPU overload can have a much larger impact than the memory latency of using remote NUMA nodes.
Maybe we'd also need a policy that, when picking a node, sorts by most free CPUs in a NUMA node first, and then by available memory (obviously excluding nodes without enough memory). And keep sorting by memory when picking the number of NUMA nodes to use.

E.g. if you have 1 large VM with 256GiB and 4 smaller VMs with 64GiB each, all with the same vCPU count, we probably don't want to end up with all 4 of the small ones on the same NUMA node.
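
A rough sketch of the proposed ordering when picking a single node. The `Node` record and its `free_cpus` field are hypothetical; this only illustrates the sort key and is not an existing implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    free_cpus: int     # hypothetical per-node statistic
    free_mem_gib: int


def candidate_order(nodes: list[Node], vm_mem_gib: int) -> list[Node]:
    """Drop nodes without enough memory, then prefer the node with the
    most free CPUs, using free memory only as a tie-breaker."""
    eligible = [n for n in nodes if n.free_mem_gib >= vm_mem_gib]
    return sorted(eligible, key=lambda n: (-n.free_cpus, -n.free_mem_gib))


nodes = [
    Node(0, free_cpus=8, free_mem_gib=300),   # lots of memory, CPUs mostly busy
    Node(1, free_cpus=24, free_mem_gib=100),
    Node(2, free_cpus=24, free_mem_gib=40),   # not enough memory for a 64 GiB VM
]
print([n.node_id for n in candidate_order(nodes, vm_mem_gib=64)])  # [1, 0]
```

Sorting by memory alone would pick node 0 first here, even though it has the fewest free CPUs.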

@edwintorok edwintorok force-pushed the private/edvint/numafix branch from 0f84e35 to 07a0381 Compare November 25, 2025 18:49
@edwintorok
Contributor Author

The previous small patch had a bug: it also tried to iterate beyond the number of vCPUs assigned to the VM.
Also the sorting is not quite right, and we should sort by number of free CPUs first.
To avoid a combinatorial explosion in XAPI I've introduced the low level choices in xenopsd as separate record fields,
and at the XAPI level I've only introduced 1 new policy: prio_cpu_mem, which multiplies the number of vcpus a VM has by threads_per_core of the host.
best_effort becomes an alias for this one, and prio_mem_cpu is the old one, should we need it.

Sorting is not yet implemented, and this still needs some testing, so keeping as a draft.

@edwintorok
Contributor Author

We should also sort at the pool level by available CPUs first, and then by memory, otherwise a VM with a large amount of memory could create a very unbalanced pool, with some hosts having higher CPU oversubscription than others.
But I'll do that in a separate PR.

@edwintorok
Contributor Author

Probably better to avoid looking at threads_per_core, and count the number of cores we see in the topology instead for each NUMA node. I'm not sure what the threads_per_core setting would show in Xen when it is booted with smt=false, but in that case we don't need the doubling.

@edwintorok
Contributor Author

Starting to look too complex though, I'll try to simplify, so don't review this yet.

@edwintorok edwintorok force-pushed the private/edvint/numafix branch from 07a0381 to 5266732 Compare November 26, 2025 16:29
@edwintorok
Contributor Author

edwintorok commented Nov 26, 2025

I simplified this by rewriting it from scratch; the policy is no longer exposed to XAPI. There is no new escape hatch: we use the existing one in xenopsd.conf, numa-placement=true, which sets the policy to the behaviour prior to this commit. The previous behaviour is renamed Prio_mem_only.

I'll leave the sorting to future PRs, because xenopsd doesn't currently seem to have the required statistics available on how many cpus are free on a numa node.

The actual change is the one-liner in the last commit that changes what Best_effort does; all the previous commits introduce the mechanisms.

Going to test this now.

@edwintorok edwintorok force-pushed the private/edvint/numafix branch from 5266732 to aadaa1e Compare November 27, 2025 09:35
@edwintorok
Contributor Author

edwintorok commented Nov 27, 2025

My idea of using the existing numa-placement setting didn't work due to some left-over tech debt. That (deprecated, according to its docs) setting is turned on by default in XS9 (and not the policy in XAPI), because that is how we turn numa on and off between the 2 Xen versions. So it isn't deprecated after all.

I'll introduce a new boolean instead.

@edwintorok edwintorok force-pushed the private/edvint/numafix branch from aadaa1e to 5fb12ea Compare November 27, 2025 10:02
…ode set

Could also compute it by multiplying it with [threads_per_core],
but I'm not sure how that'd interact with [smt=false] in Xen.
Also to future-proof this I wouldn't want to rely on an entirely
symmetrical architecture
(although it'd be very rare to have anything other than 2 on x86-64,
 or to have hyperthreading on in one socket, and off in another).

Note that core ids are not unique (there is a core `0` on both socket 0 and
socket 1 for example), so only work with number of cores in the topology code.

Could've created a CoreSocketSet instead (assuming that no higher grouping than
sockets would exist in the future), but for now don't make too many assumptions
about topology.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
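
To illustrate why core ids alone cannot be used to count cores, here is a small sketch. The `(node, socket, core)` tuple layout is an assumption for the example and not the actual topology type in xenopsd.

```python
from collections import defaultdict


def cores_per_node(cpus):
    """Count physical cores per NUMA node.

    `cpus` yields one (node, socket, core) tuple per logical CPU. Core ids
    repeat across sockets, so a core is identified by (socket, core)."""
    seen = defaultdict(set)
    for node, socket, core in cpus:
        seen[node].add((socket, core))
    return {node: len(cores) for node, cores in seen.items()}


# 2 sockets (one NUMA node each), 2 cores per socket, 2 threads per core.
cpus = [(s, s, c) for s in (0, 1) for c in (0, 1) for _thread in (0, 1)]
print(cores_per_node(cpus))                    # {0: 2, 1: 2}
print(len({core for _, _, core in cpus}))      # 2: core ids alone undercount the 4 cores
```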
@edwintorok edwintorok force-pushed the private/edvint/numafix branch from 5fb12ea to 4408030 Compare November 27, 2025 10:28
The planner explicitly looks at the NUMARequest fields and checks that they are
non-zero.
However if more fields get added in the future this leads to an assertion
failure, where the planner thinks it has found a solution, but NUMARequest.fits
returns false.

Ensure consistency: use `fits` in the planner to check that we've reached a
solution. If the remaining request doesn't fit into an empty node, then the
request is not empty.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
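
A minimal sketch of the consistency problem described in the commit message above, assuming a request with three fields. The field names and the `EMPTY_NODE` constant are illustrative; only `NUMARequest` and `fits` are names taken from that message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NUMARequest:
    memory: int
    vcpus: int
    cores: int = 0  # stands in for a field added after the planner was written


def fits(requested: NUMARequest, available: NUMARequest) -> bool:
    """True if every requested resource is covered by what is available."""
    return (requested.memory <= available.memory
            and requested.vcpus <= available.vcpus
            and requested.cores <= available.cores)


EMPTY_NODE = NUMARequest(memory=0, vcpus=0, cores=0)


def done_by_fields(remaining: NUMARequest) -> bool:
    # Field-by-field check: silently ignores any field added later.
    return remaining.memory == 0 and remaining.vcpus == 0


def done_by_fits(remaining: NUMARequest) -> bool:
    # If the remainder fits into an empty node, nothing is left to place.
    return fits(remaining, EMPTY_NODE)


remaining = NUMARequest(memory=0, vcpus=0, cores=4)
print(done_by_fields(remaining))  # True: planner thinks it has a solution
print(done_by_fits(remaining))    # False: 4 cores are still unplaced
```

Using `fits` for the termination check means the planner and the final validation cannot disagree when a new field is introduced.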
The requested number of cores is still 0, so no functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
…io_mem_only

The current NUMA policy prioritizes reducing cross-NUMA node memory traffic by
picking the smallest set of NUMA nodes that fit a VM.
It doesn't look at how this affects CPU overload within a NUMA node, or whether
the local bandwidth of each NUMA node is balanced or not.

Give this policy an explicit name, `Prio_mem_only`, and when the "compat" setting
in `xenopsd.conf` is used (`numa-placement=true`), then explicitly use this
policy instead of Best-effort.

Currently Best-effort is still equivalent to this policy, but that'll change in
a follow-up commit.
Introduce a new xenopsd.conf entry `numa-best-effort-prio-mem-only`,
which can be used to explicitly revert best effort to the current policy.
(currently this is a no-op, because there is only one best-effort policy).

Future policies should also look at CPU overload.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
NUMA optimized placement can have a large performance hit on machines with
small NUMA nodes and VMs with a large number of vCPUs.
For example a machine that has 2 sockets, which can run at most 32 vCPUs in a
single socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid the
hyperthread siblings, e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA node.
If that NUMA node doesn't have enough cores, then Xen will have no choice but
to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the same
time you get a big performance hit, depending on the workload.

Avoid this by "requesting" cores=vcpus for each VM,
which will make the placement algorithm choose the next size up in terms of
NUMA nodes (i.e. instead of 1 NUMA node, use 2,3 as needed, falling back to using
all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized placement
(~20% on Intel Memory Latency Checker: Idle latency) is outweighed by
the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and
OpenVINO), so this is the correct trade-off.

If the NUMA node is large enough, or if the VMs have a small number of vCPUs
then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by running
`openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that has 2 NUMA
nodes, with 32 PCPUs each, and 2 threads per core.

This introduces a policy that can control whether we want to filter out
NUMA nodes with too few cores.

Although we want to enable this filter by default, we still want
an "escape hatch" to turn it off if we find problems with it.
That is why the "compat" setting (numa-placement=true) in xenopsd.conf
reverts back to the old policy, which is now named explicitly as Prio_mem_only.

There could still be workloads where optimizing for memory bandwidth makes more
sense (although that is a property of the NUMA node, not of individual VMs),
so although it might be desirable for this to be a VM policy, it cannot,
because it affects other VMs too.

TODO: when sched-gran=core this should be turned off. That always has the
performance hit, so might as well use smaller NUMA nodes if available.

For now this isn't exposed yet as a XAPI-level policy, because that requires
more changes (to also sort by free cores on a node, and to also sort at the
pool level by free cpus on a host).
Once we have those changes we can introduce a new policy `prio_core_mem`
that sorts by free cores first, then by free memory, and requires cores>=vcpus
(i.e. cpus>=vcpus*threads_per_core) when choosing a node.

This changes the default to the new setting, which should be equal or an
improvement in the general case.
An "escape hatch" to revert to the previous behaviour is to set
`numa-placement=true` in xenopsd.conf, and the XAPI host-level policy to
'default_policy'.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/numafix branch from 4408030 to b097854 Compare November 27, 2025 10:46
@edwintorok
Contributor Author

I've tested this on a host and it produced the expected NUMA node assignment: with a 32 vCPU VM on a 2*32 CPU host with 2 NUMA nodes, it assigns the VM to all nodes. If I reduce the number of vCPUs to 16, it assigns it to just one.

I also ran the existing NUMA test suites, but they keep running into preexisting bugs (they're a bit too sensitive, and complain about a 10.2% imbalance when the threshold is 10%; or about running out of memory starting a VM when the host has 21.8 GiB free on a NUMA node, and the VM was 22 GiB, and so on). None of those seem to be related to my changes, because the VMs don't have enough vCPUs to hit the node limits on the machines it ran on.

@edwintorok edwintorok marked this pull request as ready for review November 27, 2025 17:19