Description
Motivation.
The current implementation of the auto mode (PR #17930) in VLLM_CPU_OMP_THREADS_BIND estimates thread binding based on the number of physical cores per NUMA node, using logic like:
psutil.cpu_count(logical=False) // numa_nodes
This approach ignores the logical CPUs exposed by Simultaneous Multithreading (SMT), which is standard on IBM POWER systems, where each core runs 2, 4, or 8 hardware threads.
As a result, auto mode often ends up using only a fraction of the available compute resources. For instance, on a 4-socket POWER10 system with 384 logical CPUs (96 per NUMA node), only 12 CPUs were bound per worker, leading to 2–3x lower throughput compared to manually binding all logical CPUs. Below are the lscpu output from that system and the resulting auto thread-binding log:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 12
Socket(s): 4
Virtualization features:
Hypervisor vendor: pHyp
Virtualization type: para
Caches (sum of all):
L1d: 3 MiB (96 instances)
L1i: 4.5 MiB (96 instances)
L2: 96 MiB (96 instances)
L3: 384 MiB (96 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-95
NUMA node1 CPU(s): 96-191
NUMA node2 CPU(s): 192-287
NUMA node3 CPU(s): 288-383
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Spectre v2: Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
Srbds: Not affected
Tsx async abort: Not affected
INFO 06-23 06:52:34 [cpu_worker.py:443] auto thread-binding list: 0,1,2,3,4,5,6,7,8,9,10,11
--
INFO 06-23 06:52:34 [cpu.py:69] Using Torch SDPA backend.
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP threads binding of Process 1755761:
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755761, core 0
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755782, core 1
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755783, core 2
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755784, core 3
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755785, core 4
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755786, core 5
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755787, core 6
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755788, core 7
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755789, core 8
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755790, core 9
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755791, core 10
INFO 06-23 06:52:34 [cpu_worker.py:226] OMP tid: 1755792, core 11
INFO 06-23 06:52:34 [cpu_worker.py:226]
INFO 06-23 06:52:34 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
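For illustration, here is a minimal sketch (not vLLM code; it only assumes the same psutil and py-libnuma packages that auto mode already relies on) that reproduces the undercount on a machine like the one above:

# Minimal sketch: compare the old per-node CPU estimate with the logical
# CPUs each NUMA node actually exposes. Requires psutil and py-libnuma.
import psutil
from numa import info

numa_nodes = info.get_num_configured_nodes()
physical_cores = psutil.cpu_count(logical=False)

# Old auto-binding estimate: physical cores spread across NUMA nodes.
# On the POWER10 box above: 48 physical cores // 4 nodes = 12 CPUs per worker.
old_estimate = physical_cores // numa_nodes

for node in range(numa_nodes):
    node_cpus = info.node_to_cpus(node)  # all logical CPUs, SMT siblings included
    # On the POWER10 box above this prints 96 actual logical CPUs per node
    # against an estimate of 12, i.e. 8x fewer threads bound than available.
    print(f"node {node}: old estimate={old_estimate}, "
          f"actual logical CPUs={len(node_cpus)}")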
We believe this warrants a rethink of how auto mode determines thread binding, especially in SMT-rich environments.
Proposed Change.
Below is the updated get_cpus_id_binding_based_on_numa_nodes function that we used to test the proposed change (the unchanged get_cache_block_size_bytes method above it is included only for context).
def get_cache_block_size_bytes(self) -> int:
    """Return the size in bytes of a single KV cache block."""
    return CPUCacheEngine.get_cache_block_size(
        self.cache_config.block_size, self.cache_config.cache_dtype,
        self.model_config, self.parallel_config)
# ---
# Patch Note: NUMA-Aware CPU Binding Logic
#
# Previous logic:
# - Used psutil.cpu_count(logical=False) to get the number of physical cores on the system.
# - Divided by the number of NUMA nodes to estimate CPUs per node.
# - This undercounted CPUs on SMT systems (e.g., Power10, x86 with hyperthreading),
# resulting in only one thread per physical core being used per worker.
#
# New logic (this patch):
# - Uses numa.info.node_to_cpus(node_id) to get the actual list of logical CPUs in the NUMA node.
# - Uses all logical CPUs available in the node for binding, fully utilizing SMT/hyperthreads.
# - This is correct for all modern architectures (Power, x86, ARM, etc.) and matches what OpenMP expects.
#
# Why this matters:
# - On SMT systems, the old logic severely underutilized hardware (e.g., 12 vs 96 threads per node on Power10).
# - The new logic ensures maximum parallelism and throughput, and is robust to any CPU topology.
#
# User override:
# - Users can always set VLLM_CPU_OMP_THREADS_BIND=all to skip binding and use all allowed CPUs.
# - Or specify a custom list (e.g., 0-95|96-191) for manual control.
#
# ---
def get_cpus_id_binding_based_on_numa_nodes(self) -> str:
    """Return CPUs id binding based on NUMA nodes, counting all logical
    CPUs (SMT/hyperthread siblings included) in each node."""
    rank_to_cpus = self.local_omp_cpuid
    # Setup OpenMP thread affinity based on NUMA nodes automatically
    world_size = self.vllm_config.parallel_config.world_size
    libnuma_found = util.find_spec("numa") is not None
    psutil_found = util.find_spec("psutil") is not None
    if libnuma_found and psutil_found:
        import psutil
        from numa import info
        cpus_allow_list = psutil.Process().cpu_affinity()
        numa_size = info.get_num_configured_nodes()
        # Collect, for each NUMA node, the logical CPUs that are both in
        # the node and in this process's allowed affinity mask.
        node_to_cpus = []
        for i in range(numa_size):
            node_intersect = set(
                info.node_to_cpus(i)).intersection(cpus_allow_list)
            if node_intersect:
                node_to_cpus.append(sorted(node_intersect))
        if world_size > len(node_to_cpus):
            logger.error(
                "Auto thread-binding failed due to "
                "world size: %d is larger than "
                "allowed NUMA nodes number: %d. "
                "Please try to bind threads manually.", world_size,
                len(node_to_cpus))
        else:
            # FIX: use the actual number of logical CPUs in this NUMA node
            node_cpus_this_rank = node_to_cpus[self.rank]
            cpu_count_per_numa = len(node_cpus_this_rank)
            num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU,
                                      cpu_count_per_numa // 2)
            end = cpu_count_per_numa - num_of_reserved_cpu
            rank_to_cpus_list = node_cpus_this_rank[:end]
            rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list)
            logger.info("auto thread-binding list: %s", rank_to_cpus)
    else:
        logger.warning(
            "Auto thread-binding is not supported due to "
            "the lack of package numa and psutil, "
            "fallback to no thread-binding. To get better performance, "
            "please try to manually bind threads.")
    return rank_to_cpus
- Switches from the physical core count to the actual logical CPU list per NUMA node, obtained via numa.info.node_to_cpus.
- Honors CPU affinity by intersecting with psutil.Process().cpu_affinity().
- Uses all available logical CPUs in the NUMA node, minus a small configurable reserve (VLLM_CPU_NUM_OF_RESERVED_CPU).
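For reviewers who want to exercise this selection logic outside of vLLM, here is a standalone sketch of the same idea. The function name pick_cpus_for_rank and its parameters are illustrative only (not vLLM APIs); the sketch assumes psutil and py-libnuma are installed.

# Standalone sketch of the proposed per-rank CPU selection.
# Illustrative names only, not vLLM APIs; requires psutil and py-libnuma.
import psutil
from numa import info


def pick_cpus_for_rank(rank: int, num_reserved_cpu: int = 0) -> list[int]:
    """Return the logical CPUs of the rank-th allowed NUMA node,
    holding back a small reserve, mirroring the patch above."""
    allowed = set(psutil.Process().cpu_affinity())
    node_to_cpus = []
    for node in range(info.get_num_configured_nodes()):
        usable = sorted(set(info.node_to_cpus(node)) & allowed)
        if usable:
            node_to_cpus.append(usable)

    if rank >= len(node_to_cpus):
        raise ValueError("more ranks than allowed NUMA nodes; bind manually")

    node_cpus = node_to_cpus[rank]
    reserve = min(num_reserved_cpu, len(node_cpus) // 2)
    return node_cpus[:len(node_cpus) - reserve]


if __name__ == "__main__":
    # On the POWER10 box above, rank 0 would get CPUs 0-95 (minus any
    # reserve) instead of the 12 CPUs the physical-core estimate yields.
    print(pick_cpus_for_rank(rank=0))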
We have tested this on IBM POWER10 with SMT-8, and it gives performance very close to VLLM_CPU_OMP_THREADS_BIND=all.
Comments / Feedback Requested:
- Although this logic has been tested on IBM POWER, I am not sure how it could be made generic so that other architectures are also honoured.
- Should it be conditionally enabled only for IBM POWER?
- Is there a need for additional logic or checks to make this SMT/hyperthreading-friendly for every architecture?
Feedback Period.
No response
CC List.
@simon-mo @DarkLight1337 @louie-tsai @bigPYJ1151 @askervin
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.