[RFC]: Re-evaluating auto Thread Binding in VLLM_CPU_OMP_THREADS_BIND for SMT Architectures #20089

Open
@Akashcodes732

Description

Motivation.

The current implementation of the auto mode (PR #17930) in VLLM_CPU_OMP_THREADS_BIND estimates thread binding from the number of physical cores per NUMA node, using logic like:

psutil.cpu_count(logical=False) // numa_nodes

This approach ignores logical CPUs exposed via Simultaneous Multithreading (SMT), which is standard on IBM POWER systems, where each core runs 2, 4, or 8 hardware threads.

As a result, auto mode often uses only a fraction of the available compute resources. For instance, on a 4-socket POWER10 system with 384 logical CPUs (96 per NUMA node), only 12 CPUs were bound per worker, leading to 2-3x lower throughput compared to manually binding all logical CPUs.
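To make the undercount concrete, here is a small arithmetic sketch of the topology described above (numbers taken from the lscpu output below; the per-worker estimate mirrors the old psutil-based formula rather than calling psutil itself):

```python
# Topology of the 4-socket POWER10 example.
sockets = 4
cores_per_socket = 12
threads_per_core = 8          # SMT-8
numa_nodes = 4

physical_cores = sockets * cores_per_socket        # 48
logical_cpus = physical_cores * threads_per_core   # 384

# Old auto-binding estimate: physical cores // NUMA nodes.
old_cpus_per_worker = physical_cores // numa_nodes   # 12

# Logical CPUs actually available per NUMA node.
actual_cpus_per_node = logical_cpus // numa_nodes    # 96

print(old_cpus_per_worker, actual_cpus_per_node)     # 12 96
```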

Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   384
  On-line CPU(s) list:    0-383
Model name:               POWER10 (architected), altivec supported
  Model:                  2.0 (pvr 0080 0200)
  Thread(s) per core:     8
  Core(s) per socket:     12
  Socket(s):              4
Virtualization features:  
  Hypervisor vendor:      pHyp
  Virtualization type:    para
Caches (sum of all):      
  L1d:                    3 MiB (96 instances)
  L1i:                    4.5 MiB (96 instances)
  L2:                     96 MiB (96 instances)
  L3:                     384 MiB (96 instances)
NUMA:                     
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-95
  NUMA node1 CPU(s):      96-191
  NUMA node2 CPU(s):      192-287
  NUMA node3 CPU(s):      288-383
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Not affected
  Spectre v1:             Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                  Not affected
  Tsx async abort:        Not affected
INFO   06-23 06:52:34 [cpu_worker.py:443] auto thread-binding list:   0,1,2,3,4,5,6,7,8,9,10,11
--
INFO 06-23 06:52:34   [cpu.py:69] Using Torch SDPA backend.
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP threads binding of Process 1755761:
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755761, core 0
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755782, core 1
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755783, core 2
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755784, core 3
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755785, core 4
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755786, core 5
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755787, core 6
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755788, core 7
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755789, core 8
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755790, core 9
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755791, core 10
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755792, core 11
INFO 06-23 06:52:34   [cpu_worker.py:226]
INFO 06-23 06:52:34   [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP   rank 0, TP rank 0, EP rank 0

We believe this warrants a rethink of how auto mode determines thread binding, especially in SMT-rich environments.

Proposed Change.

Below is the updated get_cpus_id_binding_based_on_numa_nodes function we used to test the proposed changes.

    def get_cache_block_size_bytes(self) -> int:
        """Return the size in bytes of a single KV cache block.
        """
        return CPUCacheEngine.get_cache_block_size(
            self.cache_config.block_size, self.cache_config.cache_dtype,
            self.model_config, self.parallel_config)

    # ---
    # Patch Note: NUMA-Aware CPU Binding Logic
    #
    # Previous logic:
    #   - Used psutil.cpu_count(logical=False) to get the number of physical cores on the system.
    #   - Divided by the number of NUMA nodes to estimate CPUs per node.
    #   - This undercounted CPUs on SMT systems (e.g., Power10, x86 with hyperthreading),
    #     resulting in only one thread per physical core being used per worker.
    #
    # New logic (this patch):
    #   - Uses numa.info.node_to_cpus(node_id) to get the actual list of logical CPUs in the NUMA node.
    #   - Uses all logical CPUs available in the node for binding, fully utilizing SMT/hyperthreads.
    #   - This is correct for all modern architectures (Power, x86, ARM, etc.) and matches what OpenMP expects.
    #
    # Why this matters:
    #   - On SMT systems, the old logic severely underutilized hardware (e.g., 12 vs 96 threads per node on Power10).
    #   - The new logic ensures maximum parallelism and throughput, and is robust to any CPU topology.
    #
    # User override:
    #   - Users can always set VLLM_CPU_OMP_THREADS_BIND=all to skip binding and use all allowed CPUs.
    #   - Or specify a custom list (e.g., 0-95|96-191) for manual control.
    # 
    # ---
    def get_cpus_id_binding_based_on_numa_nodes(self) -> str:
        """Return CPUs id binding based on NUMA nodes, with debug prints and correct logical CPU counting."""
        rank_to_cpus = self.local_omp_cpuid
        # Setup OpenMP thread affinity based on NUMA nodes automatically
        world_size = self.vllm_config.parallel_config.world_size
        libnuma_found = util.find_spec("numa") is not None
        psutil_found = util.find_spec("psutil") is not None
        if libnuma_found and psutil_found:
            import psutil
            from numa import info
            cpus_allow_list = psutil.Process().cpu_affinity()
            numa_size = info.get_num_configured_nodes()

            # Restrict each node's CPU list to the process's allowed CPUs.
            node_to_cpus = []
            for i in range(numa_size):
                node_intersect = set(
                    info.node_to_cpus(i)).intersection(cpus_allow_list)
                if node_intersect:
                    node_to_cpus.append(sorted(node_intersect))

            if world_size > len(node_to_cpus):
                logger.error(
                    "Auto thread-binding failed due to "
                    "world size: %d is larger than "
                    "allowed NUMA nodes number: %d. "
                    "Please try to bind threads manually.", world_size,
                    len(node_to_cpus))
            else:
                # FIX: Use the actual number of logical CPUs in the NUMA node
                node_cpus_this_rank = node_to_cpus[self.rank]
                cpu_count_per_numa = len(node_cpus_this_rank)
                num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU,
                                          cpu_count_per_numa // 2)
                end = cpu_count_per_numa - num_of_reserved_cpu
                rank_to_cpus_list = node_cpus_this_rank[:end]
                rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list)
                logger.info("auto thread-binding list: %s", rank_to_cpus)
        else:
            logger.warning(
                "Auto thread-binding is not supported due to "
                "the lack of package numa and psutil, "
                "fallback to no thread-binding. To get better performance, "
                "please try to manually bind threads.")
        return rank_to_cpus
  • Switches from physical core count to actual logical CPU list per NUMA node using numa.info.node_to_cpus.
  • Honors CPU affinity by intersecting with psutil.Process().cpu_affinity().
  • Uses all available logical CPUs in the NUMA node, minus a small reserve (configurable).
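For readers without a libnuma environment, the core of the proposed selection can be sketched in plain Python by simulating the per-node CPU lists that numa.info.node_to_cpus would return. The node layouts, the pick_binding helper name, and the reserved-CPU count here are illustrative, not vLLM defaults:

```python
def pick_binding(node_to_cpus_raw, allow_list, rank, world_size,
                 num_reserved_cpu=0):
    """Sketch of the proposed NUMA-aware binding selection.

    node_to_cpus_raw: list of logical-CPU lists, one per NUMA node
    allow_list: CPUs this process may run on (cf. psutil cpu_affinity)
    """
    # Keep only nodes that intersect the allowed-CPU set.
    node_to_cpus = []
    for cpus in node_to_cpus_raw:
        intersect = set(cpus) & set(allow_list)
        if intersect:
            node_to_cpus.append(sorted(intersect))

    if world_size > len(node_to_cpus):
        raise ValueError("world size exceeds usable NUMA nodes")

    # Use all logical CPUs of this rank's node, minus a small reserve.
    node_cpus = node_to_cpus[rank]
    reserve = min(num_reserved_cpu, len(node_cpus) // 2)
    return ",".join(str(c) for c in node_cpus[:len(node_cpus) - reserve])

# Two SMT-4 nodes of 8 logical CPUs each; bind rank 0, reserving 1 CPU.
nodes = [list(range(0, 8)), list(range(8, 16))]
print(pick_binding(nodes, list(range(16)), rank=0, world_size=2,
                   num_reserved_cpu=1))   # 0,1,2,3,4,5,6
```

Unlike the old formula, the CPU count falls out of the node's actual logical-CPU list, so SMT-2/4/8 systems are handled without any architecture-specific branches.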

We have tested this on IBM POWER10 with SMT-8, where it delivers performance very close to VLLM_CPU_OMP_THREADS_BIND=all.

Comments / Feedback Requested:

  1. Although this logic has been tested on IBM POWER, I am not sure how to make it generic so that other architectures are also honoured.
  2. Should it be conditionally enabled for IBM POWER only?
  3. Is there a need for additional logic and checks to make this SMT/hyperthreading friendly on every architecture?

Feedback Period.

No response

CC List.

@simon-mo @DarkLight1337 @louie-tsai @bigPYJ1151 @askervin

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
