[RFC]: Re-evaluating auto Thread Binding in VLLM_CPU_OMP_THREADS_BIND for SMT Architectures #20089

Open
@Akashcodes732

Description

Motivation.

The current implementation of the auto mode (PR #17930) in VLLM_CPU_OMP_THREADS_BIND estimates thread binding from the number of physical cores per NUMA node, using logic like:

psutil.cpu_count(logical=False) // numa_nodes

This approach ignores logical CPUs exposed via Simultaneous Multithreading (SMT), which is standard on IBM POWER systems, where each core runs 2, 4, or 8 hardware threads.

As a result, auto mode often uses only a fraction of the available compute resources. For instance, on a 4-socket POWER10 system with 384 logical CPUs (96 per NUMA node), only 12 CPUs were bound per worker, leading to 2-3x lower throughput compared to manually binding all logical CPUs.
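To make the undercount concrete, here is a small arithmetic sketch of the topology described above (numbers taken from the lscpu output below; the per-worker estimate mirrors the old psutil-based formula rather than calling psutil itself):

```python
# Topology of the 4-socket POWER10 example.
sockets = 4
cores_per_socket = 12
threads_per_core = 8          # SMT-8
numa_nodes = 4

physical_cores = sockets * cores_per_socket        # 48
logical_cpus = physical_cores * threads_per_core   # 384

# Old auto-binding estimate: physical cores // NUMA nodes.
old_cpus_per_worker = physical_cores // numa_nodes   # 12

# Logical CPUs actually available per NUMA node.
actual_cpus_per_node = logical_cpus // numa_nodes    # 96

print(old_cpus_per_worker, actual_cpus_per_node)     # 12 96
```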

Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   384
  On-line CPU(s) list:    0-383
Model name:               POWER10 (architected), altivec supported
  Model:                  2.0 (pvr 0080 0200)
  Thread(s) per core:     8
  Core(s) per socket:     12
  Socket(s):              4
Virtualization features:  
  Hypervisor vendor:      pHyp
  Virtualization type:    para
Caches (sum of all):      
  L1d:                    3 MiB (96 instances)
  L1i:                    4.5 MiB (96 instances)
  L2:                     96 MiB (96 instances)
  L3:                     384 MiB (96 instances)
NUMA:                     
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-95
  NUMA node1 CPU(s):      96-191
  NUMA node2 CPU(s):      192-287
  NUMA node3 CPU(s):      288-383
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Not affected
  Spectre v1:             Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                  Not affected
  Tsx async abort:        Not affected
INFO   06-23 06:52:34 [cpu_worker.py:443] auto thread-binding list:   0,1,2,3,4,5,6,7,8,9,10,11
--
INFO 06-23 06:52:34   [cpu.py:69] Using Torch SDPA backend.
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP threads binding of Process 1755761:
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755761, core 0
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755782, core 1
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755783, core 2
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755784, core 3
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755785, core 4
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755786, core 5
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755787, core 6
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755788, core 7
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755789, core 8
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755790, core 9
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755791, core 10
INFO 06-23 06:52:34   [cpu_worker.py:226] OMP tid: 1755792, core 11
INFO 06-23 06:52:34   [cpu_worker.py:226]
INFO 06-23 06:52:34   [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP   rank 0, TP rank 0, EP rank 0

We believe this warrants a rethink of how auto mode determines thread binding, especially in SMT-rich environments.

Proposed Change.

Below is the updated get_cpus_id_binding_based_on_numa_nodes function we used to test the proposed changes.

    def get_cache_block_size_bytes(self) -> int:
        """Return the size in bytes of a single KV cache block.
        """
        return CPUCacheEngine.get_cache_block_size(
            self.cache_config.block_size, self.cache_config.cache_dtype,
            self.model_config, self.parallel_config)

    # ---
    # Patch Note: NUMA-Aware CPU Binding Logic
    #
    # Previous logic:
    #   - Used psutil.cpu_count(logical=False) to get the number of physical cores on the system.
    #   - Divided by the number of NUMA nodes to estimate CPUs per node.
    #   - This undercounted CPUs on SMT systems (e.g., Power10, x86 with hyperthreading),
    #     resulting in only one thread per physical core being used per worker.
    #
    # New logic (this patch):
    #   - Uses numa.info.node_to_cpus(node_id) to get the actual list of logical CPUs in the NUMA node.
    #   - Uses all logical CPUs available in the node for binding, fully utilizing SMT/hyperthreads.
    #   - This is correct for all modern architectures (Power, x86, ARM, etc.) and matches what OpenMP expects.
    #
    # Why this matters:
    #   - On SMT systems, the old logic severely underutilized hardware (e.g., 12 vs 96 threads per node on Power10).
    #   - The new logic ensures maximum parallelism and throughput, and is robust to any CPU topology.
    #
    # User override:
    #   - Users can always set VLLM_CPU_OMP_THREADS_BIND=all to skip binding and use all allowed CPUs.
    #   - Or specify a custom list (e.g., 0-95|96-191) for manual control.
    # 
    # ---
    def get_cpus_id_binding_based_on_numa_nodes(self) -> str:
        """Return CPUs id binding based on NUMA nodes, with debug prints and correct logical CPU counting."""
        rank_to_cpus = self.local_omp_cpuid
        # Setup OpenMP thread affinity based on NUMA nodes automatically
        world_size = self.vllm_config.parallel_config.world_size
        libnuma_found = util.find_spec("numa") is not None
        psutil_found = util.find_spec("psutil") is not None
        if libnuma_found and psutil_found:
            import psutil
            from numa import info
            cpus_allow_list = psutil.Process().cpu_affinity()
            numa_size = info.get_num_configured_nodes()

            # Restrict each node's CPU list to the process's allowed CPUs.
            node_to_cpus = []
            for i in range(numa_size):
                node_intersect = set(
                    info.node_to_cpus(i)).intersection(cpus_allow_list)
                if node_intersect:
                    node_to_cpus.append(sorted(node_intersect))

            if world_size > len(node_to_cpus):
                logger.error(
                    "Auto thread-binding failed due to "
                    "world size: %d is larger than "
                    "allowed NUMA nodes number: %d. "
                    "Please try to bind threads manually.", world_size,
                    len(node_to_cpus))
            else:
                # FIX: Use the actual number of logical CPUs in the NUMA node
                node_cpus_this_rank = node_to_cpus[self.rank]
                cpu_count_per_numa = len(node_cpus_this_rank)
                num_of_reserved_cpu = min(envs.VLLM_CPU_NUM_OF_RESERVED_CPU,
                                          cpu_count_per_numa // 2)
                end = cpu_count_per_numa - num_of_reserved_cpu
                rank_to_cpus_list = node_cpus_this_rank[:end]
                rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list)
                logger.info("auto thread-binding list: %s", rank_to_cpus)
        else:
            logger.warning(
                "Auto thread-binding is not supported due to "
                "the lack of package numa and psutil, "
                "fallback to no thread-binding. To get better performance, "
                "please try to manually bind threads.")
        return rank_to_cpus
  • Switches from physical core count to actual logical CPU list per NUMA node using numa.info.node_to_cpus.
  • Honors CPU affinity by intersecting with psutil.Process().cpu_affinity().
  • Uses all available logical CPUs in the NUMA node, minus a small reserve (configurable).
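For readers without a libnuma environment, the core of the proposed selection can be sketched in plain Python by simulating the per-node CPU lists that numa.info.node_to_cpus would return. The node layouts, the pick_binding helper name, and the reserved-CPU count here are illustrative, not vLLM defaults:

```python
def pick_binding(node_to_cpus_raw, allow_list, rank, world_size,
                 num_reserved_cpu=0):
    """Sketch of the proposed NUMA-aware binding selection.

    node_to_cpus_raw: list of logical-CPU lists, one per NUMA node
    allow_list: CPUs this process may run on (cf. psutil cpu_affinity)
    """
    # Keep only nodes that intersect the allowed-CPU set.
    node_to_cpus = []
    for cpus in node_to_cpus_raw:
        intersect = set(cpus) & set(allow_list)
        if intersect:
            node_to_cpus.append(sorted(intersect))

    if world_size > len(node_to_cpus):
        raise ValueError("world size exceeds usable NUMA nodes")

    # Use all logical CPUs of this rank's node, minus a small reserve.
    node_cpus = node_to_cpus[rank]
    reserve = min(num_reserved_cpu, len(node_cpus) // 2)
    return ",".join(str(c) for c in node_cpus[:len(node_cpus) - reserve])

# Two SMT-4 nodes of 8 logical CPUs each; bind rank 0, reserving 1 CPU.
nodes = [list(range(0, 8)), list(range(8, 16))]
print(pick_binding(nodes, list(range(16)), rank=0, world_size=2,
                   num_reserved_cpu=1))   # 0,1,2,3,4,5,6
```

Unlike the old formula, the CPU count falls out of the node's actual logical-CPU list, so SMT-2/4/8 systems are handled without any architecture-specific branches.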

We have tested this on IBM POWER10 with SMT-8, where it delivers performance very close to VLLM_CPU_OMP_THREADS_BIND=all.

Comments / Feedback Requested:

  1. Although this logic has been tested on IBM POWER, I am not sure how to make it generic so that other architectures are also honoured.
  2. Should it be conditionally enabled for IBM POWER only?
  3. Is there a need for additional logic and checks to make this SMT/hyperthreading friendly on every architecture?

Feedback Period.

No response

CC List.

@simon-mo @DarkLight1337 @louie-tsai @bigPYJ1151 @askervin

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
