@yewentao256 commented Aug 21, 2025

Purpose

Only Print Profiler Results on Rank 0
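
A minimal sketch of the intent, not the actual diff in this PR: the function name `print_profiler_table_on_rank0` and the `rank` and `print_fn` parameters are hypothetical and chosen for illustration only. It assumes each engine process knows its own data-parallel rank as an integer.

```python
from typing import Callable


def print_profiler_table_on_rank0(
    table: str,
    rank: int,
    print_fn: Callable[[str], None] = print,
) -> bool:
    """Emit the profiler summary table only from rank 0.

    All names here are hypothetical. `table` stands for the summary text
    a profiler produced for this process, and `rank` is this process's
    data-parallel rank.
    Returns True if the table was printed, False if it was skipped.
    """
    if rank != 0:
        # Other ranks skip printing, so the log holds a single table.
        return False
    print_fn(table)
    return True


if __name__ == "__main__":
    # Simulate eight DP ranks: only rank 0 emits its table.
    for r in range(8):
        print_profiler_table_on_rank0(f"profiler table from rank {r}", rank=r)
```

Gating on the rank keeps one representative table in the log. The other ranks still run the profiler, they just do not print the summary.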

Currently every DP rank prints its full profiler table, which floods the logs with near-identical output, e.g.:

(EngineCore_7 pid=1389347)                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
(EngineCore_7 pid=1389347) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_7 pid=1389347) void deep_ep::internode_ll::dispatch<true, false, 71...         0.00%       0.000us         0.00%       0.000us       0.000us     850.380ms        33.33%     850.380ms     418.906us          2030  
(EngineCore_7 pid=1389347)                                       vllm::moe_forward        12.14%     274.842ms        23.24%     526.391ms       1.297ms     734.719ms        28.80%     817.628ms       2.014ms           406  
(EngineCore_7 pid=1389347) void deep_ep::internode_ll::combine<false, 7168, 9>(...         0.00%       0.000us         0.00%       0.000us       0.000us     474.299ms        18.59%     474.299ms     233.645us          2030  
(EngineCore_7 pid=1389347) void at::native::(anonymous namespace)::indexSelectL...         0.00%       0.000us         0.00%       0.000us       0.000us     209.002ms         8.19%     209.002ms      51.478us          4060  
(EngineCore_7 pid=1389347) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us     168.010ms         6.59%     168.010ms      24.898us          6748  
(EngineCore_7 pid=1389347)                           _silu_mul_fp8_quant_deep_gemm         0.00%       0.000us         0.00%       0.000us       0.000us      75.421ms         2.96%      75.421ms      37.153us          2030  
(EngineCore_7 pid=1389347) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us      75.404ms         2.96%      75.404ms      42.078us          1792  
(EngineCore_7 pid=1389347) void deep_gemm::sm100_fp8_gemm_1d1d_impl<(cute::UMMA...         0.00%       0.000us         0.00%       0.000us       0.000us      74.084ms         2.90%      74.084ms      36.495us          2030  
(EngineCore_7 pid=1389347) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      52.981ms         2.08%      52.981ms      13.049us          4060  
(EngineCore_7 pid=1389347) void cutlass::device_kernel<cutlass::fmha::kernel::S...         0.00%       0.000us         0.00%       0.000us       0.000us      46.967ms         1.84%      46.967ms      27.498us          1708  
(EngineCore_7 pid=1389347) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      36.980ms         1.45%      36.980ms       9.108us          4060  
(EngineCore_7 pid=1389347)                                   _C::cutlass_scaled_mm         0.52%      11.794ms         1.15%      26.127ms      12.237us      35.828ms         1.40%      35.828ms      16.781us          2135  
(EngineCore_7 pid=1389347) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us      35.828ms         1.40%      35.828ms      16.781us          2135  
(EngineCore_7 pid=1389347) void per_token_group_quant_8bit_kernel<c10::BFloat16...         0.00%       0.000us         0.00%       0.000us       0.000us      33.460ms         1.31%      33.460ms       3.918us          8540  
(EngineCore_7 pid=1389347) void deep_gemm::sm100_fp8_gemm_1d1d_impl<(cute::UMMA...         0.00%       0.000us         0.00%       0.000us       0.000us      33.336ms         1.31%      33.336ms      16.422us          2030  
(EngineCore_7 pid=1389347)                                      aten::index_select         0.22%       4.965ms         0.46%      10.363ms      12.546us      32.687ms         1.28%      32.687ms      39.573us           826  
(EngineCore_7 pid=1389347) void at::native::sbtopk::gatherTopK<c10::BFloat16, u...         0.00%       0.000us         0.00%       0.000us       0.000us      30.652ms         1.20%      30.652ms       5.807us          5278  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      29.550ms         1.16%      29.550ms       7.278us          4060  
(EngineCore_7 pid=1389347)                                             aten::copy_         0.90%      20.312ms        48.84%        1.106s     131.000us      24.063ms         0.94%      24.064ms       2.850us          8442  
(EngineCore_7 pid=1389347) void deep_gemm::transpose_and_pack_fp32_into_ue8m0<5...         0.00%       0.000us         0.00%       0.000us       0.000us      21.320ms         0.84%      21.320ms      10.502us          2030  
(EngineCore_7 pid=1389347) void deep_gemm::transpose_and_pack_fp32_into_ue8m0<5...         0.00%       0.000us         0.00%       0.000us       0.000us      20.630ms         0.81%      20.630ms      10.163us          2030  
(EngineCore_7 pid=1389347) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.00%       0.000us         0.00%       0.000us       0.000us      17.816ms         0.70%      17.816ms       8.777us          2030  
(EngineCore_7 pid=1389347) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      16.658ms         0.65%      16.658ms       3.059us          5446  
(EngineCore_7 pid=1389347) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      16.014ms         0.63%      16.014ms       3.944us          4060  
(EngineCore_7 pid=1389347)                       nvjet_tst_8x64_64x16_4x1_v_bz_TNN         0.00%       0.000us         0.00%       0.000us       0.000us      13.358ms         0.52%      13.358ms      10.967us          1218  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x64_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      13.092ms         0.51%      13.092ms      15.331us           854  
(EngineCore_7 pid=1389347)                                           memcpy32_post         0.00%       0.000us         0.00%       0.000us       0.000us      12.766ms         0.50%      12.766ms       1.860us          6862  
(EngineCore_7 pid=1389347) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.00%       0.000us         0.00%       0.000us       0.000us      11.460ms         0.45%      11.460ms       5.645us          2030  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us      10.686ms         0.42%      10.686ms       2.632us          4060  
(EngineCore_7 pid=1389347)                         triton_poi_fused_add_copy_mul_4         0.00%       0.000us         0.00%       0.000us       0.000us      10.460ms         0.41%      10.460ms       4.981us          2100  
(EngineCore_7 pid=1389347) void at::native::bitonicSortKVInPlace<2, -1, 16, 16,...         0.00%       0.000us         0.00%       0.000us       0.000us      10.229ms         0.40%      10.229ms       5.039us          2030  
(EngineCore_7 pid=1389347)                                              aten::topk         0.47%      10.553ms         0.84%      19.126ms      15.703us       9.548ms         0.37%       9.548ms       7.839us          1218  
(EngineCore_7 pid=1389347) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       9.318ms         0.37%       9.318ms       2.295us          4060  
(EngineCore_7 pid=1389347)                                                aten::mm         0.88%      20.023ms         6.82%     154.494ms     367.843us       9.216ms         0.36%       9.862ms      23.481us           420  
(EngineCore_7 pid=1389347)                       nvjet_tst_64x8_64x16_2x4_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       8.738ms         0.34%       8.738ms      10.761us           812  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.178ms         0.28%       7.178ms       1.768us          4060  
(EngineCore_7 pid=1389347)                                               memcpy128         0.00%       0.000us         0.00%       0.000us       0.000us       7.114ms         0.28%       7.114ms       2.333us          3050  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x64_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       6.221ms         0.24%       6.221ms       7.284us           854  
(EngineCore_7 pid=1389347) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us       6.201ms         0.24%       6.201ms       3.055us          2030  
(EngineCore_7 pid=1389347) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.670ms         0.22%       5.670ms       2.793us          2030  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x48_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       5.453ms         0.21%       5.453ms      14.899us           366  
(EngineCore_7 pid=1389347)                           _C::per_token_group_fp8_quant         0.15%       3.464ms         0.54%      12.225ms       5.726us       5.276ms         0.21%       5.276ms       2.471us          2135  
(EngineCore_7 pid=1389347) void per_token_group_quant_8bit_kernel<c10::BFloat16...         0.00%       0.000us         0.00%       0.000us       0.000us       5.276ms         0.21%       5.276ms       2.471us          2135  
(EngineCore_7 pid=1389347) void (anonymous namespace)::elementwise_kernel_with_...         0.00%       0.000us         0.00%       0.000us       0.000us       4.929ms         0.19%       4.929ms       1.214us          4060  
(EngineCore_7 pid=1389347) void at::native::sbtopk::gatherTopK<c10::BFloat16, u...         0.00%       0.000us         0.00%       0.000us       0.000us       4.894ms         0.19%       4.894ms       6.027us           812  
(EngineCore_7 pid=1389347) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.770ms         0.19%       4.770ms       2.937us          1624  
(EngineCore_7 pid=1389347) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.677ms         0.18%       4.677ms       2.288us          2044  
(EngineCore_7 pid=1389347)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3         0.00%       0.000us         0.00%       0.000us       0.000us       4.368ms         0.17%       4.368ms       2.080us          2100  
(EngineCore_7 pid=1389347) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.229ms         0.17%       4.229ms       2.083us          2030  
(EngineCore_7 pid=1389347) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us       4.181ms         0.16%       4.181ms       2.060us          2030  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.942ms         0.15%       3.942ms       1.942us          2030  
(EngineCore_7 pid=1389347) void vllm::concat_and_cache_mla_kernel<__nv_bfloat16...         0.00%       0.000us         0.00%       0.000us       0.000us       3.891ms         0.15%       3.891ms       2.278us          1708  
(EngineCore_7 pid=1389347) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       3.874ms         0.15%       3.874ms       1.895us          2044  
(EngineCore_7 pid=1389347)                             triton_poi_fused_mul_silu_1         0.00%       0.000us         0.00%       0.000us       0.000us       3.820ms         0.15%       3.820ms       1.789us          2135  
(EngineCore_7 pid=1389347) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.27%       6.177ms         0.34%       7.714ms      19.000us       3.717ms         0.15%       3.717ms       9.154us           406  
(EngineCore_7 pid=1389347)      triton_per_fused__to_copy_add_mean_mul_pow_rsqrt_5         0.00%       0.000us         0.00%       0.000us       0.000us       3.356ms         0.13%       3.356ms       1.598us          2100  
(EngineCore_7 pid=1389347)                               aten::bitwise_right_shift         0.32%       7.170ms         0.47%      10.667ms      13.137us       3.269ms         0.13%       3.269ms       4.025us           812  
(EngineCore_7 pid=1389347)                              triton_poi_fused_add_mul_6         0.00%       0.000us         0.00%       0.000us       0.000us       3.262ms         0.13%       3.262ms       1.553us          2100  
(EngineCore_7 pid=1389347) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       3.232ms         0.13%       3.232ms       1.581us          2044  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.877ms         0.11%       2.877ms       1.417us          2030  
(EngineCore_7 pid=1389347)                                triton_poi_fused_zeros_7         0.00%       0.000us         0.00%       0.000us       0.000us       2.846ms         0.11%       2.846ms       1.355us          2100  
(EngineCore_7 pid=1389347)                                             aten::fill_         0.24%       5.455ms         0.50%      11.244ms       6.107us       2.762ms         0.11%       2.762ms       1.501us          1841  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.715ms         0.11%       2.715ms       1.338us          2030  
(EngineCore_7 pid=1389347)                                               aten::sum         0.28%       6.233ms         0.41%       9.219ms      11.353us       2.604ms         0.10%       2.604ms       3.207us           812  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.602ms         0.10%       2.602ms       1.059us          2457  
(EngineCore_7 pid=1389347) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.21%       4.806ms         0.28%       6.280ms      15.468us       2.592ms         0.10%       2.592ms       6.383us           406  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x48_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       2.550ms         0.10%       2.550ms       6.967us           366  
(EngineCore_7 pid=1389347)                          Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.421ms         0.09%       2.421ms       1.491us          1624  
(EngineCore_7 pid=1389347)                                      aten::floor_divide         0.12%       2.794ms         0.24%       5.471ms       6.737us       2.238ms         0.09%       2.238ms       2.756us           812  
(EngineCore_7 pid=1389347)                nvjet_tst_448x128_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       2.004ms         0.08%       2.004ms     333.953us             6  
(EngineCore_7 pid=1389347)                      nvjet_tst_256x24_64x6_2x1_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.987ms         0.08%       1.987ms       8.145us           244  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x32_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.543ms         0.06%       1.543ms      12.646us           122  
(EngineCore_7 pid=1389347)                     nvjet_tst_128x24_64x11_1x1_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.429ms         0.06%       1.429ms       5.856us           244  
(EngineCore_7 pid=1389347)                                            aten::arange         0.17%       3.960ms         0.85%      19.264ms      11.862us       1.289ms         0.05%       2.581ms       1.589us          1624  
(EngineCore_7 pid=1389347)                                            aten::gather         0.12%       2.729ms         0.20%       4.462ms      10.989us       1.131ms         0.04%       1.131ms       2.785us           406  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x16_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.112ms         0.04%       1.112ms       9.112us           122  
(EngineCore_7 pid=1389347)                         triton_poi_fused_add_copy_mul_4         0.21%       4.682ms         0.28%       6.230ms      14.834us       1.090ms         0.04%       1.090ms       2.595us           420  
(EngineCore_7 pid=1389347)                                          aten::scatter_         0.11%       2.526ms         0.21%       4.770ms      11.748us     997.480us         0.04%     997.480us       2.457us           406  
(EngineCore_7 pid=1389347)                                               aten::div         0.11%       2.573ms         0.18%       4.120ms      10.148us     953.871us         0.04%     953.871us       2.349us           406  
(EngineCore_7 pid=1389347)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3         0.19%       4.403ms         0.26%       5.922ms      14.100us     938.443us         0.04%     938.443us       2.234us           420  
(EngineCore_7 pid=1389347)                                           aten::sigmoid         0.12%       2.638ms         0.20%       4.598ms      11.325us     799.426us         0.03%     799.426us       1.969us           406  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x32_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us     778.206us         0.03%     778.206us       6.379us           122  
(EngineCore_7 pid=1389347)                                               aten::add         0.11%       2.454ms         0.17%       3.958ms       9.750us     763.690us         0.03%     763.690us       1.881us           406  
(EngineCore_7 pid=1389347) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us     763.690us         0.03%     763.690us       1.881us           406  
(EngineCore_7 pid=1389347)                 nvjet_tst_64x16_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us     737.248us         0.03%     737.248us       6.043us           122  
(EngineCore_7 pid=1389347)      triton_per_fused__to_copy_add_mean_mul_pow_rsqrt_5         0.15%       3.498ms         0.22%       4.951ms      11.788us     692.547us         0.03%     692.547us       1.649us           420  
(EngineCore_7 pid=1389347)                             triton_poi_fused_mul_silu_1         0.22%       5.056ms         0.29%       6.625ms      15.515us     686.611us         0.03%     686.611us       1.608us           427  
(EngineCore_7 pid=1389347)                 nvjet_tst_448x96_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     676.226us         0.03%     676.226us     338.113us             2  
(EngineCore_7 pid=1389347)                              triton_poi_fused_add_mul_6         0.14%       3.140ms         0.20%       4.541ms      10.812us     650.181us         0.03%     650.181us       1.548us           420  
(EngineCore_7 pid=1389347)                                   Lazy Function Loading         0.00%      82.974us         0.00%      82.974us      41.487us     646.433us         0.03%     646.433us     323.217us             2  
(EngineCore_7 pid=1389347)                      nvjet_tst_448x40_64x3_2x1_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     633.729us         0.02%     633.729us     316.865us             2  
(EngineCore_7 pid=1389347)                                       aten::bitwise_not         0.09%       2.014ms         0.15%       3.485ms       8.584us     629.661us         0.02%     629.661us       1.551us           406  
(EngineCore_7 pid=1389347)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2         0.00%       0.000us         0.00%       0.000us       0.000us     619.495us         0.02%     619.495us       5.900us           105  
(EngineCore_7 pid=1389347)                                      aten::masked_fill_         0.08%       1.781ms         0.15%       3.350ms       8.250us     599.040us         0.02%     599.040us       1.475us           406  
(EngineCore_7 pid=1389347)                                triton_poi_fused_zeros_7         0.11%       2.451ms         0.17%       3.880ms       9.238us     586.674us         0.02%     586.674us       1.397us           420  
(EngineCore_7 pid=1389347)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0         0.00%       0.000us         0.00%       0.000us       0.000us     576.255us         0.02%     576.255us       5.488us           105  
(EngineCore_7 pid=1389347)                                            aten::argmax         0.01%     171.307us         0.01%     246.102us      17.579us     547.074us         0.02%     547.074us      39.077us            14  
(EngineCore_7 pid=1389347) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     547.074us         0.02%     547.074us      39.077us            14  
(EngineCore_7 pid=1389347)                nvjet_tst_448x112_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     345.184us         0.01%     345.184us     345.184us             1  
(EngineCore_7 pid=1389347)                 nvjet_tst_448x64_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     330.336us         0.01%     330.336us     330.336us             1  
(EngineCore_7 pid=1389347) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_7 pid=1389347) Self CPU time total: 2.265s
(EngineCore_7 pid=1389347) Self CUDA time total: 2.551s
(EngineCore_7 pid=1389347) 
(EngineCore_6 pid=1389346) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_6 pid=1389346)                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
(EngineCore_6 pid=1389346) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_6 pid=1389346) void deep_ep::internode_ll::dispatch<true, false, 71...         0.00%       0.000us         0.00%       0.000us       0.000us     944.589ms        35.52%     944.589ms     465.315us          2030  
(EngineCore_6 pid=1389346)                                       vllm::moe_forward        12.16%     275.185ms        23.44%     530.506ms       1.307ms     834.266ms        31.37%     918.100ms       2.261ms           406  
(EngineCore_6 pid=1389346) void deep_ep::internode_ll::combine<false, 7168, 9>(...         0.00%       0.000us         0.00%       0.000us       0.000us     486.631ms        18.30%     486.631ms     239.720us          2030  
(EngineCore_6 pid=1389346) void at::native::(anonymous namespace)::indexSelectL...         0.00%       0.000us         0.00%       0.000us       0.000us     204.831ms         7.70%     204.831ms      50.451us          4060  
(EngineCore_6 pid=1389346) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us     166.753ms         6.27%     166.753ms      24.711us          6748  
(EngineCore_6 pid=1389346) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us      77.489ms         2.91%      77.489ms      43.242us          1792  
(EngineCore_6 pid=1389346) void deep_gemm::sm100_fp8_gemm_1d1d_impl<(cute::UMMA...         0.00%       0.000us         0.00%       0.000us       0.000us      76.771ms         2.89%      76.771ms      37.818us          2030  
(EngineCore_6 pid=1389346)                           _silu_mul_fp8_quant_deep_gemm         0.00%       0.000us         0.00%       0.000us       0.000us      76.077ms         2.86%      76.077ms      37.476us          2030  
(EngineCore_6 pid=1389346) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      52.099ms         1.96%      52.099ms      12.832us          4060  
(EngineCore_6 pid=1389346) void cutlass::device_kernel<cutlass::fmha::kernel::S...         0.00%       0.000us         0.00%       0.000us       0.000us      46.423ms         1.75%      46.423ms      27.180us          1708  
(EngineCore_6 pid=1389346)                                   _C::cutlass_scaled_mm         0.52%      11.819ms         1.15%      26.070ms      12.211us      35.752ms         1.34%      35.752ms      16.746us          2135  
(EngineCore_6 pid=1389346) void cutlass::device_kernel<vllm::cutlass_3x_gemm_fp...         0.00%       0.000us         0.00%       0.000us       0.000us      35.752ms         1.34%      35.752ms      16.746us          2135  
(EngineCore_6 pid=1389346) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      35.680ms         1.34%      35.680ms       8.788us          4060  
(EngineCore_6 pid=1389346) void per_token_group_quant_8bit_kernel<c10::BFloat16...         0.00%       0.000us         0.00%       0.000us       0.000us      33.609ms         1.26%      33.609ms       3.935us          8540  
(EngineCore_6 pid=1389346) void deep_gemm::sm100_fp8_gemm_1d1d_impl<(cute::UMMA...         0.00%       0.000us         0.00%       0.000us       0.000us      32.813ms         1.23%      32.813ms      16.164us          2030  
(EngineCore_6 pid=1389346)                                      aten::index_select         0.22%       4.991ms         0.47%      10.616ms      12.853us      32.585ms         1.23%      32.585ms      39.449us           826  
(EngineCore_6 pid=1389346) void at::native::sbtopk::gatherTopK<c10::BFloat16, u...         0.00%       0.000us         0.00%       0.000us       0.000us      30.784ms         1.16%      30.784ms       5.832us          5278  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      29.824ms         1.12%      29.824ms       7.346us          4060  
(EngineCore_6 pid=1389346)                                             aten::copy_         0.91%      20.613ms        49.48%        1.120s     132.656us      24.307ms         0.91%      24.309ms       2.879us          8442  
(EngineCore_6 pid=1389346) void deep_gemm::transpose_and_pack_fp32_into_ue8m0<5...         0.00%       0.000us         0.00%       0.000us       0.000us      21.013ms         0.79%      21.013ms      10.351us          2030  
(EngineCore_6 pid=1389346) void deep_gemm::transpose_and_pack_fp32_into_ue8m0<5...         0.00%       0.000us         0.00%       0.000us       0.000us      20.673ms         0.78%      20.673ms      10.184us          2030  
(EngineCore_6 pid=1389346) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.00%       0.000us         0.00%       0.000us       0.000us      17.946ms         0.67%      17.946ms       8.840us          2030  
(EngineCore_6 pid=1389346) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      16.583ms         0.62%      16.583ms       3.045us          5446  
(EngineCore_6 pid=1389346) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      15.980ms         0.60%      15.980ms       3.936us          4060  
(EngineCore_6 pid=1389346)                       nvjet_tst_8x64_64x16_4x1_v_bz_TNN         0.00%       0.000us         0.00%       0.000us       0.000us      13.338ms         0.50%      13.338ms      10.950us          1218  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x64_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      13.068ms         0.49%      13.068ms      15.302us           854  
(EngineCore_6 pid=1389346)                                           memcpy32_post         0.00%       0.000us         0.00%       0.000us       0.000us      12.819ms         0.48%      12.819ms       1.868us          6862  
(EngineCore_6 pid=1389346) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.00%       0.000us         0.00%       0.000us       0.000us      11.546ms         0.43%      11.546ms       5.688us          2030  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us      10.756ms         0.40%      10.756ms       2.649us          4060  
(EngineCore_6 pid=1389346)                         triton_poi_fused_add_copy_mul_4         0.00%       0.000us         0.00%       0.000us       0.000us      10.540ms         0.40%      10.540ms       5.019us          2100  
(EngineCore_6 pid=1389346) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us      10.233ms         0.38%      10.233ms       2.520us          4060  
(EngineCore_6 pid=1389346) void at::native::bitonicSortKVInPlace<2, -1, 16, 16,...         0.00%       0.000us         0.00%       0.000us       0.000us      10.118ms         0.38%      10.118ms       4.984us          2030  
(EngineCore_6 pid=1389346)                                              aten::topk         0.46%      10.489ms         0.86%      19.496ms      16.006us       9.677ms         0.36%       9.677ms       7.945us          1218  
(EngineCore_6 pid=1389346)                                                aten::mm         0.89%      20.171ms         7.00%     158.382ms     377.099us       9.286ms         0.35%       9.958ms      23.710us           420  
(EngineCore_6 pid=1389346)                       nvjet_tst_64x8_64x16_2x4_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       8.724ms         0.33%       8.724ms      10.744us           812  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.180ms         0.27%       7.180ms       1.769us          4060  
(EngineCore_6 pid=1389346)                                               memcpy128         0.00%       0.000us         0.00%       0.000us       0.000us       7.110ms         0.27%       7.110ms       2.331us          3050  
(EngineCore_6 pid=1389346) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us       6.538ms         0.25%       6.538ms       3.221us          2030  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x64_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       6.098ms         0.23%       6.098ms       7.141us           854  
(EngineCore_6 pid=1389346) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.831ms         0.22%       5.831ms       2.872us          2030  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x48_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       5.703ms         0.21%       5.703ms      15.583us           366  
(EngineCore_6 pid=1389346)                           _C::per_token_group_fp8_quant         0.15%       3.472ms         0.54%      12.274ms       5.749us       5.332ms         0.20%       5.332ms       2.498us          2135  
(EngineCore_6 pid=1389346) void per_token_group_quant_8bit_kernel<c10::BFloat16...         0.00%       0.000us         0.00%       0.000us       0.000us       5.332ms         0.20%       5.332ms       2.498us          2135  
(EngineCore_6 pid=1389346) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.204ms         0.20%       5.204ms       2.546us          2044  
(EngineCore_6 pid=1389346) void at::native::sbtopk::gatherTopK<c10::BFloat16, u...         0.00%       0.000us         0.00%       0.000us       0.000us       5.107ms         0.19%       5.107ms       6.290us           812  
(EngineCore_6 pid=1389346) void (anonymous namespace)::elementwise_kernel_with_...         0.00%       0.000us         0.00%       0.000us       0.000us       4.894ms         0.18%       4.894ms       1.205us          4060  
(EngineCore_6 pid=1389346) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.804ms         0.18%       4.804ms       2.958us          1624  
(EngineCore_6 pid=1389346)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3         0.00%       0.000us         0.00%       0.000us       0.000us       4.460ms         0.17%       4.460ms       2.124us          2100  
(EngineCore_6 pid=1389346) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us       4.455ms         0.17%       4.455ms       2.195us          2030  
(EngineCore_6 pid=1389346) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.422ms         0.17%       4.422ms       2.178us          2030  
(EngineCore_6 pid=1389346) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.238ms         0.16%       4.238ms       2.073us          2044  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.950ms         0.15%       3.950ms       1.946us          2030  
(EngineCore_6 pid=1389346)                             triton_poi_fused_mul_silu_1         0.00%       0.000us         0.00%       0.000us       0.000us       3.898ms         0.15%       3.898ms       1.826us          2135  
(EngineCore_6 pid=1389346) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.28%       6.290ms         0.35%       7.824ms      19.271us       3.862ms         0.15%       3.862ms       9.511us           406  
(EngineCore_6 pid=1389346) void vllm::concat_and_cache_mla_kernel<__nv_bfloat16...         0.00%       0.000us         0.00%       0.000us       0.000us       3.829ms         0.14%       3.829ms       2.242us          1708  
(EngineCore_6 pid=1389346) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       3.440ms         0.13%       3.440ms       1.683us          2044  
(EngineCore_6 pid=1389346)      triton_per_fused__to_copy_add_mean_mul_pow_rsqrt_5         0.00%       0.000us         0.00%       0.000us       0.000us       3.384ms         0.13%       3.384ms       1.611us          2100  
(EngineCore_6 pid=1389346)                               aten::bitwise_right_shift         0.32%       7.234ms         0.48%      10.805ms      13.307us       3.341ms         0.13%       3.341ms       4.114us           812  
(EngineCore_6 pid=1389346)                              triton_poi_fused_add_mul_6         0.00%       0.000us         0.00%       0.000us       0.000us       3.076ms         0.12%       3.076ms       1.465us          2100  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.947ms         0.11%       2.947ms       1.452us          2030  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.913ms         0.11%       2.913ms       1.435us          2030  
(EngineCore_6 pid=1389346)                                triton_poi_fused_zeros_7         0.00%       0.000us         0.00%       0.000us       0.000us       2.869ms         0.11%       2.869ms       1.366us          2100  
(EngineCore_6 pid=1389346)                                               aten::sum         0.28%       6.252ms         0.43%       9.675ms      11.915us       2.866ms         0.11%       2.866ms       3.529us           812  
(EngineCore_6 pid=1389346)                                             aten::fill_         0.24%       5.530ms         0.51%      11.605ms       6.304us       2.779ms         0.10%       2.779ms       1.509us          1841  
(EngineCore_6 pid=1389346) triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_vie...         0.22%       4.921ms         0.29%       6.472ms      15.942us       2.751ms         0.10%       2.751ms       6.776us           406  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.728ms         0.10%       2.728ms       1.110us          2457  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x48_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       2.619ms         0.10%       2.619ms       7.157us           366  
(EngineCore_6 pid=1389346)                          Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.596ms         0.10%       2.596ms       1.598us          1624  
(EngineCore_6 pid=1389346)                                      aten::floor_divide         0.12%       2.826ms         0.24%       5.527ms       6.807us       2.339ms         0.09%       2.339ms       2.881us           812  
(EngineCore_6 pid=1389346)                nvjet_tst_448x128_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       2.043ms         0.08%       2.043ms     340.497us             6  
(EngineCore_6 pid=1389346)                      nvjet_tst_256x24_64x6_2x1_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       2.011ms         0.08%       2.011ms       8.243us           244  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x32_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.619ms         0.06%       1.619ms      13.274us           122  
(EngineCore_6 pid=1389346)                     nvjet_tst_128x24_64x11_1x1_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.428ms         0.05%       1.428ms       5.854us           244  
(EngineCore_6 pid=1389346)                                            aten::arange         0.18%       4.014ms         0.86%      19.378ms      11.932us       1.279ms         0.05%       2.561ms       1.577us          1624  
(EngineCore_6 pid=1389346)                                            aten::gather         0.12%       2.773ms         0.20%       4.573ms      11.264us       1.142ms         0.04%       1.142ms       2.814us           406  
(EngineCore_6 pid=1389346)                         triton_poi_fused_add_copy_mul_4         0.20%       4.629ms         0.28%       6.259ms      14.903us       1.098ms         0.04%       1.098ms       2.614us           420  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x16_64x16_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us       1.095ms         0.04%       1.095ms       8.975us           122  
(EngineCore_6 pid=1389346)                                          aten::scatter_         0.11%       2.461ms         0.21%       4.736ms      11.665us       1.038ms         0.04%       1.040ms       2.561us           406  
(EngineCore_6 pid=1389346)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3         0.19%       4.273ms         0.26%       5.838ms      13.901us       1.029ms         0.04%       1.029ms       2.449us           420  
(EngineCore_6 pid=1389346)                                               aten::div         0.11%       2.562ms         0.19%       4.231ms      10.420us     915.400us         0.03%     915.400us       2.255us           406  
(EngineCore_6 pid=1389346)                                           aten::sigmoid         0.12%       2.721ms         0.21%       4.744ms      11.684us     818.785us         0.03%     818.785us       2.017us           406  
(EngineCore_6 pid=1389346)                                               aten::add         0.10%       2.348ms         0.17%       3.938ms       9.700us     775.226us         0.03%     775.226us       1.909us           406  
(EngineCore_6 pid=1389346) void at::native::vectorized_elementwise_kernel<8, at...         0.00%       0.000us         0.00%       0.000us       0.000us     775.226us         0.03%     775.226us       1.909us           406  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x32_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us     759.274us         0.03%     759.274us       6.224us           122  
(EngineCore_6 pid=1389346)                             triton_poi_fused_mul_silu_1         0.23%       5.233ms         0.30%       6.819ms      15.970us     758.933us         0.03%     758.933us       1.777us           427  
(EngineCore_6 pid=1389346)                 nvjet_tst_64x16_64x16_2x1_2cta_v_bz_NNT         0.00%       0.000us         0.00%       0.000us       0.000us     735.618us         0.03%     735.618us       6.030us           122  
(EngineCore_6 pid=1389346)      triton_per_fused__to_copy_add_mean_mul_pow_rsqrt_5         0.16%       3.682ms         0.23%       5.197ms      12.373us     721.258us         0.03%     721.258us       1.717us           420  
(EngineCore_6 pid=1389346)                                       aten::bitwise_not         0.10%       2.173ms         0.16%       3.691ms       9.090us     707.265us         0.03%     707.265us       1.742us           406  
(EngineCore_6 pid=1389346)                                      aten::masked_fill_         0.08%       1.768ms         0.15%       3.479ms       8.570us     696.702us         0.03%     696.702us       1.716us           406  
(EngineCore_6 pid=1389346)                      nvjet_tst_448x40_64x3_2x1_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     678.818us         0.03%     678.818us     339.409us             2  
(EngineCore_6 pid=1389346)                                   Lazy Function Loading         0.00%      76.234us         0.00%      76.234us      38.117us     672.578us         0.03%     672.578us     336.289us             2  
(EngineCore_6 pid=1389346)                              triton_poi_fused_add_mul_6         0.14%       3.143ms         0.21%       4.651ms      11.073us     665.718us         0.03%     665.718us       1.585us           420  
(EngineCore_6 pid=1389346)                 nvjet_tst_448x96_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     640.673us         0.02%     640.673us     320.336us             2  
(EngineCore_6 pid=1389346)                                triton_poi_fused_zeros_7         0.11%       2.469ms         0.17%       3.960ms       9.428us     613.474us         0.02%     613.474us       1.461us           420  
(EngineCore_6 pid=1389346)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2         0.00%       0.000us         0.00%       0.000us       0.000us     612.643us         0.02%     612.643us       5.835us           105  
(EngineCore_6 pid=1389346)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0         0.00%       0.000us         0.00%       0.000us       0.000us     578.661us         0.02%     578.661us       5.511us           105  
(EngineCore_6 pid=1389346)                                            aten::argmax         0.01%     151.002us         0.01%     222.538us      15.896us     538.562us         0.02%     538.562us      38.469us            14  
(EngineCore_6 pid=1389346) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     538.562us         0.02%     538.562us      38.469us            14  
(EngineCore_6 pid=1389346)                nvjet_tst_448x112_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     356.385us         0.01%     356.385us     356.385us             1  
(EngineCore_6 pid=1389346)                 nvjet_tst_448x64_64x3_2x1_2cta_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us     350.625us         0.01%     350.625us     350.625us             1  
(EngineCore_6 p

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@github-actions
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Aug 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to reduce log verbosity by ensuring that profiler results are only printed on rank 0. The changes implement this by adding a conditional check for `self.rank == 0` before printing the profiler summary table in both `vllm/v1/worker/gpu_worker.py` and `vllm/worker/worker.py`. This suppresses redundant output from other ranks. The implementation is correct and achieves the intended purpose.
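
The gating described in this review can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `SketchWorker`, `_profiler_table`, and `stop_profile` are hypothetical names chosen for the example, and the table string is a placeholder for whatever summary a real profiler would produce.

```python
from typing import Optional


class SketchWorker:
    """Hypothetical stand-in for a worker process; not vLLM's actual class."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def _profiler_table(self) -> str:
        # Placeholder for the summary table a real profiler would return.
        return "Name         Self CUDA\nmoe_forward  734.719ms"

    def stop_profile(self) -> Optional[str]:
        # Every rank still collects its own results; only rank 0 prints them.
        table = self._profiler_table()
        if self.rank == 0:
            print(table)
            return table
        return None
```

With this shape, `SketchWorker(0).stop_profile()` prints the table once, while any other rank stays silent, which is the log reduction the PR is after.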

Member

@tlrmchlsmth tlrmchlsmth left a comment

Maybe better to change this to print only if the local_rank is 0?

That way at least every pod will print the profiler results, with no risk of the ranks clobbering each other
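
To make the difference between the two gating rules concrete, here is a small sketch that lists which global ranks would print under each rule. It assumes ranks are placed contiguously on nodes, so that `local_rank = rank % gpus_per_node`; that mapping, the function name `printing_ranks`, and its parameters are assumptions of this example and do not come from the PR.

```python
from typing import List


def printing_ranks(world_size: int, gpus_per_node: int,
                   use_local_rank: bool) -> List[int]:
    """Return the global ranks that would print under the chosen rule.

    Assumes contiguous rank placement, i.e. local_rank = rank % gpus_per_node.
    """
    selected = []
    for rank in range(world_size):
        local_rank = rank % gpus_per_node
        # Gate on the local rank (one printer per node) or the global rank
        # (one printer for the whole job).
        key = local_rank if use_local_rank else rank
        if key == 0:
            selected.append(rank)
    return selected
```

For two nodes with eight GPUs each, `printing_ranks(16, 8, use_local_rank=False)` gives `[0]`, while `printing_ranks(16, 8, use_local_rank=True)` gives `[0, 8]`: one printer per node, so each pod's logs still contain a profiler table.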

@ProExpertProg ProExpertProg enabled auto-merge (squash) August 27, 2025 21:59
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 27, 2025
@ProExpertProg ProExpertProg disabled auto-merge August 27, 2025 22:00
@ProExpertProg
Collaborator

I did not see Tyler's comment; I agree with what he said.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256
Member Author

> Maybe better to change this to print only if the local_rank is 0?
>
> That way at least every pod will print the profiler results, with no risk of the ranks clobbering each other

Fixed, thanks! @tlrmchlsmth

@yewentao256 yewentao256 enabled auto-merge (squash) August 28, 2025 15:24
@yewentao256 yewentao256 merged commit 98aee61 into vllm-project:main Sep 2, 2025
38 checks passed
@yewentao256 yewentao256 deleted the wye-only-print-cuda-profile-time-on-rank-0 branch September 2, 2025 21:46
845473182 pushed a commit to 845473182/vllm that referenced this pull request Sep 3, 2025
* 'main' of https://github.com/845473182/vllm: (457 commits)
  [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models (vllm-project#24132)
  [Misc] Add check for dual_chunk_attention (vllm-project#24070)
  [Doc]: fix typos in Python comments (vllm-project#24115)
  [Doc]: fix typos in Python comments (vllm-project#24093)
  [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` (vllm-project#23660)
  fix some typos (vllm-project#24071)
  [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing (vllm-project#23656)
  Upgrade xgrammar to 0.1.23 (vllm-project#22988)
  Update release pipeline post PyTorch 2.8.0 update (vllm-project#24073)
  [XPU] Fix the bug of LoRA logits on the XPU platform (vllm-project#24081)
  [CI/Build] Disable SiluMul NVFP4 quant fusion tests (vllm-project#24121)
  [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue (vllm-project#24119)
  [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault (vllm-project#23692)
  [CI] Enable all hf transformers baselines in test_hybrid (vllm-project#23936)
  [Log] Only Print Profiler Results on Rank 0 (vllm-project#23370)
  Fix weights loading for Apertus (vllm-project#24100)
  [Metrics] Deprecate TPOT in favor of ITL (vllm-project#24110)
  [Bugfix] Fix packed_factor missing attribute error (vllm-project#23902)
  Run ruff format on a few files. (vllm-project#24075)
  [Bugfix] Fix transform_config parsing in Compressed Tensors (vllm-project#23945)
  ...
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>