deps: bump tokenspeed-trtllm-kernel to 1.3.0rc15.post20260522+full#227

Open
aaronliuls wants to merge 1 commit into main from numerics/trtllm-kernel-v1.3.0rc15

@aaronliuls
Contributor

Summary

  • Bump pin from tokenspeed-trtllm-kernel==1.2.1.post20260427 (lite) → ==1.3.0rc15.post20260522+full (full). Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release.
  • Fix fp8_blockwise_scaled_mm wrapper: upstream trtllm::fp8_block_scaling_gemm_impl dropped alpha + out_dtype (now hardcoded inside kernel, dtype derived). Wrapper now uses the 4-arg form + post-cast.
  • Other 7 wrappers match rc15 schemas unchanged.

Upstream blockers found + addressed (in tokenspeed-trtllm-kernel)

  • rc15 requires PTX ISA 9.1 (cvt.e4m3x2.bf16x2 in moeAlltoAllKernels.cu), which CUDA 13.0's ptxas rejects. Toolchain bumped to CUDA 13.1 (matches nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15).
  • kernelParams.h deleted upstream → struct moved to trtllmGen_fmha_export/KernelParams.h. Ported FMHA stride override + K/V pointer fallback to new locations.
  • TllmGenFmhaRunner constructor split dtypeKv → dtypeK/dtypeV. fmhaRunnerOp.cpp wrapper updated.
  • nlohmann/json now a hard dep (FmhaOptions::toJson). Added to BUILD_OPS_ONLY's FetchContent + include_directories.
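
Because the PTX ISA 9.1 requirement makes CUDA 13.1 the minimum toolkit, a build script could fail fast before invoking the compiler. The following is a minimal sketch, not part of this PR: the helper names are hypothetical, and it assumes the usual `release X.Y` field in `nvcc --version` output.

```python
import re

# Minimum toolkit taken from the PR text: rc15 needs PTX ISA 9.1,
# which the ptxas shipped with CUDA 13.0 rejects.
MIN_CUDA = (13, 1)


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' field of nvcc output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def toolkit_is_new_enough(nvcc_output: str, minimum=MIN_CUDA) -> bool:
    """Return True when the reported toolkit is at least `minimum`."""
    # Tuple comparison orders (13, 1) after (13, 0) and before (14, 0).
    return parse_nvcc_release(nvcc_output) >= minimum
```

In practice the input string would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`; keeping the parser separate from the subprocess call makes it easy to test without a GPU toolchain installed.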

Sister branches:

  • lightseekorg/tokenspeed-trtllm-kernel@upgrade/v1.3.0rc15 (patches + Dockerfile)
  • lightseekorg/tokenspeed-third-party@ci/trtllm-v1.3.0rc15 (workflow CUDA 13.1 + lite disabled)

Wheel built by workflow run #26304084690 (build steps green; release-publish step failed on GH_TOKEN scope — wheels available as artifacts).

Test plan

  • ut-tokenspeed-kernel matrix green on h100/b200/b300/gb200 (mi355 should skip cleanly — trtllm is nvidia-only)
  • test_numerics.py covers 7 trtllm-backed registrations (3 gemm + 1 moe + 3 quantize) — must stay green
  • Verify the wheel actually installs in CI (release-publish on lightseek-bot/tmp is currently blocked on token scope)

Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release. Switches the
runtime from the lite wheel (no local segment) to the full wheel (+full
local segment) — full ships the complete 150-op surface vs lite's 12-op
profile.
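
Since the lite and full wheels differ only in the local version segment, a test could confirm which variant is installed by inspecting the version string. This is a rough sketch with hypothetical helper names; it relies only on the PEP 440 rule that the local segment follows a `+`.

```python
from typing import Optional


def local_segment(version: str) -> Optional[str]:
    """Return the PEP 440 local version label (text after '+'), or None."""
    _, sep, local = version.partition("+")
    return local if sep else None


def is_full_wheel(version: str) -> bool:
    """True when the version carries the '+full' local segment."""
    return local_segment(version) == "full"
```

The installed version string could be obtained with `importlib.metadata.version("tokenspeed-trtllm-kernel")`, assuming the distribution name matches the pin above.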

The trtllm op signature for fp8_block_scaling_gemm_impl changed upstream:
the alpha and out_dtype parameters were dropped (the kernel hardcodes
alpha=1.0 and selects output dtype internally — bf16 on Blackwell). The
fp8_blockwise_scaled_mm wrapper now calls the 4-arg form and casts the
result to the caller's requested dtype only when it differs.
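
The call-then-cast pattern described above can be sketched as follows. This is an illustration under assumptions, not the wrapper from this PR: the kernel is passed in as `gemm_impl` (presumably the op exposed as `torch.ops.trtllm.fp8_block_scaling_gemm_impl`), the argument order is assumed, and the code only requires that the result expose `.dtype` and `.to()` the way a `torch.Tensor` does.

```python
def fp8_blockwise_scaled_mm(a, b, scale_a, scale_b, out_dtype, gemm_impl):
    """Run the 4-argument GEMM, then cast only if the dtypes differ.

    `gemm_impl` is a stand-in for the trtllm op; its argument order here
    is an assumption made for illustration.
    """
    # rc15 form: no alpha and no out_dtype; the kernel picks the output dtype.
    out = gemm_impl(a, b, scale_a, scale_b)
    # Skip the cast when the kernel already produced the requested dtype.
    if out.dtype != out_dtype:
        out = out.to(out_dtype)
    return out
```

Comparing dtypes before casting avoids an extra conversion call in the common case where the caller asks for bf16 and the kernel already returns bf16.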

Other wrappers (dsv3_fused_a_gemm, per_token/per_tensor quant fp8,
per_token_group_quant_8bit, moe_align_block_size, fast_topk_v2) match the
rc15 schemas without changes.

Verified locally on B200/SM100 with CUDA 13.1 (matches the upstream
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 toolchain).

Signed-off-by: aaronliuls <aaron@lightseek.org>
@aaronliuls aaronliuls requested a review from a team as a code owner May 23, 2026 10:44