deps: bump tokenspeed-trtllm-kernel to 1.3.0rc15.post20260522+full by aaronliuls · Pull Request #227 · lightseekorg/tokenspeed

aaronliuls · 2026-05-23T10:44:39Z

Summary

Bump pin from tokenspeed-trtllm-kernel==1.2.1.post20260427 (lite) → ==1.3.0rc15.post20260522+full (full). Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release.
Fix fp8_blockwise_scaled_mm wrapper: upstream trtllm::fp8_block_scaling_gemm_impl dropped the alpha and out_dtype parameters (alpha is now hardcoded inside the kernel and the output dtype is derived). The wrapper now calls the 4-arg form and casts the result afterwards.
The other 7 wrappers match the rc15 schemas unchanged.
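
The `+full` suffix in the new pin is a PEP 440 local version segment, which is what separates the full wheel from the lite one at the same public version. A minimal sketch of pulling the two pins apart; `split_pin` is a hypothetical helper written for illustration, not part of tokenspeed:

```python
def split_pin(pin: str) -> tuple[str, str, str]:
    """Split an exact pin 'name==public+local' into (name, public, local).

    Hypothetical helper for illustration; a pin with no '+' yields an
    empty local segment.
    """
    name, sep, version = pin.partition("==")
    if not sep:
        raise ValueError(f"not an exact '==' pin: {pin!r}")
    public, _, local = version.partition("+")
    return name.strip(), public.strip(), local.strip()


# The two pins quoted in the summary above.
old = split_pin("tokenspeed-trtllm-kernel==1.2.1.post20260427")
new = split_pin("tokenspeed-trtllm-kernel==1.3.0rc15.post20260522+full")

print(old)  # local segment is empty for the lite wheel
print(new)  # local segment is "full" for the full wheel
```

Plain string splitting is enough here because both pins are exact `==` pins; anything that needs to compare or order versions should use a real PEP 440 parser instead.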

Upstream blockers found + addressed (in tokenspeed-trtllm-kernel)

rc15 requires PTX ISA 9.1 (cvt.e4m3x2.bf16x2 in moeAlltoAllKernels.cu), which the CUDA 13.0 ptxas rejects. Toolchain bumped to CUDA 13.1 (matches nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15).
kernelParams.h deleted upstream → struct moved to trtllmGen_fmha_export/KernelParams.h. Ported FMHA stride override + K/V pointer fallback to new locations.
TllmGenFmhaRunner constructor split dtypeKv → dtypeK/dtypeV. fmhaRunnerOp.cpp wrapper updated.
nlohmann/json now a hard dep (FmhaOptions::toJson). Added to BUILD_OPS_ONLY's FetchContent + include_directories.
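
Because the build fails on CUDA 13.0, a pre-build guard on the toolchain version can surface the problem earlier than a ptxas error. The following is a sketch only, assuming the usual `release X.Y` fragment in `nvcc --version` output; the function names and the sample strings are illustrative, not taken from the kernel repository:

```python
import re

# Minimum toolchain per the PR description: CUDA 13.0 ptxas rejects the rc15 PTX.
MIN_CUDA = (13, 1)


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from text such as `nvcc --version` prints."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' fragment found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def toolchain_ok(nvcc_output: str, minimum: tuple[int, int] = MIN_CUDA) -> bool:
    """True when the reported CUDA release is at least `minimum`."""
    return parse_nvcc_release(nvcc_output) >= minimum


# Illustrative sample lines, not captured from a real build log.
sample_130 = "Cuda compilation tools, release 13.0, V13.0.0"
sample_131 = "Cuda compilation tools, release 13.1, V13.1.0"

print(toolchain_ok(sample_130))
print(toolchain_ok(sample_131))
```

In practice the input string would come from running `nvcc --version` in the build container; comparing tuples avoids the string-ordering trap where "13.10" sorts before "13.2".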

Sister branches:

lightseekorg/tokenspeed-trtllm-kernel@upgrade/v1.3.0rc15 (patches + Dockerfile)
lightseekorg/tokenspeed-third-party@ci/trtllm-v1.3.0rc15 (workflow CUDA 13.1 + lite disabled)

Wheel built by workflow run #26304084690 (build steps green; release-publish step failed on GH_TOKEN scope — wheels available as artifacts).

Test plan

ut-tokenspeed-kernel matrix green on h100/b200/b300/gb200 (mi355 should skip cleanly — trtllm is nvidia-only)
test_numerics.py covers 7 trtllm-backed registrations (3 gemm + 1 moe + 3 quantize) — must stay green
Verify the wheel actually installs in CI (release-publish on lightseek-bot/tmp is currently blocked on token scope)
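
For the last item, one way to confirm that the expected wheel (including its `+full` local segment) is what actually got installed is to read the distribution metadata. A minimal sketch using the standard-library `importlib.metadata`; `pin_is_installed` is a hypothetical helper and the comparison is an exact string match with no version normalization:

```python
from importlib import metadata


def pin_is_installed(dist_name: str, expected_version: str) -> bool:
    """Return True only if `dist_name` is installed at exactly `expected_version`.

    Hypothetical helper for illustration. A missing distribution counts as
    a mismatch instead of raising.
    """
    try:
        return metadata.version(dist_name) == expected_version
    except metadata.PackageNotFoundError:
        return False


# Name and version are the ones quoted in this PR; the result depends on
# the environment the check runs in.
ok = pin_is_installed("tokenspeed-trtllm-kernel", "1.3.0rc15.post20260522+full")
print("pinned wheel installed:", ok)
```

A check like this only proves the metadata matches; importing the package and exercising one op would still be needed to show that the wheel loads on the target GPU.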

Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release. Switches the runtime from the lite wheel (no local segment) to the full wheel (+full local segment): full ships the complete 150-op surface vs lite's 12-op profile.

The trtllm op signature for fp8_block_scaling_gemm_impl changed upstream: the alpha and out_dtype parameters were dropped (the kernel hardcodes alpha=1.0 and selects the output dtype internally, bf16 on Blackwell). The fp8_blockwise_scaled_mm wrapper now calls the 4-arg form and casts the result to the caller's requested dtype only when it differs.

Other wrappers (dsv3_fused_a_gemm, per_token/per_tensor quant fp8, per_token_group_quant_8bit, moe_align_block_size, fast_topk_v2) match the rc15 schemas without changes.

Verified locally on B200/SM100 with CUDA 13.1 (matches the upstream nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 toolchain).

Signed-off-by: aaronliuls <aaron@lightseek.org>
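
The "call the 4-arg form, then cast only when needed" pattern described above can be sketched as follows. This is not the actual tokenspeed wrapper: the argument names and order, `FakeTensor`, and `fake_gemm_impl` are all assumptions made so the example runs without a GPU. The only thing it relies on from a real tensor type is a `.dtype` attribute and a `.to(dtype)` method.

```python
class FakeTensor:
    """Stand-in for a tensor that only carries a dtype label (illustration only)."""

    def __init__(self, dtype: str):
        self.dtype = dtype

    def to(self, dtype: str) -> "FakeTensor":
        # Mimics a dtype cast by returning a new object with the new label.
        return FakeTensor(dtype)


def fake_gemm_impl(a, b, a_scales, b_scales) -> FakeTensor:
    # Hypothetical stand-in for the 4-arg kernel: no alpha, no out_dtype,
    # and the output dtype is chosen by the kernel (here always "bfloat16").
    return FakeTensor("bfloat16")


def fp8_blockwise_scaled_mm(a, b, a_scales, b_scales, out_dtype, gemm_impl=fake_gemm_impl):
    """Call the 4-arg GEMM, then cast only if the caller asked for another dtype."""
    out = gemm_impl(a, b, a_scales, b_scales)
    if out.dtype != out_dtype:
        out = out.to(out_dtype)  # post-cast replaces the removed out_dtype argument
    return out


# Inputs are irrelevant to the stand-in kernel, so None is passed for brevity.
same = fp8_blockwise_scaled_mm(None, None, None, None, "bfloat16")
cast = fp8_blockwise_scaled_mm(None, None, None, None, "float16")
print(same.dtype, cast.dtype)
```

Skipping the cast when the dtypes already match keeps the common bf16 path free of an extra copy, which appears to be the intent of "only when it differs".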

aaronliuls requested a review from a team as a code owner May 23, 2026 10:44

aaronliuls wants to merge 1 commit into main from numerics/trtllm-kernel-v1.3.0rc15
