deps: bump tokenspeed-trtllm-kernel to 1.3.0rc15.post20260522+full #227
Open
aaronliuls wants to merge 1 commit into
Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release. Switches the runtime from the lite wheel (no local segment) to the full wheel (+full local segment): full ships the complete 150-op surface vs lite's 12-op profile.

The trtllm op signature for fp8_block_scaling_gemm_impl changed upstream: the alpha and out_dtype parameters were dropped (the kernel hardcodes alpha=1.0 and selects the output dtype internally, bf16 on Blackwell). The fp8_blockwise_scaled_mm wrapper now calls the 4-arg form and casts the result to the caller's requested dtype only when it differs.

Other wrappers (dsv3_fused_a_gemm, per_token/per_tensor quant fp8, per_token_group_quant_8bit, moe_align_block_size, fast_topk_v2) match the rc15 schemas without changes.

Verified locally on B200/SM100 with CUDA 13.1 (matches the upstream nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 toolchain).

Signed-off-by: aaronliuls <aaron@lightseek.org>
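For reviewers, the wrapper change described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `gemm_impl` is a hypothetical stand-in for the 4-arg `trtllm::fp8_block_scaling_gemm_impl` op, the argument order `(a, b, a_scale, b_scale)` is an assumption, and tensors are duck-typed (anything with `.dtype` and `.to(dtype)`, as torch tensors have).

```python
from typing import Any, Callable


def fp8_blockwise_scaled_mm_sketch(
    gemm_impl: Callable[[Any, Any, Any, Any], Any],
    a: Any,
    b: Any,
    a_scale: Any,
    b_scale: Any,
    out_dtype: Any,
) -> Any:
    """Call the 4-arg GEMM, then cast only if the caller asked for another dtype.

    gemm_impl is a hypothetical stand-in for the upstream op; the argument
    order used here is assumed, not taken from the rc15 schema.
    """
    # The op no longer takes alpha or out_dtype, so only four arguments go in.
    out = gemm_impl(a, b, a_scale, b_scale)

    # The kernel picks its own output dtype; skip the cast when it already
    # matches so the common case pays for no extra copy.
    if out.dtype != out_dtype:
        out = out.to(out_dtype)
    return out
```

Passing the op in as a parameter keeps the sketch independent of how the real wrapper resolves it, and makes the cast-only-when-needed branch easy to exercise with a fake tensor.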
Summary
- tokenspeed-trtllm-kernel==1.2.1.post20260427 (lite) → ==1.3.0rc15.post20260522+full (full). Aligns with NVIDIA's official TensorRT-LLM v1.3.0rc15 release.
- fp8_blockwise_scaled_mm wrapper: upstream trtllm::fp8_block_scaling_gemm_impl dropped alpha + out_dtype (alpha is now hardcoded inside the kernel, the output dtype is derived). The wrapper now uses the 4-arg form + post-cast.

Upstream blockers found + addressed (in tokenspeed-trtllm-kernel)
- cvt.e4m3x2.bf16x2 in moeAlltoAllKernels.cu: CUDA 13.0 ptxas rejects it. Toolchain bumped to CUDA 13.1 (matches nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15).
- kernelParams.h deleted upstream → struct moved to trtllmGen_fmha_export/KernelParams.h. Ported FMHA stride override + K/V pointer fallback to new locations.
- TllmGenFmhaRunner constructor split dtypeKv → dtypeK/dtypeV. fmhaRunnerOp.cpp wrapper updated.
- nlohmann/json is now a hard dep (FmhaOptions::toJson). Added to BUILD_OPS_ONLY's FetchContent + include_directories.

Sister branches:
- lightseekorg/tokenspeed-trtllm-kernel@upgrade/v1.3.0rc15 (patches + Dockerfile)
- lightseekorg/tokenspeed-third-party@ci/trtllm-v1.3.0rc15 (workflow CUDA 13.1 + lite disabled)

Wheel built by workflow run #26304084690 (build steps green; release-publish step failed on GH_TOKEN scope; wheels available as artifacts).
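Since the lite and full wheels are told apart only by the version's local segment (the part after `+`), a quick sanity check on a pinned version might look like the sketch below. The helper names `split_local_segment` and `wheel_flavor` are hypothetical and not part of this PR; the lite/full mapping follows the description above (no local segment means lite, `+full` means full).

```python
from typing import Optional, Tuple


def split_local_segment(version: str) -> Tuple[str, Optional[str]]:
    """Split a version string into (public version, local segment or None)."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def wheel_flavor(version: str) -> str:
    """Map a pinned version to the wheel flavor described in this PR.

    Assumes the convention from the summary: no local segment is the lite
    wheel, a "+full" local segment is the full wheel.
    """
    _, local = split_local_segment(version)
    if local is None:
        return "lite"
    if local == "full":
        return "full"
    # Any other local segment is outside the convention, so refuse to guess.
    raise ValueError(f"unrecognized local segment: {local!r}")
```

For the two pins in this PR, `wheel_flavor("1.2.1.post20260427")` gives `"lite"` and `wheel_flavor("1.3.0rc15.post20260522+full")` gives `"full"`.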
Test plan
- ut-tokenspeed-kernel matrix green on h100/b200/b300/gb200 (mi355 should skip cleanly; trtllm is nvidia-only)
- test_numerics.py covers 7 trtllm-backed registrations (3 gemm + 1 moe + 3 quantize); must stay green