
Releases: flashinfer-ai/flashinfer

v0.2.11

09 Aug 04:51
fc88829
Pre-release

What's Changed

  • Fix flag order by @nandor in #1392
  • Add flags to trim down AoT builds by @nandor in #1393
  • Force upgrade cuDNN to latest by @paul841029 in #1401
  • Adding FP8 benchmark on attention and matmul testing by @bkryu in #1390
  • feature: enable cublas for fp4 gemm when cudnn == 9.11.1 or >= 9.13 by @ttyio in #1405
  • Relax the clear_cuda_cache by @yongwww in #1406
  • Update autotune results for the nvfp4 cutlass moe backends for v0.2.9 by @kaixih in #1361
  • fix shared memory alignment conflict in sampling.cuh by @842974287 in #1402
  • Fix trtllm moe launcher local_num_experts by @wenscarl in #1398
  • [bugfix] Fix compilation failure when compiling csrc/trtllm_moe_allreduce_fusion.cu by @nvpohanh in #1410
  • install: remove nvidia-cudnn-12 from package dependency by @yzh119 in #1409
  • Add mypy to pre-commit by @cyx-6 in #1179
  • feat(aot): add nvshmem module for aot compilation by @EmilienM in #1261
  • Add ruff to pre-commit by @cyx-6 in #1201
  • install: remove nvidia-nvshmem-cu12 from package dependency by @EmilienM in #1426
  • Fix redundant kernels in moe by @fzyzcjy in #1428
  • ci: add arm64 to release-ci-docker.yml by @yzh119 in #1429
  • Fix crash when pos_encoding_mode is passed as int by @kaixih in #1413
  • Fix trtllm_ar failure by @nvpohanh in #1423
  • Use self hosted runner for arm image build by @yongwww in #1433
  • Remove const qualifier to avoid compilation error by @842974287 in #1421
  • Add multi-arch Docker image for x86-64 and arm64 by @yongwww in #1431
  • Add NOTICE with copyrights by @sricketts in #1432
  • Fix FusedMoeRunner does not exist error by @nvpohanh in #1424
  • Putting back cudnn_batch_prefill_with_kv_cache that was deleted by ruff by @bkryu in #1438
  • Decouple cutlass config version from flashinfer version by @kaixih in #1441
  • feat: Fused rope fp8 quantize kernel for MLA by @yzh119 in #1339
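Several of the fixes above guard against loose argument types, e.g. #1413, where `pos_encoding_mode` arriving as a plain `int` caused a crash. A minimal sketch of that defensive-coercion pattern, using a hypothetical `PosEncodingMode` enum rather than flashinfer's actual internals:

```python
from enum import IntEnum

class PosEncodingMode(IntEnum):
    # Hypothetical stand-in for flashinfer's internal mode constants.
    NONE = 0
    ROPE_LLAMA = 1
    ALIBI = 2

def normalize_mode(mode) -> PosEncodingMode:
    """Accept an enum member, its integer value, or its name."""
    if isinstance(mode, PosEncodingMode):
        return mode
    if isinstance(mode, int):
        return PosEncodingMode(mode)  # raises ValueError on unknown ints
    if isinstance(mode, str):
        return PosEncodingMode[mode]  # raises KeyError on unknown names
    raise TypeError(f"unsupported pos_encoding_mode: {mode!r}")
```

Coercing at the API boundary like this lets the dispatch logic below assume a single canonical type.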

New Contributors

Full Changelog: v0.2.10...v0.2.11

v0.2.10

05 Aug 17:45
7c79b41

What's Changed

  • GPT-OSS Support: Add Blackwell MoE mxfp4 implementation from TRTLLM and Attention Sink by @joker-eph in #1389
  • release: bump version to v0.2.10 by @yzh119 in #1391
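The attention-sink support added in #1389 can be pictured with the sliding-window formulation of sinks (a handful of initial tokens that every query may always attend to, alongside its recent window). This is an illustrative sketch of which positions are attendable, not flashinfer's kernel logic:

```python
def attended_positions(query_pos: int, window: int, num_sinks: int) -> list[int]:
    """Positions a query at `query_pos` may attend to under a
    sink + sliding-window scheme (illustrative only)."""
    # The first `num_sinks` tokens are always kept attendable.
    sinks = set(range(min(num_sinks, query_pos + 1)))
    # Plus the most recent `window` tokens up to and including the query.
    window_start = max(0, query_pos - window + 1)
    recent = set(range(window_start, query_pos + 1))
    return sorted(sinks | recent)
```

For example, with 2 sink tokens and a window of 4, a query at position 10 attends to the sinks plus the last four positions.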

Full Changelog: v0.2.9...v0.2.10

v0.2.9

05 Aug 00:37
9158fef

What's Changed

  • Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
  • fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
  • Made AR output optional + esthetic changes by @nvmbreughe in #1265
  • init add gemm fp8 using cudnn backend by @ttyio in #1264
  • Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
  • CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
  • feat: enable trtllm-gen mla MTP by @yyihuang in #1258
  • Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
  • add trtllm-gen context attention by @IwakuraRein in #1239
  • feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
  • Add missing import in comm/init.py by @joker-eph in #1275
  • hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
  • Unify groupwise fp8 GEMM test by @cyx-6 in #1281
  • fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
  • fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
  • Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
  • Add shuffle matrix flag by @aleozlx in #1272
  • Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
  • patch error handling by @aleozlx in #1293
  • Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
  • refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
  • add mm_fp4 use cudnn backend by @ttyio in #1288
  • fix: minor errors in cubin loader by @yyihuang in #1295
  • perf: use lightweight API to query device property by @azhurkevich in #1298
  • refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
  • Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
  • bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
  • minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
  • [Feature] SM level profiler by @Edenzzzz in #1305
  • Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
  • Update cutlass fp4 moe kernels by @wenscarl in #1294
  • Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
  • test qkvo quantization not equal to 1. by @weireweire in #1314
  • [fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
  • Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
  • minor: update devcontainer by @yyihuang in #1329
  • Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
  • Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
  • add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
  • only add cudnn dependency for x86 platform by @ttyio in #1332
  • Make Fp8 MoE routing_bias optional by @aleozlx in #1319
  • feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
  • [Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
  • Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267
  • Add blockwise-scaled FP8 GEMM via TRTLLM-Gen. by @sergachev in #1320
  • feat: support output nvfp4 in trtllm-gen function call. by @weireweire in #1318
  • Fix bench deepgemm setting by @cyx-6 in #1344
  • fix: fix trtllm-gen mla error on new interface by @yyihuang in #1348
  • [Bugfix] Change max_size for LRU by @elfiegg in #1349
  • Support loading autotuned results from json for cutlass fp4 moe backends by @kaixih in #1310
  • Refactor scripts in benchmarks to use flashinfer.testing.bench_gpu_time by @bkryu in #1337
  • bugfix: Change default index in routingTopKExperts by @amirkl94 in #1347
  • Support passing kv_data_type to MultiLevelCascadeAttentionWrapper.plan() by @sarckk in #1350
  • Add trtllm-gen prefill test. Fix related wrapper issue. by @weireweire in #1346
  • feat: Support logits_soft_cap for Persistent attn; fix kv split limit by @Edenzzzz in #1324
  • chore: remove cpp benchmarks, tests, cmake path, as they are deprecated by @hypdeb in #1345
  • minor: add trtllm_gen_mla benchmark by @yyihuang in #1316
  • cleanup: retire aot-build-utils by @yzh119 in #1354
  • minor: more informative error message for buffer overflow by @Edenzzzz in #1357
  • gen_trtllm_comm_module: fix device capability detection by @dtrifiro in #1356
  • Refactor Fused Moe Module by @wenscarl in #1309
  • Add native cudnn_decode for improved cudnn decode performance by @Anerudhan in #1283
  • Update CI docker container to use latest cudnn by @yzh119 in #1362
  • feature: add fp4 mm using trtllm backend by @ttyio in #1355
  • support trtllm-gen prefill fp4 output by @weireweire in #1360
  • Allow cudnn prefill kernels to be called natively by @Anerudhan in #1317
  • bugfix: fix ci for aot-compile by @yzh119 in #1364
  • feat: auto deduce use_oneshot from token_num in all-reduce by @yyihuang in #1365
  • add cutlass backend for mm_fp4 by @ttyio in #1296
  • Support scale factor start index for fp4 mha prefill/decode by @weireweire in #1363
  • test: add cuda graph to comm test by @yyihuang in #1366
  • ci: add requests to ci docker container by @yzh119 in #1370
  • Artifact downloading and single sourced artifact path by @cyx-6 in #1369
  • [fix] remove (view) transpose to keep consistent with majorness MN requirement. by @elfiegg in #1358
  • hotfix: update mxfp4 groupwise-scaled gemm unittests by @yzh119 in #1359
  • bugfix: fixed cutlass fused moe usage of FP4QuantizationSFLayout::SWIZZLED by @yzh119 in #1371
  • ci: add blackwell unittest scripts by @yzh119 in #1372
  • Update documentation index by @cyx-6 in #1374
  • bugfix: do cudnn related error check only when cudnn backend is enabled. by @ttyio in #1377
  • bugfix: Add guard for fp4/fp8 related include headers by @yzh119 in #1376
  • refactor: download trtllm gemm metadata from server by @ttyio in #1378
  • Fix sphinx error by @cyx-6 in #1380
  • release: bump version to v0.2.9 by @yzh119 in #1381
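Among the changes above, #1365 auto-deduces `use_oneshot` from the token count in all-reduce. The general shape of such a heuristic (the threshold value and names here are hypothetical, not flashinfer's actual tuned logic):

```python
from typing import Optional

# Hypothetical cutoff; real thresholds are tuned per GPU and interconnect.
ONESHOT_MAX_TOKENS = 128

def deduce_use_oneshot(token_num: int, user_choice: Optional[bool] = None) -> bool:
    """Prefer an explicit user setting; otherwise pick one-shot all-reduce
    for small token counts and two-shot for large ones."""
    if user_choice is not None:
        return user_choice
    return token_num <= ONESHOT_MAX_TOKENS
```

One-shot all-reduce trades more redundant data movement for fewer synchronization rounds, which tends to win only at small message sizes, hence a token-count threshold.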

New Contributors

...


v0.2.9rc2

27 Jul 05:18
Pre-release

What's Changed

New Contributors

Full Changelog: v0.2.8...v0.2.9rc2

v0.2.9rc1

23 Jul 08:01
Pre-release

What's Changed

New Contributors

Full Changelog: v0.2.8...v0.2.9rc1

v0.2.8

15 Jul 06:52
3f8317c

What's Changed

New Contributors

Full Changelog: v0.2.7.post1...v0.2.8

v0.2.8rc1

08 Jul 18:30
728e8bb
Pre-release

What's Changed

New Contributors

Full Changelog: v0.2.7.post1...v0.2.8rc1

v0.2.7.post1

01 Jul 18:14
3fb73b3

What's Changed

New Contributors

Full Changelog: v0.2.7...v0.2.7.post1

v0.2.7

30 Jun 19:39
4d3fb6d

What's Changed

New Contributors

Full Changelog: v0.2.6.post1...v0.2.7

v0.2.6.post1

07 Jun 03:24
bc50f1a

What's Changed

  • [CI] Add x86_64 tag for x86 self-hosted runner by @yongwww in #1126
  • hotfix: fix installation script behavior by @yzh119 in #1125

Full Changelog: v0.2.6...v0.2.6.post1