Releases: tile-ai/tilelang
v0.1.5
What's Changed
- [Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
- [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
- [Docker] cu128 Support by @andyluo03 in #410
- [Refactor] Phaseout python dependency attrs and decorator by @LeiWang1999 in #411
- [Language] make linter and type checker happy with mocking by @YouJiacheng in #407
- [Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
- [Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
- Update lower.py to set default value for params by @Alex4210987 in #416
- [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
- [Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
- [Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
- [BugFix] Fix tvm simplify pass by @smallscientist1 in #421
- [Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
- [Language] Support tile operator T.cumsum by @LeiWang1999 in #423
- Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
- [Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
- [Layout] Enhance layout inference pass by @LeiWang1999 in #427
- [Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
- [Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
- [Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
- [Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
- [Language] Support accumulative T.reduce_sum by @LeiWang1999 in #436
- [Bugfix] fix the unexpected keyword error of autotune by @yyttt6 in #438
- [Testing] Add atomic add test by @LeiWang1999 in #439
- [Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
- [Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
- [Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
- [Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
- [Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
- Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
- [Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
- [Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
- [Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
- [Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
- [Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
- [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
- [Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
- [Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
- [Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
- [Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
- [Refactor] Include examples in CI by @LeiWang1999 in #457
- docs: add llvm version info to installation.md. by @AsakusaRinne in #459
- [CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
- [Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
- [Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
- [Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
- [Typo] Rename power_of_int with pow_of_int for consistency by @LeiWang1999 in #468
- [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
- [Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
- [Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
- [CI] Add Convolution example to CI by @xwhzz in #473
- [BugFix] Correct argparse for example_convolution test by @xwhzz in #474
- [Refactor] set USE_LLVM to optional. by @hyx1999 in #476
- [CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
- [Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
- [Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
- [Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
- Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
- [Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
- [Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
- Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
- [Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
- [CI] Add flash_decoding example to CI by @xuchangtolearn in #487
- [CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
- [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
- [Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
- [Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
- [Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
- [Enhancement] Fallback transposed_ldmatrix into SM75_U16x4_LDSM_N when warp_n is 8 by @LeiWang1999 in #498
- [Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
- [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
- [Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
- Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
- [Refactor] Refactor jit to _JitImplementation to support @tilelang.jit by @LeiWang1999 in #502
- [Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
- [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
- [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
- [Refactor] Update buffer handling in layout transformation to support layout on T.view by @LeiWang1999 in #509
- [Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in https://github.com/tile-ai/tilelang...
v0.1.4
What's Changed
- [Bugfix] Support T.clear for let binding by @LeiWang1999 in #268
- [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
- [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
- [Bugfix] Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
- [CI] Add gemm performance test by @xwhzz in #274
- [Language] Introduce T.ptr and T.Tensor by @LeiWang1999 in #276
- [Refactor] Enhance Autotune by @yyttt6 in #266
- [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
- [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
- [Refactor] Deprecated T.Buffer as arguments and rename related calls into T.Tensor by @LeiWang1999 in #281
- [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
- [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
- [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
- [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
- [Doc] Update Python API docs generation by @xwhzz in #278
- [Doc] Remove citation page by @LeiWang1999 in #292
- [Dev] Correcting cxx compiler by @penguin-wwy in #294
- [doc/example] add gemv doc and examples by @botbw in #293
- [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
- [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
- [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
- [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
- [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
- Add autotune to conv example by @yyttt6 in #301
- [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
- [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
- [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
- [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
- [Cache] Implement in-memory cache by @LeiWang1999 in #308
- [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
- [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
- [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
- [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
- [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
- [CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
- [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
- [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
- Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
- [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
- [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
- [AMD] Adapt rocm and support T.gemm with transpose_b=False for amd backend by @LeiWang1999 in #327
- [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
- [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
- [AMD] Support Transpose_A=True and GEMM_RS for hip backend by @LeiWang1999 in #331
- [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
- [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus by @LeiWang1999 in #334
- [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
- [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
- [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
- [Dev] Add Group Cast FP8 Example by @chengyupku in #338
- [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
- [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
- [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
- [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
- [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
- [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
- [Bugfix] Compile/"cached" still not loading cached kernel for example in example_mha_bwd by @Alex4210987 in #339
- [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
- [Typo] Replace kernel.func with kernel in mla benchmark scripts by @LeiWang1999 in #354
- [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
- [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
- [Enhancement] Support pass config disable_warp_specialize to disable auto specialization on hopper by @LeiWang1999 in #357
- [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
- [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
- [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
- [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
- [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
- [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
- [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
- [Bugfix] Adjust Autotuner threadpool max_workers limit to available CPUs by @tth37 in #368
- [Language] Introduce T.any_of and T.all_of to reduce a bool array by @LeiWang1999 in #371
- [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
- [Typo] Remove debug print by @LeiWang1999 in #373
- [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
- [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
- [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
- [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
- [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
- [Enhancement][Pipeline] More precise copy code block detection in pipeline by ...
v0.1.3
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace T.thread_binding with T.get_thread_binding in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
- [Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
- Use SS-GEMM for PV in mla by @YouJiacheng in #165
- [Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
- [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
- [AutoTune] Enable config-performance trace by @LeiWang1999 in #174
- [Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
- [Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
- [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
- [Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
- [Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
- [Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
- [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
- [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
- [Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
- [Bugfix] Fix T.copy for scalar datatypes by @LeiWang1999 in #190
- [Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
- [Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
- [Language] Support clamp in language by @hyx1999 in #192
- [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
- [Feature] Add TMA Store Synchronization Support by @chengyupku in #195
- Update expired example code. by @66RING in #196
- [CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
- [Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
- [Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
- [Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
- [Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
- [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
- [Dev] Add GQA backward example by @chengyupku in #205
- [LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
- [Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
- [Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
- [Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
- [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
- [Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
- [Language] Introduce T.reshape and T.view by @LeiWang1999 in #212
- [Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
- [Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
- [Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
- [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
- [Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
- [Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
- [Doc] Add instructions for installing nightly version by @xwhzz in #226
- [Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
- [Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
- [Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
- [Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
- [Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
- [BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
- [Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
- [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
- [Dev] Implement FlashAttention3 Backward by @chengyupku in #244
- [BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
- [Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
- [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to minBlocksPerMultiprocesor by @cherichy in #248
- [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
- [Examples] Implement elementwise add kernel by @chenghuaWang in #219
- [Refactor] Phaseout LLVM Dependency by Making it Optional by @LeiWang1999 in #247
- [Readme] Update Bib Citation Section by @LeiWang1999 in #249
- [Enhancement] Support float variable as arguments by @LeiWang1999 in #250
- add autotune to example_gemm.py by @yyttt6 in #252
- [Language] Introduce T.alloc_var to define a variable like int var; by @LeiWang1999 in #255
- [Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
- [Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
- [Refactor] Move compilation outside critical section by @YouJiacheng in #260
- [CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
- [Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
- [Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
- [Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
- [Release] Bump version to 0.1.3 by @LeiWang1999 in #264
New Contributors
- @xs-keju made their first contribution in #154
- @YouJiacheng made their first contribution in #165
- @penguin-wwy made their first contribution in #189
- @hyx1999 made their first contribution in #192
- @66RING made their first contribution in https://github.com/tile-ai/tilelang/pull/...
v0.1.2.post1
Why do we need this post-release?
The v0.1.2 prebuilt package used a legacy Cython file, which may lead to some bugs.
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace T.thread_binding with T.get_thread_binding in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
New Contributors
- @xs-keju made their first contribution in #154
Full Changelog: v0.1.2...v0.1.2.post1
v0.1.2
What's Changed
- [Dev] Add MLA and GQA decode examples by @chengyupku in #109
- [Example] Add Split-K and Stream-K Examples and move MLA from fld to mla by @LeiWang1999 in #110
- [Typo] Fix a typo in gemm splitk examples by @LeiWang1999 in #111
- [Typo] Fix links in installation instructions in README.md by @xwhzz in #112
- [Typo] Fix formatting in installation instructions in README.md by @xwhzz in #113
- [Benchmark] Add benchmark scripts for block sparse attention by @LeiWang1999 in #114
- [Dev] Support vectorized value pack and atomicAdd for BFloat16 DType by @LeiWang1999 in #116
- [Bugfix] Bugfix of pass order for hopper by @chengyupku in #117
- [Dev] Update MLA decode kernel by @chengyupku in #120
- [Example] Add GQA Example by @LeiWang1999 in #118
- [Example] Implement TileLang Native Sparse Attention Kernel by @LeiWang1999 in #121
- [Doc] Update README.md with new example links for Flash MLA Decoding and Native Sparse Attention by @chengyupku in #122
- [Example] Update GEMM FP8 Example by @LeiWang1999 in #123
- [Dev] Add RetNet Linear Attention example by @chengyupku in #124
- [JIT] Enhance cython/ctypes wrapper for tma descriptor by @LeiWang1999 in #126
- [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example by @chengyupku in #128
- [Dev] Remove buffer flatten when debug print a shared buffer by @LeiWang1999 in #129
- [Debug] Support T.print for fragment scope by @LeiWang1999 in #130
- [Example] Implement FMHA Varlen Example by @LeiWang1999 in #131
- [Refactor] Set default log level from warning into info by @LeiWang1999 in #132
- [Kernel] Implement different SEQ Q/KV examples with block sparse by @LeiWang1999 in #133
- [Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks by @chengyupku in #134
- [Doc] Update MLA Documentation by @chengyupku in #135
- [Debug] Improve Memory Layout Plot by @LeiWang1999 in #136
- [Doc] Add MLA Decoding Performance Benchmarks and Documentation by @chengyupku in #137
- [Bugfix] Add missing definition for AtomicAdd by @LeiWang1999 in #138
- [Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo by @chengyupku in #139
- [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 by @chengyupku in #141
- [Refactor] Rename gemm fp8 example as we currently lack T.gemm support for fp8 by @LeiWang1999 in #144
- [Enhancement] Support debug print for unsigned char datatype by @LeiWang1999 in #145
- [Enhancement] Enable runtime tensor data type validation by @LeiWang1999 in #146
- [Refactor] Adapt Carver to benchmark by @LeiWang1999 in #148
- [Refactor] Remove BitBLAS Import Check in Benchmark by @SiriusNEO in #150
- [Enhancement] Optimize TileLang install scripts with Dynamic CPU Cores by @LeiWang1999 in #152
- [Carver] Enhance Carver Adaptation for MatMul Benchmarking by @LeiWang1999 in #153
- [Dev][Benchmark] Add MLA paged decoding example and benchmark script by @chengyupku in #158
- [Release] Bump Version to v0.1.2 by @LeiWang1999 in #155
New Contributors
- @SiriusNEO made their first contribution in #150
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- [Doc] Update release news by @LeiWang1999 in #80
- [Doc] Convert docs from rst format to Markdown format. by @xwhzz in #82
- [Bugfix] Bugfix of installing with develop mode by @LeiWang1999 in #81
- [WHL] Support whl building for different python versions via tox by @LeiWang1999 in #83
- [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm by @LeiWang1999 in #85
- [Backend][WebGPU] Support WebGPU WGSL code generation by @LeiWang1999 in #86
- [Wheel] Support pypi build scripts for different python via tox by @LeiWang1999 in #93
- [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency by @LeiWang1999 in #95
- [Bugfix] Update Dockerfile.cu120 by @LeiWang1999 in #98
- [Bugfix] Put InjectPtxAsyncCopy Pass behind ThreadSync Pass by @LeiWang1999 in #97
- [Feature] Add CTypes JIT kernel support by @LeiWang1999 in #100
- [Docker] Add Dockerfiles for multiple CUDA versions by @LeiWang1999 in #103
- [JIT] Support Cython jit and make cython a default execution backend by @LeiWang1999 in #102
- [Refactor] Phase out torch cpp extension backend by @LeiWang1999 in #104
- [Wheel] Provide a bare docker scripts to help build wheels for manylinux by @LeiWang1999 in #105
- [Example] Implement simple block sparse kernel by @LeiWang1999 in #106
- [Release] Bump version to v0.1.1 by @LeiWang1999 in #107
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- [LICENSE] Add LICENSE for flashinfer by @LeiWang1999 in #19
- [Doc] Fix installation scripts and docs for dequantize gemm by @LeiWang1999 in #20
- [Doc] Use sphinx to generate docs. by @xwhzz in #21
- [Doc] update installation.md and readme by @Cunxiao2002 in #22
- [Doc] fix a typo in installation.rst by @Cunxiao2002 in #24
- [Doc] Remove legacy files and update reference by @LeiWang1999 in #25
- [CI][Test] Add test cases for tilelang transform AnnotateDeviceRegions and MakePackedAPI by @LeiWang1999 in #26
- [Doc] Create a workflow to host docs using GitHub Pages. by @xwhzz in #28
- [CI][Test] Add test cases for tilelang transform InjectSoftwarePipeline and FrontendLegalize by @Cunxiao2002 in #30
- [Bugfix] Replace thread binding detector in LayoutInference Pass by @LeiWang1999 in #31
- [CI] Comprehensive Test cases Implementation of Matmul Dequantize by @LeiWang1999 in #32
- [Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. by @xwhzz in #33
- [Refactor] Simplify interface via replacing argument thread binding of intrinsics with KernelFrame.Current by @LeiWang1999 in #34
- [Bugfix] Reorder Passes: Place Vectorize Loop Before StorageFlatten and FlattenBuffer to Prevent Redundant Allocations by @LeiWang1999 in #37
- [Doc] Update documentation structure and content by @LeiWang1999 in #39
- [Doc][CI] Update GitHub Actions workflow for documentation build and deployment. by @xwhzz in #42
- [CI] Allow manual triggering of documentation workflow in addition to… by @xwhzz in #43
- [CI][Test] Add test cases for tilelang transform PipelinePlanning by @Cunxiao2002 in #44
- [CI][Test] Add test cases for tilelang transform LayoutInference and LowerTileOp on loop tail split functionality by @tzj-fxz in #29
- [Debug] Introduce T.print for buffer and variables logging on frontend by @LeiWang1999 in #45
- [CI] Change pull request trigger to pull_request_target for documen… by @xwhzz in #48
- [Dev] Add FlashDecoding example by @chengyupku in #46
- [Doc] update README that tilelang has been used in AttentionEngine by @smallscientist1 in #50
- [Doc] Remove unnecessary layout annotation by @LeiWang1999 in #49
- [CI][Test] Add test cases for tilelang kernel convolution by @chengyupku in #51
- [Dev] Implement test case for tilelang transformations by @LeiWang1999 in #53
- [CI][Test] Add test cases for tilelang kernel FlashAttention by @chengyupku in #54
- [CI][Test] Add test cases for element_add by @Cunxiao2002 in #47
- [CI] Clean up target repository before publishing documentation. by @xwhzz in #55
- [CI][Test] Add test cases for tilelang transform ClusterPlanning by @chengyupku in #57
- [Doc] Append debug relevant testing and documentations by @LeiWang1999 in #58
- [CI][Test] Add test cases for tilelang transform LowerHopperIntrin by @chengyupku in #59
- [Doc] Add matmul kernel tutorial with tile library by @LeiWang1999 in #60
- [Dev] Separate LoopVectorize Pass from upstream tvm by @LeiWang1999 in #62
- [Dev] Support FP8 Codegen for cuda backend by @LeiWang1999 in #64
- [Dev] Add test case for bfloat16 and int4 gemm with mma by @LeiWang1999 in #65
- [CI][Test] Add test cases for tilelang transform InjectFenceProxy by @chengyupku in #66
- [Tools] Introduce plot_layout to visualize the fragment layout by @LeiWang1999 in #68
- [Dev] Remove unnecessary python dependencies by @LeiWang1999 in #69
- [Carver] Introduce a tile-structure based cost model for auto tuning by @LeiWang1999 in #70
- [Bugfix] bug fix for bitblas dependency by @LeiWang1999 in #71
- [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized by @chengyupku in #72
- [CostModel][Carver] Support Hint Recommend for Shared memory Kernel Fusion by @LeiWang1999 in #73
- [Carver] Remove legacy todo items in carver's readme by @LeiWang1999 in #74
- [Dev] Add mha backward example by @chengyupku in #77
- [Release] Bump version into v0.1.0 by @LeiWang1999 in #76
New Contributors
- @xwhzz made their first contribution in #21
- @Cunxiao2002 made their first contribution in #22
- @tzj-fxz made their first contribution in #29
- @chengyupku made their first contribution in #46
- @smallscientist1 made their first contribution in #50
Full Changelog: v0.0.1...v0.1.0
TileLang v0.0.1 Pre-release
Pre-release of v0.0.1. It is still under testing, and only CUDA prebuilt packages are provided.
What's Changed
- [Doc] Update the example figures in README by @LeiWang1999 in #3
- [Doc] Replace SVG Figures with PNG due to some format issues by @LeiWang1999 in #4
- [Dev][Language] Separate Base AST with Sugar Syntax by @LeiWang1999 in #9
- [Dev] Enhance examples on README by @LeiWang1999 in #10
- [Doc] Revert repo link by @LeiWang1999 in #11
- [Dev][jit] Introduce jit for kernel functions by @LeiWang1999 in #12
- Update README.md by @rkinas in #14
- [CI] Remove Code QL workflow by @LeiWang1999 in #16
- [Doc] Add benchmark link in README by @LeiWang1999 in #17
- [Release] Bump Version into 0.0.1 by @LeiWang1999 in #18
New Contributors
- @LeiWang1999 made their first contribution in #3
- @rkinas made their first contribution in #14
Full Changelog: https://github.com/tile-ai/tilelang/commits/v0.0.1