Releases · tile-ai/tilelang

05 Jun 08:32

LeiWang1999

v0.1.5

a32009b

v0.1.5 Latest

What's Changed

[Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
[Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
[Docker] cu128 Support by @andyluo03 in #410
[Refactor] Phaseout python dependency attrs and decorator by @LeiWang1999 in #411
[Language] make linter and type checker happy with mocking by @YouJiacheng in #407
[Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
[Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
Update lower.py to set default value for params by @Alex4210987 in #416
[Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
[Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
[Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
[BugFix] Fix tvm simplify pass by @smallscientist1 in #421
[Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
[Language] Support tile operator T.cumsum by @LeiWang1999 in #423
Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
[Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
[Layout] Enhance layout inference pass by @LeiWang1999 in #427
[Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
[Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
[Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
[Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
[Language] Support accumulative T.reduce_sum by @LeiWang1999 in #436
[Bugfix] fix the unexpected keyword error of autotune by @yyttt6 in #438
[Testing] Add atomic add test by @LeiWang1999 in #439
[Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
[Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
[Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
[Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
[Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
[Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
[Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
[Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
[Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
[Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
[Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
[Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
[Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
[Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
[Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
[Refactor] Include examples in CI by @LeiWang1999 in #457
docs: add llvm version info to installation.md. by @AsakusaRinne in #459
[CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
[Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
[Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
[Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
[Typo] Rename power_of_int with pow_of_int for consistency by @LeiWang1999 in #468
[CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
[Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
[Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
[CI] Add Convolution example to CI by @xwhzz in #473
[BugFix] Correct argparse for example_convolution test by @xwhzz in #474
[Refactor] set USE_LLVM to optional. by @hyx1999 in #476
[CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
[Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
[Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
[Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
[Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
[Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
[Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
[CI] Add flash_decoding example to CI by @xuchangtolearn in #487
[CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
[Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
[Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
[Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
[Enhancement] Fallback transposed_ldmatrix into SM75_U16x4_LDSM_N when warp_n is 8 by @LeiWang1999 in #498
[Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
[Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
[Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
[Refactor] Refactor jit to _JitImplementation to support @tilelang.jit by @LeiWang1999 in #502
[Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
[Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
[Refactor] Update buffer handling in layout transformation to support layout on T.view by @LeiWang1999 in #509
[Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in https://github.com/tile-ai/tilelang...
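
Entries #466 and #468 above add a fast integer power operation, named `pow_of_int` after the rename. The release notes do not describe how it is implemented. As an illustration only, exponentiation by squaring is one common way to compute an integer power in O(log n) multiplications. The function name `pow_of_int_sketch` below is hypothetical and is not TileLang's API.

```python
def pow_of_int_sketch(base: int, exp: int) -> int:
    """Integer power via exponentiation by squaring.

    Hypothetical illustration; not the TileLang implementation.
    """
    if exp < 0:
        raise ValueError("exponent must be non-negative")
    result = 1
    while exp > 0:
        # If the lowest bit of the exponent is set, fold the current base in.
        if exp & 1:
            result *= base
        # Square the base and move on to the next bit of the exponent.
        base *= base
        exp >>= 1
    return result


if __name__ == "__main__":
    print(pow_of_int_sketch(3, 13) == 3 ** 13)
```

The loop runs once per bit of the exponent, so a power such as `x ** 13` needs a handful of multiplications rather than twelve.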

Contributors

FrozenGene, cherichy, and 22 other contributors

18 Apr 09:14

LeiWang1999

v0.1.4

a41a473

v0.1.4

What's Changed

[Bugfix] Support T.clear for let binding by @LeiWang1999 in #268
[Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
[Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
[Bugfix] Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
[CI] Add gemm performance test by @xwhzz in #274
[Language] Introduce T.ptr and T.Tensor by @LeiWang1999 in #276
[Refactor] Enhance Autotune by @yyttt6 in #266
[Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
[Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
[Refactor] Deprecated T.Buffer as arguments and rename related calls into T.Tensor by @LeiWang1999 in #281
[Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
[Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
[Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
[Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
[Doc] Update Python API docs generation by @xwhzz in #278
[Doc] Remove citation page by @LeiWang1999 in #292
[Dev] Correcting cxx compiler by @penguin-wwy in #294
[doc/example] add gemv doc and examples by @botbw in #293
[Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
[Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
[Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
[Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
[Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
Add autotune to conv example by @yyttt6 in #301
[Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
[Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
[Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
[BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
[Cache] Implement in-memory cache by @LeiWang1999 in #308
[Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
[Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
[Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
[Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
[Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
[CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
[Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
[Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
[Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
[Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
[AMD] Adapt rocm and support T.gemm with transpose_b=False for amd backend by @LeiWang1999 in #327
[Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
[Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
[AMD] Support Transpose_A=True and GEMM_RS for hip backend by @LeiWang1999 in #331
[Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
[AMD] Fix for missing composable kernel include path when compile kernels on amd gpus by @LeiWang1999 in #334
[Example] Add sparse gqa decode example by @xiayuqing0622 in #332
[Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
[Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
[Dev] Add Group Cast FP8 Example by @chengyupku in #338
[Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
[Example] Add triton block sparse gqa decode by @YizhaoGao in #341
[Enhancement] Support index bit width configuration by @LeiWang1999 in #343
[Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
[Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
[AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
[Bugfix] Compile/"cached" still not loading cached kernel for example in example_mha_bwd by @Alex4210987 in #339
[Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
[Typo] Replace kernel.func with kernel in mla benchmark scripts by @LeiWang1999 in #354
[AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
[Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
[Enhancement] Support pass config disable_warp_specialize to disable auto specialization on hopper by @LeiWang1999 in #357
[Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
[Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
[Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
[AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
[Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
[AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
[MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
[Bugfix] Adjust Autotuner threadpool max_workers limit to available CPUs by @tth37 in #368
[Language] Introduce T.any_of and T.all_of to reduce a bool array by @LeiWang1999 in #371
[AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
[Typo] Remove debug print by @LeiWang1999 in #373
[Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
[Bugfix] Add filelock for cython build by @LeiWang1999 in #377
[Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
[Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
[Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
[Enhancement][Pipeline] More precise copy code block detection in pipeline by ...

Contributors

zcnrex, penguin-wwy, and 19 other contributors

23 Mar 15:21

LeiWang1999

v0.1.3

f308c8a

v0.1.3

What's Changed

[Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
Add cpu jit with backend ctypes by @xs-keju in #154
[Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
[Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
[Enhancement] Improve CUDA path detection by @xwhzz in #157
[Refactor] Replace T.thread_binding with T.get_thread_binding in examples and test cases by @LeiWang1999 in #163
[Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
[Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
[Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
Use SS-GEMM for PV in mla by @YouJiacheng in #165
[Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
[Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
[AutoTune] Enable config-performance trace by @LeiWang1999 in #174
[Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
[Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
[Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
[Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
[Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
[Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
[Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
[Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
[Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
[Bugfix] Fix T.copy for scalar datatypes by @LeiWang1999 in #190
[Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
[Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
[Language] Support clamp in language by @hyx1999 in #192
[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
[Feature] Add TMA Store Synchronization Support by @chengyupku in #195
Update expired example code. by @66RING in #196
[CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
[Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
[Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
[Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
[Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
[Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
[Dev] Add GQA backward example by @chengyupku in #205
[LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
[Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
[Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
[Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
[Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
[Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
[Language] Introduce T.reshape and T.view by @LeiWang1999 in #212
[Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
[Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
[Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
[Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
[Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
[Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
[Doc] Add instructions for installing nightly version by @xwhzz in #226
[Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
[Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
[Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
[Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
[Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
[BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
[Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
[Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
[Dev] Implement FlashAttention3 Backward by @chengyupku in #244
[BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
[Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
[Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to minBlocksPerMultiprocesor by @cherichy in #248
[Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
[Examples] Implement elementwise add kernel by @chenghuaWang in #219
[Refactor] Phaseout LLVM Dependency by Making it Optional by @LeiWang1999 in #247
[Readme] Update Bib Citation Section by @LeiWang1999 in #249
[Enhancement] Support float variable as arguments by @LeiWang1999 in #250
add autotune to example_gemm.py by @yyttt6 in #252
[Language] Introduce T.alloc_var to define a variable like int var; by @LeiWang1999 in #255
[Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
[Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
[Refactor] Move compilation outside critical section by @YouJiacheng in #260
[CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
[Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
[Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
[Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
[Release] Bump version to 0.1.3 by @LeiWang1999 in #264
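
Entry #210 above replaces `K // block_K` with `T.ceildiv(K, block_K)`. The difference matters whenever `K` is not a multiple of the tile size, because floor division drops the partial tail tile. A minimal plain-Python sketch of the arithmetic follows. The helpers `ceildiv` and `covered` are hypothetical stand-ins written for illustration and are not TileLang's `T.ceildiv`.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive integers (illustrative stand-in)."""
    return (a + b - 1) // b


def covered(K: int, block_K: int, num_tiles: int) -> int:
    """Number of elements along K visited by num_tiles tiles of width block_K."""
    return min(K, num_tiles * block_K)


K, block_K = 100, 32
floor_tiles = K // block_K         # tile count with floor division
ceil_tiles = ceildiv(K, block_K)   # tile count with ceiling division

if __name__ == "__main__":
    # With floor division the last K - covered(...) elements are never visited.
    print(floor_tiles, covered(K, block_K, floor_tiles))
    print(ceil_tiles, covered(K, block_K, ceil_tiles))
```

With ceiling division the final tile is partial, so a loop over it generally needs a bounds check or padding for the out-of-range indices.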

New Contributors

@xs-keju made their first contribution in #154
@YouJiacheng made their first contribution in #165
@penguin-wwy made their first contribution in #189
@hyx1999 made their first contribution in #192
@66RING made their first contribution in https://github.com/tile-ai/tilelang/pull/...

Contributors

oraluben, cherichy, and 13 other contributors

07 Mar 13:18

LeiWang1999

v0.1.2.post1

d8a06c0

v0.1.2.post1

Why do we need this post release?

The v0.1.2 prebuilt package shipped a legacy Cython file, which may lead to bugs.

What's Changed

[Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
Add cpu jit with backend ctypes by @xs-keju in #154
[Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
[Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
[Enhancement] Improve CUDA path detection by @xwhzz in #157
[Refactor] Replace T.thread_binding with T.get_thread_binding in examples and test cases by @LeiWang1999 in #163
[Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
[Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168

New Contributors

@xs-keju made their first contribution in #154

Full Changelog: v0.1.2...v0.1.2.post1

Contributors

LeiWang1999, SiriusNEO, and 2 other contributors

06 Mar 10:02

LeiWang1999

v0.1.2

c8c7dec

v0.1.2

What's Changed

[Dev] Add MLA and GQA decode examples by @chengyupku in #109
[Example] Add Split-K and Stream-K Examples and move MLA from fld to mla by @LeiWang1999 in #110
[Typo] Fix a typo in gemm splitk examples by @LeiWang1999 in #111
[Typo] Fix links in installation instructions in README.md by @xwhzz in #112
[Typo] Fix formatting in installation instructions in README.md by @xwhzz in #113
[Benchmark] Add benchmark scripts for block sparse attention by @LeiWang1999 in #114
[Dev] Support vectorized value pack and atomicAdd for BFloat16 DType by @LeiWang1999 in #116
[Bugfix] Bugfix of pass order for hopper by @chengyupku in #117
[Dev] Update MLA decode kernel by @chengyupku in #120
[Example] Add GQA Example by @LeiWang1999 in #118
[Example] Implement TileLang Native Sparse Attention Kernel by @LeiWang1999 in #121
[Doc] Update README.md with new example links for Flash MLA Decoding and Native Sparse Attention by @chengyupku in #122
[Example] Update GEMM FP8 Example by @LeiWang1999 in #123
[Dev] Add RetNet Linear Attention example by @chengyupku in #124
[JIT] Enhance cython/ctypes wrapper for tma descriptor by @LeiWang1999 in #126
[Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example by @chengyupku in #128
[Dev] Remove buffer flatten when debug print a shared buffer by @LeiWang1999 in #129
[Debug] Support T.print for fragment scope by @LeiWang1999 in #130
[Example] Implement FMHA Varlen Example by @LeiWang1999 in #131
[Refactor] Set default log level from warning to info by @LeiWang1999 in #132
[Kernel] Implement different SEQ Q/KV examples with block sparse by @LeiWang1999 in #133
[Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks by @chengyupku in #134
[Doc] Update MLA Documentation by @chengyupku in #135
[Debug] Improve Memory Layout Plot by @LeiWang1999 in #136
[Doc] Add MLA Decoding Performance Benchmarks and Documentation by @chengyupku in #137
[Bugfix] Add missing definition for AtomicAdd by @LeiWang1999 in #138
[Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo by @chengyupku in #139
[Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 by @chengyupku in #141
[Refactor] Rename gemm fp8 example as we currently lack T.gemm support for fp8 by @LeiWang1999 in #144
[Enhancement] Support debug print for unsigned char datatype by @LeiWang1999 in #145
[Enhancement] Enable runtime tensor data type validation by @LeiWang1999 in #146
[Refactor] Adapt Carver to benchmark by @LeiWang1999 in #148
[Refactor] Remove BitBLAS Import Check in Benchmark by @SiriusNEO in #150
[Enhancement] Optimize TileLang install scripts with Dynamic CPU Cores by @LeiWang1999 in #152
[Carver] Enhance Carver Adaptation for MatMul Benchmarking by @LeiWang1999 in #153
[Dev][Benchmark] Add MLA paged decoding example and benchmark script by @chengyupku in #158
[Release] Bump Version to v0.1.2 by @LeiWang1999 in #155

New Contributors

@SiriusNEO made their first contribution in #150

Full Changelog: v0.1.1...v0.1.2

Contributors

LeiWang1999, SiriusNEO, and 2 other contributors

23 Feb 09:32

LeiWang1999

v0.1.1

59342bb

v0.1.1

What's Changed

[Doc] Update release news by @LeiWang1999 in #80
[Doc] Convert docs from rst format to Markdown format. by @xwhzz in #82
[Bugfix] Bugfix of installing with develop mode by @LeiWang1999 in #81
[WHL] Support whl building for different python versions via tox by @LeiWang1999 in #83
[Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm by @LeiWang1999 in #85
[Backend][WebGPU] Support WebGPU WGSL code generation by @LeiWang1999 in #86
[Wheel] Support pypi build scripts for different python via tox by @LeiWang1999 in #93
[Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency by @LeiWang1999 in #95
[Bugfix] Update Dockerfile.cu120 by @LeiWang1999 in #98
[Bugfix] Put InjectPtxAsyncCopy Pass behind ThreadSync Pass by @LeiWang1999 in #97
[Feature] Add CTypes JIT kernel support by @LeiWang1999 in #100
[Docker] Add Dockerfiles for multiple CUDA versions by @LeiWang1999 in #103
[JIT] Support Cython jit and make cython a default execution backend by @LeiWang1999 in #102
[Refactor] Phase out torch cpp extension backend by @LeiWang1999 in #104
[Wheel] Provide a bare docker scripts to help build wheels for manylinux by @LeiWang1999 in #105
[Example] Implement simple block sparse kernel by @LeiWang1999 in #106
[Release] Bump version to v0.1.1 by @LeiWang1999 in #107

Full Changelog: v0.1.0...v0.1.1

Contributors

LeiWang1999 and xwhzz

12 Feb 06:23

LeiWang1999

v0.1.0

02a2cba

v0.1.0

What's Changed

[LICENSE] Add LICENSE for flashinfer by @LeiWang1999 in #19
[Doc] Fix installation scripts and docs for dequantize gemm by @LeiWang1999 in #20
[Doc] Use sphinx to generate docs. by @xwhzz in #21
[Doc] update installation.md and readme by @Cunxiao2002 in #22
[Doc] fix a typo in installation.rst by @Cunxiao2002 in #24
[Doc] Remove legacy files and update reference by @LeiWang1999 in #25
[CI][Test] Add test cases for tilelang transform AnnotateDeviceRegions and MakePackedAPI by @LeiWang1999 in #26
[Doc] Create a workflow to host docs using GitHub Pages. by @xwhzz in #28
[CI][Test] Add test cases for tilelang transform InjectSoftwarePipeline and FrontendLegalize by @Cunxiao2002 in #30
[Bugfix] Replace thread binding detector in LayoutInference Pass by @LeiWang1999 in #31
[CI] Comprehensive Test cases Implementation of Matmul Dequantize by @LeiWang1999 in #32
[Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. by @xwhzz in #33
[Refactor] Simplify interface via replacing argument thread binding of intrinsics with KernelFrame.Current by @LeiWang1999 in #34
[Bugfix] Reorder Passes: Place Vectorize Loop Before StorageFlatten and FlattenBuffer to Prevent Redundant Allocations by @LeiWang1999 in #37
[Doc] Update documentation structure and content by @LeiWang1999 in #39
[Doc][CI] Update GitHub Actions workflow for documentation build and deployment. by @xwhzz in #42
[CI] Allow manual triggering of documentation workflow in addition to… by @xwhzz in #43
[CI][Test] Add test cases for tilelang transform PipelinePlanning by @Cunxiao2002 in #44
[CI][Test] Add test cases for tilelang transform LayoutInference and LowerTileOp on loop tail split functionality by @tzj-fxz in #29
[Debug] Introduce T.print for buffer and variables logging on frontend by @LeiWang1999 in #45
[CI] Change pull request trigger to pull_request_target for documen… by @xwhzz in #48
[Dev] Add FlashDecoding example by @chengyupku in #46
[Doc] update README that tilelang has been used in AttentionEngine by @smallscientist1 in #50
[Doc] Remove unnecessary layout annotation by @LeiWang1999 in #49
[CI][Test] Add test cases for tilelang kernel convolution by @chengyupku in #51
[Dev] Implement test case for tilelang transformations by @LeiWang1999 in #53
[CI][Test] Add test cases for tilelang kernel FlashAttention by @chengyupku in #54
[CI][Test] Add test cases for element_add by @Cunxiao2002 in #47
[CI] Clean up target repository before publishing documentation. by @xwhzz in #55
[CI][Test] Add test cases for tilelang transform ClusterPlanning by @chengyupku in #57
[Doc] Append debug relevant testing and documentations by @LeiWang1999 in #58
[CI][Test] Add test cases for tilelang transform LowerHopperIntrin by @chengyupku in #59
[Doc] Add matmul kernel tutorial with tile library by @LeiWang1999 in #60
[Dev] Separate LoopVectorize Pass from upstream tvm by @LeiWang1999 in #62
[Dev] Support FP8 Codegen for cuda backend by @LeiWang1999 in #64
[Dev] Add test case for bfloat16 and int4 gemm with mma by @LeiWang1999 in #65
[CI][Test] Add test cases for tilelang transform InjectFenceProxy by @chengyupku in #66
[Tools] Introduce plot_layout to visualize the fragment layout by @LeiWang1999 in #68
[Dev] Remove unnecessary python dependencies by @LeiWang1999 in #69
[Carver] Introduce a tile-structure based cost model for auto tuning by @LeiWang1999 in #70
[Bugfix] bug fix for bitblas dependency by @LeiWang1999 in #71
[CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized by @chengyupku in #72
[CostModel][Carver] Support Hint Recommend for Shared memory Kernel Fusion by @LeiWang1999 in #73
[Carver] Remove legacy todo items in carver's readme by @LeiWang1999 in #74
[Dev] Add mha backward example by @chengyupku in #77
[Release] Bump version into v0.1.0 by @LeiWang1999 in #76

New Contributors

@xwhzz made their first contribution in #21
@Cunxiao2002 made their first contribution in #22
@tzj-fxz made their first contribution in #29
@chengyupku made their first contribution in #46
@smallscientist1 made their first contribution in #50

Full Changelog: v0.0.1...v0.1.0

Contributors

LeiWang1999, chengyupku, and 4 other contributors

20 Jan 12:54

LeiWang1999

v0.0.1

473977b

TileLang v0.0.1 Pre-release

Pre-release for v0.0.1. Still under testing; only CUDA prebuilt packages are provided.

What's Changed

[Doc] Update the example figures in README by @LeiWang1999 in #3
[Doc] Replace SVG Figures with PNG due to some format issues by @LeiWang1999 in #4
[Dev][Language] Separate Base AST with Sugar Syntax by @LeiWang1999 in #9
[Dev] Enhance examples on README by @LeiWang1999 in #10
[Doc] Revert repo link by @LeiWang1999 in #11
[Dev][jit] Introduce jit for kernel functions by @LeiWang1999 in #12
Update README.md by @rkinas in #14
[CI] Remove Code QL workflow by @LeiWang1999 in #16
[Doc] Add benchmark link in README by @LeiWang1999 in #17
[Release] Bump Version into 0.0.1 by @LeiWang1999 in #18

New Contributors

@LeiWang1999 made their first contribution in #3
@rkinas made their first contribution in #14

Full Changelog: https://github.com/tile-ai/tilelang/commits/v0.0.1

Contributors

LeiWang1999 and rkinas
