Background
Hi Liger team,
We've been building TileGym, a collection of kernel tutorials and examples for tile-based GPU programming.
Over the past few weeks, we have implemented cuTile Python versions of most Liger kernels internally and want to release them in TileGym. We wanted to share some early performance numbers and start a conversation about whether there might be an opportunity to integrate a cuTile backend into Liger down the road.
What we've implemented
We have working internal implementations of the following kernels (tested on NVIDIA B200, bfloat16, Llama-3-8B config: hidden_size=4096, intermediate_size=14336):
cross_entropy, dyt, fused_add_rms_norm, fused_linear_cross_entropy, fused_linear_jsd, fused_neighborhood_attention, geglu, group_norm, jsd, kl_div, layer_norm, llama4_rope, poly_norm, qwen2vl_mrope, rms_norm, rope, softmax, sparsemax, swiglu, tiled_geglu, tiled_swiglu, tvd
We benchmarked these cuTile Python kernels internally against their Liger equivalents (NVIDIA B200, 1650 MHz, bfloat16, Llama-3-8B config, triton version: 3.7.0+git7d075612). Some early results are below.
Each benchmark was run with the following command:
python benchmark_${kernel}.py --model llama_3_8b --sweep-mode token_length
For example, jsd:
Full (Fwd + Bwd) Speed + Memory
┌───────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬────────────┬───────────┐
│ B×T   │ Torch Speed │ Liger Speed │ CuTile Speed │ Speedup    │ Liger Mem │ CuTile Mem │ Mem Ratio │
│       │             │             │              │ (vs Liger) │           │            │           │
├───────┼─────────────┼─────────────┼──────────────┼────────────┼───────────┼────────────┼───────────┤
│ 1,024 │ 6.12 ms     │ 6.10 ms     │ 1.39 ms      │ 4.38×      │ 3,012 MB  │ 3,012 MB   │ 1.00×     │
│ 2,048 │ 12.24 ms    │ 10.56 ms    │ 2.57 ms      │ 4.11×      │ 6,012 MB  │ 6,012 MB   │ 1.00×     │
│ 4,096 │ 24.40 ms    │ 21.13 ms    │ 4.95 ms      │ 4.27×      │ 12,024 MB │ 12,024 MB  │ 1.00×     │
│ 8,192 │ 48.89 ms    │ 42.14 ms    │ 9.79 ms      │ 4.30×      │ 24,048 MB │ 24,048 MB  │ 1.00×     │
└───────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴────────────┴───────────┘
The speedup over Liger is consistently above 4× across all token lengths, with identical memory usage. Note that the cuTile kernels use a different BLOCK_SIZE than the Liger kernels.
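For reference, the Speedup column is the Liger time divided by the cuTile time. The sketch below recomputes it from the rounded timings in the jsd table; small differences from the published column (for example in the first row) are presumably because the table was computed from unrounded measurements.

```python
# (BxT, Liger ms, cuTile ms), copied from the jsd table above.
rows = [
    (1024, 6.10, 1.39),
    (2048, 10.56, 2.57),
    (4096, 21.13, 4.95),
    (8192, 42.14, 9.79),
]


def speedup(liger_ms: float, cutile_ms: float) -> float:
    """Speedup of cuTile relative to Liger: how many times faster it runs."""
    return liger_ms / cutile_ms


speedups = {bt: speedup(liger, cutile) for bt, liger, cutile in rows}

for bt, s in speedups.items():
    print(f"BxT={bt:>5}: {s:.2f}x")
```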
Proposed Integration Plan
Step 1: TileGym side (we own this entirely)
Release the cuTile implementations under tilegym/suites/liger/ and publish them to PyPI as part of tilegym. No changes are needed in Liger for this step.
Step 2: Liger side (we'd like to contribute a PR, pending your approval)
Add an optional CuTile backend that activates when tilegym is installed:
liger_kernel/ops/backends/cutile/
    __init__.py      # try: import tilegym; except: pass
    group_norm.py    # thin dispatch wrappers
    jsd.py
    ...
Questions for the team
- Is the pip install tilegym + optional dispatch model acceptable to you? Or would you prefer a different integration shape?
- What are your requirements for adding an optional dependency (testing policy, supported platforms)?
- Happy to start small: would you be open to a draft PR with just jsd as a proof of concept?
Thanks for all the great work on Liger!