[New backend] cuTile backend for Liger kernels: performance data on B200 + integration interest #1205

@xjmxyt

Background

Hi Liger team 👋

We've been building TileGym, a collection of kernel tutorials and examples for tile-based GPU programming.

Over the past few weeks, we have implemented cuTile Python versions of most Liger kernels internally and want to release them in TileGym. We wanted to share some early performance numbers and start a conversation about whether there might be an opportunity to integrate a cuTile backend into Liger down the road.

What we've implemented

We have working internal implementations of the following kernels (tested on NVIDIA B200, bfloat16, Llama-3-8B config with hidden_size=4096, intermediate_size=14336):

cross_entropy, dyt, fused_add_rms_norm, fused_linear_cross_entropy, fused_linear_jsd, fused_neighborhood_attention, geglu, group_norm, jsd, kl_div, layer_norm, llama4_rope, poly_norm, qwen2vl_mrope, rms_norm, rope, softmax, sparsemax, swiglu, tiled_geglu, tiled_swiglu, tvd

Our internal cuTile Python backends for these Liger-equivalent kernels were benchmarked on an NVIDIA B200 (1650 MHz, bfloat16, Llama-3-8B config, Triton version 3.7.0+git7d075612). Below is a sample of the results, which we find very promising.
Each benchmark is run with the following command:

python benchmark_${kernel}.py --model llama_3_8b --sweep-mode token_length
For example, here are the results for jsd:

  Full (Fwd + Bwd) Speed + Memory

  ┌───────┬─────────────┬──────────────┬──────────────┬────────────┬───────────┬────────────┬───────────┐
  │  B×T  │ Torch Speed │ Liger Speed  │ CuTile Speed │  Speedup   │ Liger Mem │ CuTile Mem │ Mem Ratio │
  │       │             │              │              │ (vs Liger) │           │            │           │
  ├───────┼─────────────┼──────────────┼──────────────┼────────────┼───────────┼────────────┼───────────┤
  │ 1,024 │     6.12 ms │      6.10 ms │      1.39 ms │   4.38×    │  3,012 MB │   3,012 MB │   1.00×   │
  │ 2,048 │    12.24 ms │     10.56 ms │      2.57 ms │   4.11×    │  6,012 MB │   6,012 MB │   1.00×   │
  │ 4,096 │    24.40 ms │     21.13 ms │      4.95 ms │   4.27×    │ 12,024 MB │  12,024 MB │   1.00×   │
  │ 8,192 │    48.89 ms │     42.14 ms │      9.79 ms │   4.30×    │ 24,048 MB │  24,048 MB │   1.00×   │
  └───────┴─────────────┴──────────────┴──────────────┴────────────┴───────────┴────────────┴───────────┘

  A consistent speedup of roughly 4.1× to 4.4× over Liger across all token lengths, with identical memory usage. Note that the cuTile kernel uses a different BLOCK_SIZE than the Liger kernel here.
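  For anyone who wants to sanity-check the Speedup column, here is a minimal sketch that recomputes it as Liger time divided by cuTile time, using the rounded timings from the jsd table above. The variable and function names are ours, not part of any benchmark script, and because the inputs are already rounded to two decimals, the last digit may differ slightly from the table.

```python
# (BxT, Liger ms, cuTile ms) copied from the jsd table above.
rows = [
    (1024, 6.10, 1.39),
    (2048, 10.56, 2.57),
    (4096, 21.13, 4.95),
    (8192, 42.14, 9.79),
]


def speedup(liger_ms: float, cutile_ms: float) -> float:
    """Speedup of cuTile over Liger: how many times faster cuTile runs."""
    return liger_ms / cutile_ms


# Map each token count to its recomputed speedup.
speedups = {bt: speedup(liger, cutile) for bt, liger, cutile in rows}

for bt, s in speedups.items():
    print(f"BxT={bt:>5}: {s:.2f}x")
```

  Every ratio stays between 4.1× and 4.4×, which matches the range quoted above.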

Proposed Integration Plan

Step 1: TileGym side (we own this entirely)

Release the cuTile implementations under tilegym/suites/liger/ and publish them to PyPI as part of tilegym. No changes are needed in Liger for this step.

Step 2: Liger side (we'd like to contribute a PR, pending your approval)

Add an optional cuTile backend that activates only when tilegym is installed:

liger_kernel/ops/backends/cutile/
    __init__.py      # try: import tilegym; except: pass
    group_norm.py    # thin dispatch wrappers
    jsd.py
    ...
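
To make the "optional dispatch" idea more concrete, here is a minimal sketch of what such a wrapper could look like. It is an illustration only, not the proposed implementation: `cutile_available`, `register_cutile`, and `dispatch` are hypothetical names, and the only external assumption is that the package is importable as `tilegym`.

```python
import importlib.util
from typing import Callable, Dict, Optional

# Hypothetical registry mapping a kernel name to its cuTile implementation.
_CUTILE_KERNELS: Dict[str, Callable] = {}


def cutile_available() -> bool:
    """Return True if the optional `tilegym` package can be found."""
    return importlib.util.find_spec("tilegym") is not None


def register_cutile(name: str, fn: Callable) -> None:
    """Record a cuTile implementation for the given kernel name."""
    _CUTILE_KERNELS[name] = fn


def dispatch(name: str, default_fn: Callable,
             available: Optional[bool] = None) -> Callable:
    """Pick the cuTile kernel if the backend is usable, else the default.

    `available` overrides the import check, which is handy in tests.
    """
    if available is None:
        available = cutile_available()
    if available and name in _CUTILE_KERNELS:
        return _CUTILE_KERNELS[name]
    return default_fn
```

With this shape, users without tilegym installed keep the existing code path unchanged, and a kernel with no registered cuTile version also falls back to the default.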

Questions for the team

  1. Is the pip install tilegym + optional dispatch model acceptable to you? Or would you prefer a different integration shape?
  2. What are your requirements for adding an optional dependency (testing policy, supported platforms)?
  3. We're happy to start small. Would you be open to a draft PR with just jsd as a proof of concept?

Thanks for all the great work on Liger!
