Commit 444892f
[Dev][jit] Introduce jit for kernel functions (#12)

* instruction update
* replace link with TileLang/tile-lang
* [Dev][Adapter] Implement Torch DLPack Kernel Adapter and related utilities
* lint fix
* Implement JIT Compiler Components
* Documents update
* lint fix
* update logo
* install script fix

1 parent 9c578fa commit 444892f

26 files changed: +1242 −154 lines

README.md

Lines changed: 52 additions & 7 deletions
@@ -1,3 +1,5 @@
+<img src=./images/logo-row.svg />
+
 <div align="center">
 
 # Tile Language
@@ -57,7 +59,7 @@ pip install tilelang
 Alternatively, you can install directly from the GitHub repository:
 
 ```bash
-pip install git+https://github.com/TileLang/tile-lang
+pip install git+https://github.com/tile-ai/tilelang
 ```
 
 Or install locally:
@@ -82,6 +84,9 @@ In this section, you’ll learn how to write and execute a straightforward GEMM
 Below is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.
 
 ```python
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+import tilelang
 import tilelang.language as T
 # `make_mma_swizzle_layout` is a python-defined layout function
 # specifically designed for MMA operations
@@ -91,6 +96,7 @@ from tilelang.intrinsics import (
     make_mma_swizzle_layout as make_swizzle_layout,)
 
 def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
+    # add the @tilelang.jit decorator if you want to return a torch function
     @T.prim_func
     def main(
             A: T.Buffer((M, K), dtype),
@@ -105,13 +111,13 @@ def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="flo
 
             # Apply layout optimizations or define your own layout (Optional)
             # If not specified, we will deduce the layout automatically
-            T.annotate_layout({
-                A_shared: make_swizzle_layout(A_shared),
-                B_shared: make_swizzle_layout(B_shared),
-            })
+            # T.annotate_layout({
+            #     A_shared: make_swizzle_layout(A_shared),
+            #     B_shared: make_swizzle_layout(B_shared),
+            # })
 
             # Enable rasterization for better L2 cache locality (Optional)
-            T.use_swizzle(panel_size=10, enable=True)
+            # T.use_swizzle(panel_size=10, enable=True)
 
             # Clear local accumulation
             T.clear(C_local)
@@ -133,6 +139,45 @@ def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="flo
             T.copy(C_local, C[by * block_M, bx * block_N])
 
     return main
+
+
+# 1. Define the kernel (matmul) and compile/lower it into an executable module
+func = matmul(1024, 1024, 1024, 128, 128, 32)
+
+# 2. Compile the kernel into a torch function
+# out_idx specifies the index of the output buffer in the argument list;
+# if out_idx is specified, the tensor will be created at runtime.
+# target currently can be "cuda", "hip", or "cpu".
+jit_kernel = tilelang.JITKernel(func, out_idx=[2], target="cuda")
+
+# 3. Test the kernel in Python with PyTorch data
+import torch
+
+# Create random input tensors on the GPU
+a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
+b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
+
+# Run the kernel
+c = jit_kernel(a, b)
+
+# Reference multiplication using PyTorch
+ref_c = a @ b
+
+# Validate correctness
+torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
+print("Kernel output matches PyTorch reference.")
+
+# 4. Retrieve and inspect the generated CUDA source (optional)
+cuda_source = jit_kernel.get_kernel_source()
+print("Generated CUDA kernel:\n", cuda_source)
+
+# 5. Profile kernel latency
+profiler = jit_kernel.get_profiler()
+
+latency = profiler.do_bench()
+
+print(f"Latency: {latency} ms")
 ```
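One detail of the snippet above that is easy to miss: because `out_idx=[2]` marks argument index 2 (the `C` buffer) as the output, the compiled kernel is called with only the two inputs and allocates the output for the caller. A small illustration of that calling convention, continuing from the names defined in the snippet (the shape and dtype follow from the buffer declaration):

```python
# Continuation of the example above: jit_kernel, a, and b are already defined.
# C (argument index 2) is created at runtime rather than passed in.
c = jit_kernel(a, b)

# The returned tensor matches main's C buffer declaration: (M, N) fp16 on the GPU.
assert c.shape == (1024, 1024)
assert c.dtype == torch.float16
assert c.device.type == "cuda"
```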
### Dive Deep into TileLang Beyond GEMM
@@ -152,4 +197,4 @@ TileLang has now been used in project [BitBLAS](https://github.com/microsoft/Bit
 
 ## Acknowledgements
 
-We learned a lot from the [TVM](https://github.com/apache/tvm) community and would like to thank them for their contributions.
+We learned a lot from the [TVM](https://github.com/apache/tvm) community and would like to thank them for their contributions. The initial version of this project was contributed mainly by [LeiWang1999](https://github.com/LeiWang1999), [chengyupku](https://github.com/chengyupku) and [nox-410](https://github.com/nox-410). Part of this work was carried out during internships at Microsoft Research, under the supervision of Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang.

docker/Dockerfile.cu120

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ RUN conda install pip cmake && conda clean --all
 
 RUN apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
 
-RUN git clone https://github.com/TileLang/tile-lang.git --recursive -b main TileLang \
+RUN git clone https://github.com/tile-ai/tilelang.git --recursive -b main TileLang \
     && cd TileLang && ./install.sh
 
 CMD bash

docker/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 To ease the process of installing all the dependencies, we provide a Dockerfile and a simple guideline to build a Docker image with all of the above installed. The Docker image is built on top of Ubuntu 20.04 and contains all the dependencies required to run the experiments. We only provide a Dockerfile for NVIDIA GPUs; a Dockerfile for AMD GPUs will be provided upon request.
 
 ```bash
-git clone --recursive https://github.com/TileLang/tile-lang TileLang
+git clone --recursive https://github.com/tile-ai/tilelang TileLang
 cd TileLang/docker
 # build the image, this may take a while (around 10+ minutes on our test machine)
 docker build -t tilelang_cuda -f Dockerfile.cu120 .

docs/Installation.md

Lines changed: 6 additions & 6 deletions
@@ -9,7 +9,7 @@
 
 The easiest way to install TileLang is directly from PyPI using pip. To install the latest version, run the following command in your terminal.
 
-**Note**: Currently, the TileLang wheel is only supported on Ubuntu 20.04 or later, as we build the wheel files on that platform, and we only provide wheels for CUDA>=11.0 and Python>=3.8. **If you are using a different platform or environment, you may need to [build TileLang from source](https://github.com/TileLang/tile-lang/blob/main/docs/Installation.md#building-from-source).**
+**Note**: Currently, the TileLang wheel is only supported on Ubuntu 20.04 or later, as we build the wheel files on that platform, and we only provide wheels for CUDA>=11.0 and Python>=3.8. **If you are using a different platform or environment, you may need to [build TileLang from source](https://github.com/tile-ai/tilelang/blob/main/docs/Installation.md#building-from-source).**
 
 ```bash
 pip install tilelang
@@ -24,7 +24,7 @@ pip install tilelang-0.0.0.dev0+ubuntu.20.4.cu120-py3-none-any.whl
 To install the latest version of TileLang from the GitHub repository, you can run the following command:
 
 ```bash
-pip install git+https://github.com/TileLang/tile-lang.git
+pip install git+https://github.com/tile-ai/tilelang.git
 ```
 
 After installing TileLang, you can verify the installation by running:
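The hunk ends before the repository's own verification snippet, so that command is not reproduced here. As a stand-in, a bare import works as a minimal smoke test (a hypothetical check, not the one from the docs):

```python
# Hypothetical smoke test; the docs' actual verification command is not shown in this hunk.
import tilelang
print("tilelang imported from:", tilelang.__file__)
```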
@@ -56,7 +56,7 @@ sudo apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev
 After installing the prerequisites, you can clone the TileLang repository and install it using pip:
 
 ```bash
-git clone --recursive https://github.com/TileLang/tile-lang.git
+git clone --recursive https://github.com/tile-ai/tilelang.git TileLang
 cd TileLang
 pip install .  # Please be patient, this may take some time.
 ```
@@ -80,7 +80,7 @@ If you already have a compatible TVM installation, follow these steps:
 1. **Clone the Repository:**
 
    ```bash
-   git clone --recursive https://github.com/TileLang/tile-lang
+   git clone --recursive https://github.com/tile-ai/tilelang TileLang
    cd TileLang
    ```
 
@@ -114,7 +114,7 @@ If you prefer to use the built-in TVM version, follow these instructions:
 1. **Clone the Repository:**
 
    ```bash
-   git clone --recursive https://github.com/TileLang/tile-lang
+   git clone --recursive https://github.com/tile-ai/tilelang TileLang
    cd TileLang
    ```
 
@@ -152,7 +152,7 @@ For a simplified installation, use the provided script:
 1. **Clone the Repository:**
 
    ```bash
-   git clone --recursive https://github.com/TileLang/tile-lang
+   git clone --recursive https://github.com/tile-ai/tilelang TileLang
    cd TileLang
    ```

examples/quickstart.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+import tilelang
+import tilelang.language as T
+# `make_mma_swizzle_layout` is a python-defined layout function
+# specifically designed for MMA operations,
+# which ensures consistency with the NVIDIA CUTLASS library
+# to avoid bank conflicts and maximize performance.
+from tilelang.intrinsics import (
+    make_mma_swizzle_layout as make_swizzle_layout,)  # noqa: F401
+
+
+def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
+    # add the @tilelang.jit decorator if you want to return a torch function
+    @T.prim_func
+    def main(
+            A: T.Buffer((M, K), dtype),
+            B: T.Buffer((K, N), dtype),
+            C: T.Buffer((M, N), dtype),
+    ):
+        # Kernel configuration remains similar
+        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
+            A_shared = T.alloc_shared((block_M, block_K), dtype)
+            B_shared = T.alloc_shared((block_K, block_N), dtype)
+            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
+
+            # Apply layout optimizations or define your own layout (Optional)
+            # If not specified, we will deduce the layout automatically
+            # T.annotate_layout({
+            #     A_shared: make_swizzle_layout(A_shared),
+            #     B_shared: make_swizzle_layout(B_shared),
+            # })
+
+            # Enable rasterization for better L2 cache locality (Optional)
+            # T.use_swizzle(panel_size=10, enable=True)
+
+            # Clear local accumulation
+            T.clear(C_local)
+
+            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
+                # Copy a tile of A
+                # This is sugar syntax for a parallelized copy
+                T.copy(A[by * block_M, k * block_K], A_shared)
+
+                # Demonstrate a parallelized copy from global to shared for B
+                for ko, j in T.Parallel(block_K, block_N):
+                    B_shared[ko, j] = B[k * block_K + ko, bx * block_N + j]
+
+                # Perform a tile-level GEMM on the shared buffers
+                # Currently we dispatch to cute/hip on NVIDIA/AMD GPUs
+                T.gemm(A_shared, B_shared, C_local)
+
+            # Copy the result back to global memory
+            T.copy(C_local, C[by * block_M, bx * block_N])
+
+    return main
+
+
+# 1. Define the kernel (matmul) and compile/lower it into an executable module
+func = matmul(1024, 1024, 1024, 128, 128, 32)
+
+# 2. Compile the kernel into a torch function
+# out_idx specifies the index of the output buffer in the argument list;
+# if out_idx is specified, the tensor will be created at runtime.
+# target currently can be "cuda", "hip", or "cpu".
+jit_kernel = tilelang.JITKernel(func, out_idx=[2], target="cuda")
+
+# 3. Test the kernel in Python with PyTorch data
+import torch
+
+# Create random input tensors on the GPU
+a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
+b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
+
+# Run the kernel
+c = jit_kernel(a, b)
+
+# Reference multiplication using PyTorch
+ref_c = a @ b
+
+# Validate correctness
+torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
+print("Kernel output matches PyTorch reference.")
+
+# 4. Retrieve and inspect the generated CUDA source (optional)
+cuda_source = jit_kernel.get_kernel_source()
+print("Generated CUDA kernel:\n", cuda_source)
+
+# 5. Profile kernel latency
+profiler = jit_kernel.get_profiler()
+
+latency = profiler.do_bench()
+
+print(f"Latency: {latency} ms")
