NineToothed is a Triton-based domain-specific language (DSL). By introducing tensor-oriented meta-programming (TOM), it makes writing high-performance GPU kernels easier.
We can use `pip` to install NineToothed:

```bash
pip install ninetoothed
```

After successfully running the above command, `ninetoothed` will be installed. However, to fully utilize its capabilities, you also need to install a deep learning framework supported by NineToothed. For trial purposes, we recommend installing `torch`.
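For example, PyTorch can be installed the same way:

```bash
pip install torch
```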
Thanks to tensor-oriented meta-programming, NineToothed code follows the arrange-and-apply paradigm: you define the arrangement, the application, and the tensors separately, and then integrate them using `ninetoothed.make` to generate the kernel.
Here is the code we need for matrix multiplication:

```python
import ninetoothed
import ninetoothed.language as ntl
from ninetoothed import Tensor, block_size

# Meta-parameters for the tile sizes, determined at compile time.
BLOCK_SIZE_M = block_size()
BLOCK_SIZE_N = block_size()
BLOCK_SIZE_K = block_size()


def arrangement(input, other, output):
    # Each program instance computes one (BLOCK_SIZE_M, BLOCK_SIZE_N) tile of the output.
    output_arranged = output.tile((BLOCK_SIZE_M, BLOCK_SIZE_N))

    # Tile the input along M and K, group the K tiles of each row together,
    # and expand along the column axis so that every output tile sees the
    # whole row of input tiles it needs.
    input_arranged = input.tile((BLOCK_SIZE_M, BLOCK_SIZE_K))
    input_arranged = input_arranged.tile((1, -1))
    input_arranged = input_arranged.expand((-1, output_arranged.shape[1]))
    input_arranged.dtype = input_arranged.dtype.squeeze(0)

    # Symmetrically, tile the other operand along K and N, group the K tiles
    # of each column, and expand along the row axis.
    other_arranged = other.tile((BLOCK_SIZE_K, BLOCK_SIZE_N))
    other_arranged = other_arranged.tile((-1, 1))
    other_arranged = other_arranged.expand((output_arranged.shape[0], -1))
    other_arranged.dtype = other_arranged.dtype.squeeze(1)

    return input_arranged, other_arranged, output_arranged


def application(input, other, output):
    # Accumulate partial products over the K dimension in float32.
    accumulator = ntl.zeros(output.shape, dtype=ntl.float32)
    for k in range(input.shape[0]):
        accumulator += ntl.dot(input[k], other[k])
    output = accumulator


# All three tensors are two-dimensional matrices.
tensors = (Tensor(2), Tensor(2), Tensor(2))
kernel = ninetoothed.make(arrangement, application, tensors)
```
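Once generated, the kernel can be launched like a regular Python function on framework tensors. Below is a minimal sketch, assuming PyTorch is installed and that the generated kernel takes the concrete tensors in the same order as the parameters of `arrangement`; the `matmul` wrapper name is ours:

```python
import torch

def matmul(input, other):
    # Allocate the output matrix; the generated kernel writes its result here.
    output = torch.empty(
        (input.shape[0], other.shape[1]), device=input.device, dtype=input.dtype
    )
    kernel(input, other, output)
    return output

a = torch.randn(512, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 128, device="cuda", dtype=torch.float16)
c = matmul(a, b)  # (512, 128)
```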
This project is distributed under the Apache-2.0 license. See the included LICENSE file for details.