torch_musa Release v1.2.1
Highlights
We are excited to release torch_musa v1.2.1, based on PyTorch v2.0.0. This release adds several basic and important features, including the `torch_musa` profiler, `musa_extension`, `musa_converter`, `codegen`, and `compare_tool`. In addition, we have now adapted more than 600 operators. With these features and operators, `torch_musa` supports a large number of models across various fields, including the recently popular large language models, and the number of supported operators and models is growing rapidly. With `torch_musa`, users can easily accelerate AI applications on Moore Threads graphics cards.
This release reflects the efforts of engineers on the Moore Threads AI team and in other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the rapid iteration of `torch_musa` and Moore Threads graphics cards together.
New Features
torch_musa profiler
We have adapted PyTorch's official performance analysis tool, `torch.profiler`. Users can use the adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It captures information about operators called on the host as well as kernels executed on the GPU device.
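A minimal sketch of profiling a small workload on the MUSA device, assuming the adapted profiler follows the standard `torch.profiler` API and that importing `torch_musa` registers the `musa` device; the activity enum for GPU-side events may differ by version:

```python
import torch
import torch_musa  # assumed to register the "musa" device
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128).to("musa")
x = torch.randn(32, 128, device="musa")

# Collect host-side operator events; device-side kernel activities
# may require a MUSA-specific activity type in your version.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```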
musa_extension
We have implemented the `MUSAExtension` interface, which is consistent with `CUDAExtension`. It can be used to build custom operators for the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that rely on `CUDAExtension` can also be easily ported to the MUSA platform.
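A minimal `setup.py` sketch, assuming `MUSAExtension` mirrors `CUDAExtension`; the import path and the `.mu` kernel-source suffix are assumptions here, so check the torch_musa developer documentation:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension  # torch_musa may ship its own
# Assumed import path for MUSAExtension; verify against your torch_musa version.
from torch_musa.utils.musa_extension import MUSAExtension

setup(
    name="my_musa_op",
    ext_modules=[
        MUSAExtension(
            name="my_musa_op",
            # .mu is assumed to be the MUSA kernel source suffix
            sources=["my_op.cpp", "my_op_kernel.mu"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```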
musa_converter
We have developed a conversion tool named `musa_converter` that translates PyTorch-CUDA-related strings and APIs in PyTorch scripts into `torch_musa`-compatible code, which improves the efficiency of migrating models from the CUDA platform to the MUSA platform. Run `musa_converter -h` to see its usage.
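For illustration, the kind of substitution the tool performs might look as follows; the converted calls are assumptions based on the cuda-to-musa naming convention, not the tool's exact rewrite rules:

```python
import torch
import torch_musa  # assumed to register the "musa" device and torch.musa namespace

# Original CUDA line:  x = torch.randn(4, device="cuda")
x = torch.randn(4, device="musa")

# Original CUDA line:  torch.cuda.synchronize()
torch.musa.synchronize()
```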
codegen
We introduce the `codegen` module to handle the automatic binding and registration of custom MUSA kernels. It extends `torchgen`, follows the format of the `native_functions.yaml` file, and supports different customization strategies, which can significantly reduce developer workload.
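A hypothetical entry in the `native_functions.yaml` format that such a codegen module consumes; the operator schema, dispatch key, and kernel name below are illustrative assumptions, not torch_musa's actual registrations:

```yaml
# Hypothetical custom-op entry in native_functions.yaml format.
# Dispatch key and kernel name are assumptions for illustration.
- func: my_custom_op(Tensor self) -> Tensor
  dispatch:
    PrivateUse1: my_custom_op_musa
```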
compare_tool
This tool enhances the debugging and validation of PyTorch models by comparing tensor operations across devices, tracking module hierarchies, and detecting NaN/Inf values. It is aimed at ensuring the correctness and stability of models through the various stages of development and testing.
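As a rough illustration of what such a cross-device comparison involves (a hand-rolled sketch, not the `compare_tool` API itself):

```python
import torch
import torch_musa  # assumed to register the "musa" device

def compare_op(op, *cpu_args, atol=1e-5, rtol=1e-4):
    """Run op on CPU and on MUSA, then compare results and check for NaN."""
    cpu_out = op(*cpu_args)
    musa_out = op(*(a.to("musa") for a in cpu_args))
    matches = torch.allclose(cpu_out, musa_out.cpu(), atol=atol, rtol=rtol)
    has_nan = torch.isnan(musa_out).any().item()
    return matches, has_nan

matches, has_nan = compare_op(torch.sigmoid, torch.randn(8, 8))
print(f"match={matches}, nan_on_musa={has_nan}")
```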
operator_benchmark
We followed the PyTorch `operator_benchmark` suite and adapted it to `torch_musa`. Developers can use it the same way as in PyTorch. It generates a fully characterized performance profile of an operator, and the results can be compared with those from CUDA or other accelerator backends to continuously improve the performance of `torch_musa`.
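A sketch in the upstream `operator_benchmark` style; running it against the MUSA backend by passing a `"musa"` device string is an assumption based on the adaptation described above:

```python
import operator_benchmark as op_bench
import torch
import torch_musa  # assumed to register the "musa" device

# Benchmark torch.add over a small configuration sweep on CPU and MUSA.
add_configs = op_bench.cross_product_configs(
    M=[64, 256],
    N=[64, 256],
    device=["cpu", "musa"],  # "musa" here is an assumption
    tags=["short"],
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device):
        self.inputs = {
            "input_one": torch.rand(M, N, device=device),
            "input_two": torch.rand(M, N, device=device),
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)

op_bench.generate_pt_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```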
Enhancements
Operator support
1. Support operators including `torch.mode`, `torch.count_nonzero`, `torch.sort(stable=True)`, `torch.upsample2d/3d`, `torch.logical_or/and`, etc.
2. Support more dtypes for `torch.scatter`, `torch.eq`, `torch.histc`, `torch.logsumexp`, etc.
Operator and module optimizations
1. Optimize and accelerate operators such as indexing kernels, embedding kernels, `torch.nonzero`, `torch.unique`, `torch.clamp`, etc.
2. Enable manual seed setting for the dropout layer.
3. Support SDP (scaled dot-product attention) with GQA (grouped-query attention) and causal masks.
4. AMP usage is now aligned with CUDA: `torch.autocast` automatically enables `torch_musa` AMP (see the sketch below).
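A minimal AMP sketch, assuming `torch.autocast` accepts a `"musa"` device type once `torch_musa` is imported, per the alignment described in item 4:

```python
import torch
import torch_musa  # assumed to register the "musa" device

model = torch.nn.Linear(64, 64).to("musa")
x = torch.randn(8, 64, device="musa")

# device_type="musa" is an assumption based on the CUDA-aligned AMP behavior.
with torch.autocast(device_type="musa", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # expected: torch.float16 under autocast
```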
Documentation
We provide developer documentation that describes development environment preparation and the main development steps in detail.
Dockers
We provide both a release Docker image and a development Docker image.