torch_musa Release v1.2.1
Highlights
We are excited to release torch_musa v1.2.1, based on PyTorch v2.0.0. This release adds several basic and important features, including the `torch_musa` profiler, `musa_extension`, `musa_converter`, `codegen`, and `compare_tool`. In addition, we have now adapted more than 600 operators. With these features and operators, `torch_musa` supports a large number of models across various fields, including the recently popular large language models, and the number of supported operators and models is growing rapidly. With `torch_musa`, users can easily accelerate AI applications on Moore Threads graphics cards.
This release reflects the efforts of engineers on the Moore Threads AI team and in other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the rapid iteration of `torch_musa` and Moore Threads graphics cards together.
New Features
torch_musa profiler
We have adapted PyTorch's official performance analysis tool, `torch.profiler`. Users can use the adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It captures information about operators called on the host as well as kernels executed on the GPU device.
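A minimal sketch of profiling a small workload on the MUSA device, assuming the adapted profiler follows the standard `torch.profiler` API and that importing `torch_musa` registers the `musa` device; the activity enum for GPU-side events may differ by version:

```python
import torch
import torch_musa  # assumed to register the "musa" device
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128).to("musa")
x = torch.randn(32, 128, device="musa")

# Collect host-side operator events; device-side kernel activities
# may require a MUSA-specific activity type in your version.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```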
musa_extension
We have implemented the `MUSAExtension` interface, which is consistent with `CUDAExtension`. It can be used to build custom operators for the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that rely on `CUDAExtension` can also be easily ported to the MUSA platform.
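A minimal `setup.py` sketch, assuming `MUSAExtension` mirrors `CUDAExtension`; the import path and the `.mu` kernel-source suffix are assumptions here, so check the torch_musa developer documentation:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension  # torch_musa may ship its own
# Assumed import path for MUSAExtension; verify against your torch_musa version.
from torch_musa.utils.musa_extension import MUSAExtension

setup(
    name="my_musa_op",
    ext_modules=[
        MUSAExtension(
            name="my_musa_op",
            # .mu is assumed to be the MUSA kernel source suffix
            sources=["my_op.cpp", "my_op_kernel.mu"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```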
musa_converter
We have developed a conversion tool named `musa_converter` that translates PyTorch-CUDA-related strings and APIs in PyTorch scripts into `torch_musa`-compatible code, which improves the efficiency of migrating models from the CUDA platform to the MUSA platform. Run `musa_converter -h` to see its usage.
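For illustration, the kind of substitution the tool performs might look as follows; the converted calls are assumptions based on the cuda-to-musa naming convention, not the tool's exact rewrite rules:

```python
import torch
import torch_musa  # assumed to register the "musa" device and torch.musa namespace

# Original CUDA line:  x = torch.randn(4, device="cuda")
x = torch.randn(4, device="musa")

# Original CUDA line:  torch.cuda.synchronize()
torch.musa.synchronize()
```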
codegen
We introduce the `codegen` module to handle the automatic binding and registration of custom MUSA kernels. It extends `torchgen`, follows the format of the `native_functions.yaml` file, and supports different customization strategies, which can significantly reduce developer workload.
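A hypothetical entry in the `native_functions.yaml` format that such a codegen module consumes; the operator schema, dispatch key, and kernel name below are illustrative assumptions, not torch_musa's actual registrations:

```yaml
# Hypothetical custom-op entry in native_functions.yaml format.
# Dispatch key and kernel name are assumptions for illustration.
- func: my_custom_op(Tensor self) -> Tensor
  dispatch:
    PrivateUse1: my_custom_op_musa
```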
compare_tool
This tool enhances the debugging and validation of PyTorch models by comparing tensor operations across devices, tracking module hierarchies, and detecting NaN/Inf values. It is aimed at ensuring the correctness and stability of models through the various stages of development and testing.
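As a rough illustration of what such a cross-device comparison involves (a hand-rolled sketch, not the `compare_tool` API itself):

```python
import torch
import torch_musa  # assumed to register the "musa" device

def compare_op(op, *cpu_args, atol=1e-5, rtol=1e-4):
    """Run op on CPU and on MUSA, then compare results and check for NaN."""
    cpu_out = op(*cpu_args)
    musa_out = op(*(a.to("musa") for a in cpu_args))
    matches = torch.allclose(cpu_out, musa_out.cpu(), atol=atol, rtol=rtol)
    has_nan = torch.isnan(musa_out).any().item()
    return matches, has_nan

matches, has_nan = compare_op(torch.sigmoid, torch.randn(8, 8))
print(f"match={matches}, nan_on_musa={has_nan}")
```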
operator_benchmark
We followed the PyTorch `operator_benchmark` suite and adapted it to `torch_musa`. Developers can use it the same way as in PyTorch. It generates a fully characterized performance profile of an operator, and the results can be compared with those from CUDA or other accelerator backends to continuously improve the performance of `torch_musa`.
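A sketch in the upstream `operator_benchmark` style; running it against the MUSA backend by passing a `"musa"` device string is an assumption based on the adaptation described above:

```python
import operator_benchmark as op_bench
import torch
import torch_musa  # assumed to register the "musa" device

# Benchmark torch.add over a small configuration sweep on CPU and MUSA.
add_configs = op_bench.cross_product_configs(
    M=[64, 256],
    N=[64, 256],
    device=["cpu", "musa"],  # "musa" here is an assumption
    tags=["short"],
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device):
        self.inputs = {
            "input_one": torch.rand(M, N, device=device),
            "input_two": torch.rand(M, N, device=device),
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)

op_bench.generate_pt_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```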
Enhancements
Operator support
1. Support operators including `torch.mode`, `torch.count_nonzero`, `torch.sort(stable=True)`, `torch.upsample2d/3d`, `torch.logical_or/and`, etc.
2. Support more dtypes for `torch.scatter`, `torch.eq`, `torch.histc`, `torch.logsumexp`, etc.
Operator and module optimizations
1. Optimize and accelerate operators such as indexing kernels, embedding kernels, `torch.nonzero`, `torch.unique`, `torch.clamp`, etc.
2. Enable manual seed setting for the dropout layer.
3. Support SDP (scaled dot-product attention) with GQA (grouped-query attention) and causal masks.
4. AMP usage is now aligned with CUDA: `torch.autocast` automatically enables `torch_musa` AMP (see the sketch below).
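A minimal AMP sketch, assuming `torch.autocast` accepts a `"musa"` device type once `torch_musa` is imported, per the alignment described in item 4:

```python
import torch
import torch_musa  # assumed to register the "musa" device

model = torch.nn.Linear(64, 64).to("musa")
x = torch.randn(8, 64, device="musa")

# device_type="musa" is an assumption based on the CUDA-aligned AMP behavior.
with torch.autocast(device_type="musa", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # expected: torch.float16 under autocast
```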
Documentation
We provide developer documentation that describes development environment preparation and the main development steps in detail.
Dockers
We provide both a release Docker image and a development Docker image.