
Releases: MooreThreads/torch_musa

torch_musa Release v1.2.1

07 Aug 09:35
7dcb8b2

Highlights

We are excited to release torch_musa v1.2.1 based on PyTorch v2.0.0. In this release, we support several basic and important features, including the torch_musa profiler, musa_extension, musa_converter, codegen and compare_tool. In addition, we have now adapted more than 600 operators. With these features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is the result of the efforts of engineers in the Moore Threads AI team and other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the rapid iteration of torch_musa and Moore Threads graphics cards together.

New Features

torch_musa profiler

We have adapted PyTorch's official performance analysis tool, torch.profiler. Users can use this adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It can capture information about operators called on the host as well as kernels executed on the GPU device.
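
Below is a minimal usage sketch, assuming the adapted profiler keeps the stock torch.profiler interface; the model, tensor shapes and the "musa" device string are illustrative.

import torch
import torch_musa  # registers the "musa" device
from torch.profiler import profile, record_function

model = torch.nn.Linear(128, 128).to("musa")
x = torch.randn(32, 128, device="musa")

# Leaving `activities` unset lets the backend register its own device activity.
with profile(record_shapes=True) as prof:
    with record_function("forward"):
        y = model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))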

musa_extension

We have implemented the MUSAExtension interface, which is consistent with CUDAExtension. It can be used to build customized operators on the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that rely on CUDAExtension can also be easily ported to the MUSA platform.
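
The sketch below shows a hypothetical setup.py that uses MUSAExtension the same way CUDAExtension is used. The import path, source file names and the .mu kernel suffix are assumptions; please consult torch_musa/utils/README.md for the exact usage.

from setuptools import setup
# NOTE: the import path below is an assumption; see torch_musa/utils/README.md.
from torch_musa.utils.musa_extension import MUSAExtension, BuildExtension

setup(
    name="my_musa_ops",
    ext_modules=[
        MUSAExtension(
            name="my_musa_ops",
            # Illustrative sources; MUSA kernel files are assumed to use the .mu suffix.
            sources=["my_op.cpp", "my_op_kernel.mu"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)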

musa_converter

We have developed a conversion tool named musa_converter that translates PyTorch-CUDA related strings and APIs in PyTorch scripts into torch_musa compatible code, which improves the efficiency of model migration from the CUDA platform to the MUSA platform. Users can run musa_converter -h to see its usage.

codegen

We introduce the codegen module to handle automatic binding and registration of customized MUSA kernels. It extends torchgen, follows the format of the native_functions.yaml file, and supports different customization strategies, which can significantly reduce developer workload.

compare_tool

This tool is designed to enhance the debugging and validation process of PyTorch models by offering capabilities for comparing tensor operations across devices, tracking module hierarchies, and detecting the presence of NaN/Inf values. It is aimed at ensuring the correctness and stability of models through various stages of development and testing.

operator_benchmark

We followed the PyTorch operator_benchmark suite and adapted it to torch_musa. Developers can use it in the same way as in PyTorch. It helps developers generate a fully characterized performance profile of an operator, and the results can be compared with those from CUDA or other accelerator backends to continuously improve the performance of torch_musa.
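
As a hedged sketch, the snippet below defines a benchmark in the upstream operator_benchmark style and only swaps the device string to "musa"; it assumes the adapted suite keeps the upstream API, and the benchmark name and shapes are illustrative.

import operator_benchmark as op_bench
import torch
import torch_musa  # assumed import to register the "musa" device

# Configurations are illustrative; only device=["musa"] differs from a CUDA run.
add_configs = op_bench.cross_product_configs(
    M=[64, 256], N=[64, 256], device=["musa"], tags=["short"]
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device):
        self.inputs = {
            "a": torch.rand(M, N, device=device),
            "b": torch.rand(M, N, device=device),
        }

    def forward(self, a, b):
        return torch.add(a, b)

op_bench.generate_pt_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()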

Enhancements

Operator support

1. Support operators including torch.mode, torch.count_nonzero, torch.sort(stable=True), torch.upsample2d/3d, torch.logical_or/and, etc.

2. Support more dtypes for torch.scatter, torch.eq, torch.histc, torch.logsumexp, etc.

Operator and module optimizations

1. Optimize and accelerate operators such as indexing kernels, embedding kernels, torch.nonzero, torch.unique, torch.clamp, etc.

2. Enable manual seed setting for the dropout layer.

3. Support SDPA (scaled dot-product attention) with GQA (grouped-query attention) and causal mask.

4. AMP usage is now aligned with CUDA: torch.autocast automatically enables torch_musa AMP (see the sketch after this list).
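
For illustration, a minimal sketch of the aligned usage is shown below; it assumes torch.autocast accepts device_type="musa" and dispatches to torch_musa's AMP.

import torch
import torch_musa  # registers the "musa" device

x = torch.randn(8, 16, device="musa")
linear = torch.nn.Linear(16, 4).to("musa")

# Assumed to behave like the CUDA path: ops inside the context run in float16.
with torch.autocast(device_type="musa", dtype=torch.float16):
    y = linear(x)
print(y.dtype)  # expected: torch.float16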

Documentation

We provide developer documentation that describes development environment preparation and the main development steps in detail.

Dockers

We provide a release Docker image and a development Docker image.

torch_musa Release v1.1.0

14 Mar 04:52
1a14c97

torch_musa Release Notes

  • Highlights
  • New Features
    • AMP mixed precision training
    • MUSAExtension
    • Pinned memory
    • TensorCore computation
    • CompareTool [Experimental]
  • Supported Operators
  • Documentation
  • Dockers

Highlights

We are excited to release torch_musa v1.1.0 based on PyTorch v2.0.0. In this release, we support more important features, including AMP mixed precision training, MUSAExtension, pinned memory, TensorCore computation and CompareTool. In addition, we have adapted more than 470 operators, improved the DDP module and implemented more quantization operators. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is the result of the efforts of engineers in the Moore Threads AI team and other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the rapid iteration of torch_musa and Moore Threads graphics cards together.

New Features

AMP mixed precision training

We now support mixed precision training with BF16 and FP16. Note that the S80 and S3000 only support FP16, while the S4000 supports both FP16 and BF16; the interface is fully consistent with PyTorch. Users can use AMP as in the following code:

import torch
import torch_musa  # registers the "musa" device
from torch import nn

DEVICE = "musa"

# Minimal stand-ins so the snippet is self-contained.
def set_seed(seed=42):
    torch.manual_seed(seed)

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 3)

    def forward(self, x):
        return self.fc(x)

# low_dtype can be torch.float16 or torch.bfloat16
def train_in_amp(low_dtype=torch.float16):
    set_seed()
    model = SimpleModel().to(DEVICE)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # create the gradient scaler
    scaler = torch.musa.amp.GradScaler()

    inputs = torch.randn(6, 5).to(DEVICE)   # move the data to the GPU
    targets = torch.randn(6, 3).to(DEVICE)
    for step in range(20):
        optimizer.zero_grad()
        # enter the autocast context
        with torch.musa.amp.autocast(dtype=low_dtype):
            outputs = model(inputs)
            assert outputs.dtype == low_dtype
            loss = criterion(outputs, targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    return loss

MUSAExtension

MUSAExtension is basically the same as CUDAExtension, except that MUSAExtension currently requires manually adding a dynamic library to the dynamic library search path. For detailed usage, please refer to torch_musa/torch_musa/utils/README.md and the developer documentation. This limitation will be resolved in the next version.

Pinned memory

Pinned memory is now supported by torch_musa; the following code shows how to use it.

shape = (1024, 1024)  # any tensor shape
cpu_tensor = torch.rand(shape, dtype=torch.float32).pin_memory("musa")
gpu_tensor = cpu_tensor.to("musa", non_blocking=True)

TensorCore computation

The S4000 has Tensor Cores and therefore supports TF32 computation. Users can enable TF32 acceleration with the following code:

with torch.backends.mudnn.flags(allow_tf32=True):
    ...  # your training code runs here with TF32 enabled

CompareTool [Experimental]

CompareTool is an experimental tool aimed at automatically comparing computation results between MUSA and CPU, thereby facilitating the debugging process. For detailed usage, please refer to torch_musa/utils/README.md.

Supported Operators

More than 470 operators are supported in torch_musa.

Documentation

We provide a developer guide that describes development environment preparation and the main development steps in detail.

Dockers

A release Docker image and a development Docker image are now available.

[NOTE]: If you want to compile torch_musa without using the provided Docker image, please download the rc2.0.0 Intel CPU_Ubuntu underlying software stack from https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version=

[NOTE]: When installing the released whl package, please remove the device name. For example:

pip install torch-2.0.0-cp310-cp310-linux_x86_64.whl

torch_musa Release v1.0.0

11 Jul 03:42
a2cf2f5

torch_musa Release Notes

  • Highlights
  • New Features
    • CUDA Kernels Porting
    • Caching Allocator
    • Device Management
    • Distributed Data Parallel Training [Experimental]
    • FP16 Inference [Experimental]
  • Supported Operators
  • Supported Models
  • Documentation
  • Dockers

Highlights

We are excited to release torch_musa v1.0.0 based on PyTorch v2.0.0. In this release, we support some basic and important features, including CUDA kernels porting, device management, a caching memory allocator, distributed data parallel training (experimental) and FP16 inference (experimental). In addition, we have adapted more than 300 operators. With these basic features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is the result of the efforts of engineers in the Moore Threads AI team and other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the rapid iteration of torch_musa and Moore Threads graphics cards together.

New Features

CUDA Kernels Porting

Thanks to the CUDA-compatible capabilities of our MUSA software stack, torch_musa can easily support CUDA-compatible modules. This effectively enables developers to reuse CUDA kernels with a small amount of effort, which greatly speeds up operator adaptation.

Caching Allocator

The amount of required memory changes constantly during program execution. Frequent invocations of memory allocation and deallocation (through musaMalloc and musaFree) usually lead to high execution cost. To alleviate this issue, we implemented a caching allocator that requests memory blocks from MUSA and strategically splits and reuses these blocks without returning them to MUSA, which results in a significant performance gain.
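
For inspection of the allocator, the sketch below assumes torch_musa mirrors the torch.cuda memory-statistics interface; the torch.musa.* names are assumptions and may differ from the actual API.

import torch
import torch_musa  # assumed import to register the MUSA backend

x = torch.randn(4096, 4096, device="musa")
del x
# The tensor is freed, but the block is expected to stay cached by the allocator.
print("allocated:", torch.musa.memory_allocated())  # assumed torch.cuda-like API
print("reserved: ", torch.musa.memory_reserved())
torch.musa.empty_cache()  # return cached blocks to the MUSA driver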

Device Management

In order to manage devices, three components are implemented in torch_musa: device streams, device events and device generators. Device streams are used to manage and synchronize launched kernels. A device event is an important component related to streams; it records a specific point in the execution of a stream. Device generators are used to generate random numbers. Devices are initialized lazily, which improves startup time, especially on multi-GPU systems.
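
A hedged sketch of stream and event usage follows, assuming torch_musa mirrors the torch.cuda stream/event interface; the torch.musa.* names are assumptions.

import torch
import torch_musa  # assumed import to register the MUSA backend

s = torch.musa.Stream()                       # assumed torch.cuda-like Stream
start = torch.musa.Event(enable_timing=True)  # assumed torch.cuda-like Event
end = torch.musa.Event(enable_timing=True)

x = torch.randn(1024, 1024, device="musa")
with torch.musa.stream(s):                    # launch work on a non-default stream
    start.record()
    y = x @ x
    end.record()

torch.musa.synchronize()                      # wait for all launched kernels
print("elapsed ms:", start.elapsed_time(end))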

Distributed Data Parallel Training [Experimental]

As the number of model parameters increases, especially for the large language models, distributed data parallel training becomes increasingly important. torch_musa has already started supporting distributed data parallel training. Some important communication primitives are already supported, including send, recv, broadcast, all_reduce, reduce, all_gather, gather, scatter, reduce_scatter and barrier. The interface torch.nn.parallel.DistributedDataParallel is also supported. This module is under rapid development.
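
The sketch below shows a typical DDP training step on MUSA. The backend name "mccl" and the torch.musa.set_device call are assumptions (analogous to NCCL and torch.cuda.set_device on CUDA); check the torch_musa documentation for the exact names.

import torch
import torch.distributed as dist
import torch_musa  # assumed import to register the MUSA backend
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Backend name is an assumption, analogous to NCCL on CUDA.
    dist.init_process_group(backend="mccl")
    rank = dist.get_rank()
    torch.musa.set_device(rank)   # assumed to mirror torch.cuda.set_device

    model = torch.nn.Linear(16, 4).to("musa")
    ddp_model = DDP(model, device_ids=[rank])

    x = torch.randn(8, 16, device="musa")
    loss = ddp_model(x).sum()
    loss.backward()               # gradients are all-reduced across ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Such a script would typically be launched with torchrun, one process per GPU.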

FP16 Inference [Experimental]

To speed up model inference, we currently support a set of FP16 operators, including linear, matmul, unary ops, binary ops, layernorm and most ported kernels. With this set of operators, we are able to run FP16 inference on a number of models. Please note that this feature is still experimental and model support may be limited.
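
A minimal FP16 inference sketch is shown below; the model is illustrative and built only from operator types listed above (linear, layernorm).

import torch
import torch_musa  # registers the "musa" device

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.LayerNorm(256),
    torch.nn.Linear(256, 10),
).to("musa").half().eval()

with torch.no_grad():
    x = torch.randn(4, 128, device="musa", dtype=torch.float16)
    logits = model(x)
print(logits.dtype)  # torch.float16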

Supported Operators

More than 300 operators are supported in torch_musa.

Supported Models

Many classic and popular models are already supported, including Stable Diffusion, ChatGLM, Conformer, Bert, YOLOV5, ResNet50, Swin-Transformer, MobileNetv3, EfficientNet, HRNet, TSM, FastSpeech2, UNet, T5, HifiGan, Real-EsrGan, OpenPose, many GPT variants and so on.

Documentation

We provide a developer guide that describes development environment preparation and the main development steps in detail.

Dockers

A release Docker image and a development Docker image are now available.

[NOTE]: If you want to compile torch_musa without using the provided Docker image, please contact us by email at developers@mthreads.com to obtain the necessary dependencies.