update readme (#19)
baoleai committed Sep 12, 2024
1 parent e9a7cf8 commit a4f7db9
Showing 6 changed files with 102 additions and 15 deletions.
75 changes: 61 additions & 14 deletions README.md
@@ -1,37 +1,84 @@
[![docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://torchacc.readthedocs.io/en/latest/)
[![CI](https://github.com/alibabapai/torchacc/actions/workflows/unit_test.yml/badge.svg)](https://github.com/alibabapai/torchacc/actions)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/alibabapai/torchacc/blob/main/LICENSE)

# TorchAcc

**TorchAcc** is an AI training acceleration framework developed by Alibaba Cloud's PAI.

TorchAcc is built on [PyTorch/XLA](https://github.com/pytorch/xla) and provides an easy-to-use interface for accelerating the training of PyTorch models. It also implements extensive GPU-specific optimizations for distributed training, memory management, and computation, achieving better ease of use, higher GPU training performance, and greater scalability for distributed training.

## Highlighted Features

* Rich distributed Parallelism
    * Data Parallelism
    * Fully Sharded Data Parallelism
    * Tensor Parallelism
    * Pipeline Parallelism
    * Context Parallelism
        * [Ulysses](https://arxiv.org/abs/2309.14509)
        * [Ring Attention](https://arxiv.org/abs/2310.01889)
        * FlashSequence (2D Sequence Parallelism)
* Low Memory Cost
* High Performance
* Easy-to-use API

You can accelerate your transformer models with just a few lines of code using TorchAcc.

<p align="center">
<img width="80%" src=docs/figures/api.gif />
</p>
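As a flavor of the API, here is a minimal, hypothetical sketch; the animation above and the [documentation](https://torchacc.readthedocs.io/en/latest/) show the authoritative interface, and the `accelerate()` entry point and its signature are assumptions here:

```python
# Hypothetical sketch only: torchacc.accelerate() is assumed to wrap a
# torch.nn.Module for XLA-compiled, distributed execution. Consult the docs
# for the real entry point and its options.
import torch
import torchacc

model = torch.nn.Linear(1024, 1024)   # stand-in for a real transformer model
model = torchacc.accelerate(model)    # assumed one-line acceleration step

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(3):
    # keep inputs on the same device as the (possibly XLA) model
    x = torch.randn(8, 1024).to(next(model.parameters()).device)
    loss = model(x).pow(2).mean()     # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```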


## Architecture Overview
The main goal of TorchAcc is to provide a high-performance AI training framework.
It uses IR abstractions at multiple layers, combining static graph compilation optimization (e.g., XLA), dynamic graph compilation optimization (e.g., BladeDISC), and distributed optimization techniques to deliver comprehensive end-to-end optimization, from low-level operators to upper-level models.


<p align="center">
<img width="80%" src=docs/figures/arch.png />
</p>


## Installation

### Docker
```
sudo docker run --gpus all --net host --ipc host --shm-size 10G -it --rm --cap-add=SYS_PTRACE registry.cn-hangzhou.aliyuncs.com/pai-dlc/acc:r2.3.0-cuda12.1.0-py3.10 bash
```

### Build from source

See the [contribution guide](docs/source/contributing.md).


## LLM training examples

### Getting Started

We present a straightforward example of training a Transformer model with TorchAcc to illustrate the usage of the TorchAcc API. You can quickly start training by executing the following command:
``` shell
torchrun --nproc_per_node=4 benchmarks/transformer.py --bf16 --acc --disable_loss_print --fsdp_size=4 --gc
```

### Utilizing HuggingFace Transformers

If you are familiar with HuggingFace Transformers' Trainer, you can easily accelerate a Transformer model with TorchAcc; see the [HuggingFace Transformers tutorial](docs/source/tutorials/hf_transformers.md).
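As a rough flavor of the idea (a hypothetical sketch, not the documented integration; the `torchacc.accelerate()` wrapping step and its placement are assumptions, so follow the linked tutorial for the actual steps), the goal is to keep the familiar Trainer workflow unchanged:

```python
# Hypothetical sketch; see docs/source/tutorials/hf_transformers.md for the
# real integration. torchacc.accelerate() is assumed to wrap the model before
# it is handed to the standard HuggingFace Trainer.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
import torchacc

class ToyDataset(Dataset):
    """Tiny stand-in dataset of random token ids, for the sketch only."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (16,))
        return {"input_ids": ids, "labels": ids.clone()}

model = AutoModelForCausalLM.from_pretrained("gpt2")
model = torchacc.accelerate(model)   # assumed wrapping step

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", bf16=True, num_train_epochs=1),
    train_dataset=ToyDataset(),
)
trainer.train()
```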

### LLM training acceleration with FlashModels

If you want to try the latest features of TorchAcc or use the TorchAcc interface more flexibly for model acceleration, you can use our LLM acceleration library, FlashModels. It integrates distributed implementations of commonly used open-source LLMs and provides a wealth of examples:

https://github.com/AlibabaPAI/FlashModels

### SFT using modelscope/swift
Coming soon.

## Contributing
See the [contribution guide](docs/source/contributing.md).


## License
[Apache License 2.0](LICENSE)
Binary file added docs/figures/api.gif
Binary file added docs/figures/arch.png
34 changes: 34 additions & 0 deletions docs/source/contributing.md
@@ -0,0 +1,34 @@
# Contributing to TorchAcc

TorchAcc is built on top of PyTorch/XLA and requires a specific version of PyTorch/XLA to ensure GPU compatibility and performance.
We highly recommend using our prebuilt Docker image to start your development work.

## Building from source
To build TorchAcc from source, you first need to build PyTorch and torch_xla from source.

1. Build PyTorch
```shell
git clone --recursive -b v2.3.0 git@github.com:AlibabaPAI/pytorch.git
cd pytorch
python setup.py develop
```


2. Build torch_xla
```shell
git clone --recursive -b acc git@github.com:AlibabaPAI/xla.git
cd xla
# USE_CUDA/XLA_CUDA build torch_xla with GPU (CUDA) support
USE_CUDA=1 XLA_CUDA=1 python setup.py develop
```

3. Build TorchAcc
```shell
# run from the root of the TorchAcc repository
python setup.py develop
```

4. Run the unit tests
```shell
sh tests/run_ut.sh
```
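After step 4, a quick smoke test like the following can confirm that everything imports and executes. This is our suggestion rather than part of the official steps, and it assumes torch_xla was built with CUDA support as above:

```python
# Optional smoke test for a source build.
import torch
import torch_xla.core.xla_model as xm
import torchacc  # noqa: F401 -- verifies that torchacc imports cleanly

device = xm.xla_device()              # the XLA (GPU) device
x = torch.randn(4, 4, device=device)
print((x @ x).sum())                  # forces compilation and execution
```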
6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -47,6 +47,12 @@ Welcome to PAI-TorchAcc's documentation!

apis/modules

.. toctree::
:maxdepth: 2
:caption: CONTRIBUTING

contributing

Indices and tables
==================

2 changes: 1 addition & 1 deletion docs/source/install.md
@@ -5,7 +5,7 @@
It is recommended to use the existing release image directly. The image address is:

```bash
registry.<region>.aliyuncs.com/pai-dlc/acc:r2.3.0-cuda12.1.0-py3.10
```

Replace `<region>` with one of the following as needed:
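For example, with `<region>` set to `cn-hangzhou` (the region used in the README's Docker command), the image address becomes:

```bash
registry.cn-hangzhou.aliyuncs.com/pai-dlc/acc:r2.3.0-cuda12.1.0-py3.10
```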
