Merge pull request intelligent-machine-learning#328 from workingloong…
…/torchrun-tutorial

Tutorial to introduce how to submit a torch job.
workingloong authored Mar 21, 2023
2 parents a872f36 + 77194fa commit 641a5ad
Showing 2 changed files with 83 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
Expand Up @@ -95,3 +95,5 @@ By practice, DLRover is an ideal component to build an end-to-end industrial onl
## Quick Start

[Train a TensorFlow Estimator on Aliyun ACK](docs/tutorial/ps_on_cloud.md)

[Train a PyTorch Model on Aliyun ACK](docs/tutorial/torch_allreduce_on_cloud.md)
81 changes: 81 additions & 0 deletions docs/tutorial/torch_allreduce_on_cloud.md
@@ -0,0 +1,81 @@
# AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job with torchrun
on a public cloud, namely Alibaba Cloud Container Service for Kubernetes (ACK).

## Prerequisites

- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671).
- Configure cluster credentials on your local computer.
- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh) file system and mount it to the cluster.

## Deploy the ElasticJob CRD on ACK

1. Deploy the controller on the cluster.

```bash
make deploy IMG=easydl/elasticjob-controller:v0.1.1
```

2. Grant the DLRover master permission to access CRDs.

```bash
kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml
```

## Submit a Job

- Submit a job to train a CNN model on the MNIST dataset.

```bash
kubectl -n dlrover apply -f dlrover/examples/torch_mnist_job.yaml
```

- Check the job status.

```bash
kubectl -n dlrover get elasticjob torch-mnist
```

```bash
NAME          PHASE     AGE
torch-mnist   Running   19h
```
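
When scripting this check, it can help to pull only the `PHASE` value out of the table. The helper below is a minimal sketch, not part of DLRover: `job_phase` is a hypothetical name, and it assumes the column layout shown above (`NAME PHASE AGE`).

```bash
# Hypothetical helper: print the PHASE column for one job.
# Reads the table printed by `kubectl get elasticjob` on stdin.
job_phase() {
  # $1 is the job name to look up; the header row never matches it.
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl -n dlrover get elasticjob torch-mnist | job_phase torch-mnist
```

Parsing the printed table keeps the sketch independent of the ElasticJob status schema, which this tutorial does not document.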

- Check the Pod status.

```bash
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
```

```bash
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
```
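
To confirm the number of live workers without reading the table by eye, the listing can be piped through a small filter. This is a sketch under assumptions: `count_running_workers` is a hypothetical helper, and it relies on worker Pod names ending in `-worker-<index>` and on `STATUS` being the third column, as in the output above.

```bash
# Hypothetical helper: count worker Pods whose STATUS is Running.
# Reads the table printed by `kubectl get pods` on stdin.
count_running_workers() {
  # The master Pod and the header row do not match the name pattern.
  awk '$1 ~ /-worker-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (assumes the elasticjob-name label holds the job name):
#   kubectl -n dlrover get pods -l elasticjob-name=torch-mnist | count_running_workers
```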

## Test Fault-tolerance

- Delete a worker.

```bash
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1
```

Then, we can see there is only one worker.

```bash
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s
```

After a while, DLRover restores the deleted worker.

```bash
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
```
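
The recovery check above can be automated by polling the Pod listing until the expected number of workers is running again. The function below is a minimal sketch rather than a DLRover tool: `wait_for_workers` is a hypothetical name, the attempt count and interval are arbitrary, and it assumes worker Pod names end in `-worker-<index>` with `STATUS` in the third column.

```bash
# Hypothetical helper: poll a Pod listing until enough workers are Running.
# Arguments: expected worker count, max attempts, seconds between attempts,
# followed by the command that prints the Pod table.
wait_for_workers() {
  expected="$1"
  attempts="$2"
  interval="$3"
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Count rows that look like workers and report STATUS Running.
    running="$("$@" | awk '$1 ~ /-worker-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }')"
    if [ "$running" -ge "$expected" ]; then
      echo "recovered: $running workers running after $i checks"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out: only $running of $expected workers running" >&2
  return 1
}

# Usage against a live cluster (assumes the elasticjob-name label holds the job name):
#   wait_for_workers 2 30 5 kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
```

Passing the listing command as arguments keeps the loop independent of kubectl, so it can be exercised with canned output before pointing it at a cluster.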
