Commit 90dbd6e: Fine tune readme. (#111)

* change readme test=develop
* add test=develop

gongweibao authored Jun 4, 2020 (1 parent 00c717a)
Showing 1 changed file: README.md (45 additions and 44 deletions)

<img src="https://github.com/elasticdeeplearning/artwork/blob/master/horizontal/color/edl-horizontal-color.png" width="500" style="display:inline;vertical-align:middle;padding:2%">

# Motivation
Elastic Deep Learning (EDL) is a framework that can dynamically adjust the parallelism (the number of training workers) of deep neural network training. It supports multi-tenant cluster management that balances job completion time against job waiting time and maximizes the use of otherwise idle resources.
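As a toy illustration of the balancing idea (this is not EDL's actual scheduler; the `rebalance` policy and the job names below are invented for the sketch), a fixed pool of GPUs can be re-split whenever a job arrives or finishes, so no device sits idle:

```python
# Toy illustration only: evenly redistribute a fixed GPU pool across the
# currently running jobs. NOT EDL's real scheduler.
def rebalance(total_gpus, jobs):
    """Evenly split total_gpus across jobs; earlier jobs get the remainder."""
    n = len(jobs)
    if n == 0:
        return {}
    base, extra = divmod(total_gpus, n)
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(jobs)}

jobs = ["resnet50", "bert"]
alloc = rebalance(8, jobs)   # {'resnet50': 4, 'bert': 4}
jobs.append("mnist")         # a new tenant arrives ...
alloc = rebalance(8, jobs)   # ... and every job is resized: {'resnet50': 3, 'bert': 3, 'mnist': 2}
```

The point of the sketch is only that resizing running jobs, rather than queueing the newcomer, keeps all 8 GPUs busy.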

This project contains the EDL framework and its applications, such as distillation and NAS.

EDL is now an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

<img src="https://github.com/lfai/artwork/blob/master/lfai-project-badge/incubation/color/lfai-projectlogos_incubation-color.png" width="200" style="display:inline;vertical-align:middle;padding:2%">

# Installation
You can install EDL with `pip install paddle_edl`, but we highly **recommend** running it in our Docker image:

```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7 /bin/bash
```

# EDL Applications:

<p align="center">
<img src="doc/distill.gif" width="700">
</p>

## Quick Start
- [Run EDL distillation training demo on Kubernetes or a single node](./example/distill/README.md)

## Key Features:
- Efficiency: Provides parallelism strategies to minimize adjustment overheads.
- Consistency: Accuracy is verified on multiple models against training without scaling.
- Flexibility: Any component can be killed, or can join, at any time.
- Easy to use: Only a few lines of code are needed to enable EDL.
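The flexibility claim can be pictured with a small sketch (assumed semantics for illustration, not the EDL API): when a worker is killed, the survivors re-shard the dataset so every sample is still covered exactly once.

```python
# Toy elastic sharding: each worker takes every num_workers-th sample.
def shard(samples, num_workers, worker_id):
    """Round-robin shard: worker i gets samples i, i + n, i + 2n, ..."""
    return samples[worker_id::num_workers]

samples = list(range(10))

# Four workers at first: worker 0 trains on [0, 4, 8], worker 1 on [1, 5, 9], ...
shards = [shard(samples, 4, i) for i in range(4)]

# One worker is killed; the three survivors re-shard and together still
# cover every sample exactly once.
shards = [shard(samples, 3, i) for i in range(3)]
assert sorted(s for part in shards for s in part) == samples
```

Re-sharding on membership change is the invariant that lets components leave or join mid-job without losing data coverage.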
# EDL Framework
## How to change a normal training program into an EDL training program
The main changes are to call `load_checkpoint` at the beginning of training and `save_checkpoint` at the end of every epoch. The checkpoint should live on a distributed file system, such as HDFS, so that all trainers can download it. A complete example is [here](https://github.com/elasticdeeplearning/edl/tree/develop/example/collective/resnet50).

```
fs = HDFSClient(args.hdfs_name, args.hdfs_ugi, 20 * 60 * 1000, 3 * 1000)
train_status = TrainStatus()
# Resume from the latest checkpoint if one exists.
tmp_s = fleet.load_checkpoint(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
if tmp_s is not None:
    train_status = tmp_s

for pass_id in range(train_status.next(), params["num_epochs"]):
    train()
    if trainer_id == 0:
        # Only trainer 0 saves the checkpoint at the end of each epoch.
        saved_status = TrainStatus(pass_id)
        fleet.save_checkpoint(exe, train_status=saved_status,
                              path=args.checkpoint, fs=fs)
```
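The snippet above needs a running Fleet job and an HDFS cluster. The resume logic itself can be exercised locally with a plain dict standing in for the checkpoint store (`train_with_resume` and all of its arguments are invented for this sketch):

```python
# Minimal stand-in for the load_checkpoint / save_checkpoint pattern above.
# `store` plays the role of the HDFS checkpoint; every "trainer" can read it.
def train_with_resume(store, num_epochs, log):
    start = store.get("epoch", -1) + 1   # load_checkpoint: where to resume
    for pass_id in range(start, num_epochs):
        log.append(pass_id)              # train() for one epoch
        store["epoch"] = pass_id         # save_checkpoint at the epoch's end

log = []
store = {}
train_with_resume(store, 5, log)   # epochs 0..4 run
train_with_resume(store, 5, log)   # restart after a "crash": nothing is retrained
assert log == [0, 1, 2, 3, 4]
train_with_resume(store, 7, log)   # raising num_epochs resumes at epoch 5, not 0
assert log == [0, 1, 2, 3, 4, 5, 6]
```

This is exactly why EDL can kill and relaunch trainers cheaply: a restarted worker repeats at most the current epoch.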

## Quick Start
### EDL Resnet50 experiments on a single machine in Docker:

1. Start a JobServer on one node; it generates the scripts that change the number of workers.

```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
cd example/demo/collective
./start_job_server.sh
```

1. Start a JobClient, which controls the worker processes.

```
# Set the ImageNet data path
mkdir -p resnet50_pod
./start_job_client.sh
```

1. Experiment results

| total batch size | acc1 | acc5 |
| :-----: | ----: | ----: |
| 1024 | 75.5 | 92.8 |




## FAQ

