Commit 90dbd6e: Fine tune readme. (#111)

* change readme test=develop
* add test=develop

gongweibao authored Jun 4, 2020 (1 parent 00c717a)
Showing 1 changed file: README.md (45 additions and 44 deletions)

<img src="https://github.com/elasticdeeplearning/artwork/blob/master/horizontal/color/edl-horizontal-color.png" width="500" style="display:inline;vertical-align:middle;padding:2%">

# Motivation
Elastic Deep Learning (EDL) is a framework that can dynamically adjust the parallelism (the number of training workers) of deep neural network training. It supports multi-tenant cluster management that balances job completion time against job waiting time and maximizes the use of otherwise idle resources.
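As a toy illustration of the balancing idea (this is not EDL's actual scheduler; the `rebalance` policy and the job names below are invented for the sketch), a fixed pool of GPUs can be re-split whenever a job arrives or finishes, so no device sits idle:

```python
# Toy illustration only: evenly redistribute a fixed GPU pool across the
# currently running jobs. NOT EDL's real scheduler.
def rebalance(total_gpus, jobs):
    """Evenly split total_gpus across jobs; earlier jobs get the remainder."""
    n = len(jobs)
    if n == 0:
        return {}
    base, extra = divmod(total_gpus, n)
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(jobs)}

jobs = ["resnet50", "bert"]
alloc = rebalance(8, jobs)   # {'resnet50': 4, 'bert': 4}
jobs.append("mnist")         # a new tenant arrives ...
alloc = rebalance(8, jobs)   # ... and every job is resized: {'resnet50': 3, 'bert': 3, 'mnist': 2}
```

The point of the sketch is only that resizing running jobs, rather than queueing the newcomer, keeps all 8 GPUs busy.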

This project contains the EDL framework and its applications, such as distillation and NAS.

EDL is now an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

<img src="https://github.com/lfai/artwork/blob/master/lfai-project-badge/incubation/color/lfai-projectlogos_incubation-color.png" width="200" style="display:inline;vertical-align:middle;padding:2%">

# Installation
You can install EDL with `pip install paddle_edl`, but we highly **recommend** running it in our Docker image:

```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7 /bin/bash
```

# EDL Applications:

<p align="center">
<img src="doc/distill.gif" width="700">
</p>

## Quick Start
- [Run EDL distillation training demo on Kubernetes or a single node](./example/distill/README.md)

## Key Features:
- Efficiency: Provides parallelism strategies to minimize adjustment overheads.
- Consistency: Accuracy is verified on multiple models against training without scaling.
- Flexibility: Any component can be killed, or can join, at any time.
- Easy to use: Only a few lines of code are needed to enable EDL.
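The flexibility claim can be pictured with a small sketch (assumed semantics for illustration, not the EDL API): when a worker is killed, the survivors re-shard the dataset so every sample is still covered exactly once.

```python
# Toy elastic sharding: each worker takes every num_workers-th sample.
def shard(samples, num_workers, worker_id):
    """Round-robin shard: worker i gets samples i, i + n, i + 2n, ..."""
    return samples[worker_id::num_workers]

samples = list(range(10))

# Four workers at first: worker 0 trains on [0, 4, 8], worker 1 on [1, 5, 9], ...
shards = [shard(samples, 4, i) for i in range(4)]

# One worker is killed; the three survivors re-shard and together still
# cover every sample exactly once.
shards = [shard(samples, 3, i) for i in range(3)]
assert sorted(s for part in shards for s in part) == samples
```

Re-sharding on membership change is the invariant that lets components leave or join mid-job without losing data coverage.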
# EDL Framework
## How to change a normal training program into an EDL training program
The main changes are to call `load_checkpoint` at the beginning of training and `save_checkpoint` at the end of every epoch. The checkpoint should live on a distributed file system, such as HDFS, so that all trainers can download it. A complete example is [here](https://github.com/elasticdeeplearning/edl/tree/develop/example/collective/resnet50).

```
fs = HDFSClient(args.hdfs_name, args.hdfs_ugi, 20 * 60 * 1000, 3 * 1000)
train_status = TrainStatus()
# Resume from the latest checkpoint if one exists.
tmp_s = fleet.load_checkpoint(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
if tmp_s is not None:
    train_status = tmp_s

for pass_id in range(train_status.next(), params["num_epochs"]):
    train()
    if trainer_id == 0:
        # Only trainer 0 saves the checkpoint at the end of each epoch.
        saved_status = TrainStatus(pass_id)
        fleet.save_checkpoint(exe, train_status=saved_status,
                              path=args.checkpoint, fs=fs)
```
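The snippet above needs a running Fleet job and an HDFS cluster. The resume logic itself can be exercised locally with a plain dict standing in for the checkpoint store (`train_with_resume` and all of its arguments are invented for this sketch):

```python
# Minimal stand-in for the load_checkpoint / save_checkpoint pattern above.
# `store` plays the role of the HDFS checkpoint; every "trainer" can read it.
def train_with_resume(store, num_epochs, log):
    start = store.get("epoch", -1) + 1   # load_checkpoint: where to resume
    for pass_id in range(start, num_epochs):
        log.append(pass_id)              # train() for one epoch
        store["epoch"] = pass_id         # save_checkpoint at the epoch's end

log = []
store = {}
train_with_resume(store, 5, log)   # epochs 0..4 run
train_with_resume(store, 5, log)   # restart after a "crash": nothing is retrained
assert log == [0, 1, 2, 3, 4]
train_with_resume(store, 7, log)   # raising num_epochs resumes at epoch 5, not 0
assert log == [0, 1, 2, 3, 4, 5, 6]
```

This is exactly why EDL can kill and relaunch trainers cheaply: a restarted worker repeats at most the current epoch.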

## Quick Start
### EDL Resnet50 experiments on a single machine in Docker:

1. Start a JobServer on one node; it generates the scripts that change the number of workers.

```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
cd example/demo/collective
./start_job_server.sh
```

1. Start a JobClient, which controls the worker processes.

```
# Set the ImageNet data path
mkdir -p resnet50_pod
./start_job_client.sh
```

1. Experiment results

| total batch size | acc1 | acc5 |
| :-----: | ----: | ----: |
| 1024 | 75.5 | 92.8 |




## FAQ

