This repository has been archived by the owner on Nov 16, 2023. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Migrate OpenPAI protocol from https://github.com/microsoft/pai.
Showing 7 changed files with 348 additions and 64 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
* eol=lf | ||
*.md text | ||
*.yaml text | ||
*.yml text |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
name: Lint | ||
|
||
on: | ||
push: | ||
branches: | ||
- master | ||
pull_request: | ||
branches: | ||
- master | ||
|
||
jobs: | ||
spelling: | ||
name: Spelling check | ||
runs-on: ubuntu-16.04 | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v1 | ||
- name: Install dependencies | ||
run: | | ||
curl -L https://git.io/misspell | sudo bash -s -- -b /bin | ||
- name: Check spelling | ||
run: | | ||
misspell -error . |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
# Microsoft Open Source Code of Conduct | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
|
||
Resources: | ||
|
||
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) | ||
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) | ||
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns | ||
# Microsoft Open Source Code of Conduct | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
|
||
Resources: | ||
|
||
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) | ||
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) | ||
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,28 @@ | ||
|
||
# Contributing | ||
|
||
This project welcomes contributions and suggestions. Most contributions require you to agree to a | ||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us | ||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. | ||
|
||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide | ||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions | ||
provided by the bot. You will only need to do this once across all repos using our CLA. | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. | ||
OpenPAI Protocol | ||
================ | ||
|
||
[OpenPAI](https://github.com/microsoft/pai) Protocol is a specification that includes: | ||
- Resource requirements, such as the Docker image used by the job container and the data used by the job. | ||
- Various requirements, such as GPU/CPU usage, container role, and job completion policy, used by [Framework Controller](https://github.com/microsoft/frameworkcontroller). | ||
- Scheduling requirements used by [Framework Controller](https://github.com/microsoft/frameworkcontroller) and [HiveD Scheduler](https://github.com/microsoft/hivedscheduler). | ||
- Runtime environment variables, if needed. | ||
|
||
The OpenPAI protocol enables job sharing and portability: a job specified by the protocol can run in different OpenPAI deployments. | ||
The protocol also lets users turn a job into a template, which makes sharing and collaboration easier in a team whose jobs of the same class differ only slightly in configuration. | ||
With the protocol, OpenPAI introduces a marketplace that hosts jobs and job templates shared with other people. | ||
|
||
|
||
Contributing | ||
------------ | ||
|
||
This project welcomes contributions and suggestions. Most contributions require you to agree to a | ||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us | ||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. | ||
|
||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide | ||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions | ||
provided by the bot. You will only need to do this once across all repos using our CLA. | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. |
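A note on the template idea from the README's opening section: the sketch below is one way to picture it, assuming a job is held as a plain Python dict. `TEMPLATE`, `derive_job`, and the variant name are hypothetical and are not part of the OpenPAI protocol or its tooling; the field names are borrowed from the YAML example in this commit.

```python
import copy

# Hypothetical shared template. Field names follow the YAML example in this
# commit; the dict representation itself is an assumption for illustration.
TEMPLATE = {
    "protocolVersion": 2,
    "name": "tensorflow_cifar10",
    "type": "job",
    "parameters": {"model": "resnet20", "batchsize": 32},
}


def derive_job(template, name, **parameter_overrides):
    """Return a copy of the template with a new name and selected parameters replaced."""
    job = copy.deepcopy(template)
    job["name"] = name
    # Reject overrides for parameters the template does not declare.
    unknown = set(parameter_overrides) - set(job["parameters"])
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    job["parameters"].update(parameter_overrides)
    return job


variant = derive_job(TEMPLATE, "tensorflow_cifar10_bs64", batchsize=64)
print(variant["parameters"])
```

The deep copy leaves the shared template untouched, so each team member can derive a slightly different job from the same starting point.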
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,41 +1,41 @@ | ||
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.3 BLOCK --> | ||
|
||
## Security | ||
|
||
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). | ||
|
||
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below. | ||
|
||
## Reporting Security Issues | ||
|
||
**Please do not report security vulnerabilities through public GitHub issues.** | ||
|
||
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). | ||
|
||
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). | ||
|
||
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). | ||
|
||
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: | ||
|
||
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) | ||
* Full paths of source file(s) related to the manifestation of the issue | ||
* The location of the affected source code (tag/branch/commit or direct URL) | ||
* Any special configuration required to reproduce the issue | ||
* Step-by-step instructions to reproduce the issue | ||
* Proof-of-concept or exploit code (if possible) | ||
* Impact of the issue, including how an attacker might exploit the issue | ||
|
||
This information will help us triage your report more quickly. | ||
|
||
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs. | ||
|
||
## Preferred Languages | ||
|
||
We prefer all communications to be in English. | ||
|
||
## Policy | ||
|
||
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd). | ||
|
||
<!-- END MICROSOFT SECURITY.MD BLOCK --> | ||
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.3 BLOCK --> | ||
|
||
## Security | ||
|
||
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). | ||
|
||
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below. | ||
|
||
## Reporting Security Issues | ||
|
||
**Please do not report security vulnerabilities through public GitHub issues.** | ||
|
||
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). | ||
|
||
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). | ||
|
||
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). | ||
|
||
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: | ||
|
||
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) | ||
* Full paths of source file(s) related to the manifestation of the issue | ||
* The location of the affected source code (tag/branch/commit or direct URL) | ||
* Any special configuration required to reproduce the issue | ||
* Step-by-step instructions to reproduce the issue | ||
* Proof-of-concept or exploit code (if possible) | ||
* Impact of the issue, including how an attacker might exploit the issue | ||
|
||
This information will help us triage your report more quickly. | ||
|
||
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs. | ||
|
||
## Preferred Languages | ||
|
||
We prefer all communications to be in English. | ||
|
||
## Policy | ||
|
||
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd). | ||
|
||
<!-- END MICROSOFT SECURITY.MD BLOCK --> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
# OpenPAI Job Protocol YAML Example for a Distributed TensorFlow Job | ||
|
||
protocolVersion: 2 | ||
name: tensorflow_cifar10 | ||
type: job | ||
version: 1.0 | ||
contributor: Alice | ||
description: image classification, cifar10 dataset, tensorflow, distributed training | ||
|
||
prerequisites: | ||
- protocolVersion: 2 | ||
name: tf_example | ||
type: dockerimage | ||
version: latest | ||
contributor: Alice | ||
description: python3.5, tensorflow | ||
auth: | ||
username: user | ||
password: <% $secrets.docker_password %> | ||
registryuri: openpai.azurecr.io | ||
uri: openpai/pai.example.tensorflow | ||
- protocolVersion: 2 | ||
name: tensorflow_cifar10_model | ||
type: output | ||
version: latest | ||
contributor: Alice | ||
description: cifar10 data output | ||
uri: hdfs://10.151.40.179:9000/core/cifar10_model | ||
- protocolVersion: 2 | ||
name: tensorflow_cnnbenchmarks | ||
type: script | ||
version: 84820935288cab696c9c2ac409cbd46a1f24723d | ||
contributor: MaggieQi | ||
description: tensorflow benchmarks | ||
uri: github.com/MaggieQi/benchmarks | ||
- protocolVersion: 2 | ||
name: cifar10 | ||
type: data | ||
version: latest | ||
contributor: Alice | ||
description: cifar10 dataset, image classification | ||
uri: | ||
- https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz | ||
|
||
parameters: | ||
model: resnet20 | ||
batchsize: 32 | ||
|
||
secrets: | ||
docker_password: password | ||
github_token: cGFzc3dvcmQ= | ||
|
||
jobRetryCount: 1 | ||
taskRoles: | ||
worker: | ||
instances: 1 | ||
completion: | ||
minFailedInstances: 1 | ||
minSucceededInstances: 1 | ||
taskRetryCount: 0 | ||
dockerImage: tf_example | ||
data: cifar10 | ||
output: tensorflow_cifar10_model | ||
script: tensorflow_cnnbenchmarks | ||
extraContainerOptions: | ||
shmMB: 64 | ||
resourcePerInstance: | ||
cpu: 2 | ||
memoryMB: 16384 | ||
gpu: 4 | ||
ports: | ||
ssh: 1 | ||
http: 1 | ||
commands: | ||
- cd script_<% $script.name %>/scripts/tf_cnn_benchmarks | ||
- > | ||
python tf_cnn_benchmarks.py --job_name=worker | ||
--local_parameter_device=gpu | ||
--variable_update=parameter_server | ||
--ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST | ||
--worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST | ||
--task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX | ||
--data_name=<% $data.name %> | ||
--data_dir=$PAI_WORK_DIR/data_<% $data.name %> | ||
--train_dir=$PAI_WORK_DIR/output_<% $output.name %> | ||
--model=<% $parameters.model %> | ||
--batch_size=<% $parameters.batchsize %> | ||
ps_server: | ||
instances: 1 | ||
completion: | ||
minFailedInstances: 1 | ||
minSucceededInstances: -1 | ||
taskRetryCount: 0 | ||
dockerImage: tf_example | ||
data: cifar10 | ||
output: tensorflow_cifar10_model | ||
script: tensorflow_cnnbenchmarks | ||
extraContainerOptions: | ||
shmMB: 64 | ||
resourcePerInstance: | ||
cpu: 2 | ||
memoryMB: 8192 | ||
gpu: 0 | ||
ports: | ||
ssh: 1 | ||
http: 1 | ||
commands: | ||
- cd script_<% $script.name %>/scripts/tf_cnn_benchmarks | ||
- > | ||
python tf_cnn_benchmarks.py --job_name=ps | ||
--local_parameter_device=gpu | ||
--variable_update=parameter_server | ||
--ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST | ||
--worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST | ||
--task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX | ||
--data_dir=$PAI_WORK_DIR/data_<% $data.name %> | ||
--data_name=<% $data.name %> | ||
--train_dir=$PAI_WORK_DIR/output_<% $output.name %> | ||
--model=<% $parameters.model %> | ||
--batch_size=<% $parameters.batchsize %> | ||
deployments: | ||
- name: prod # This deployment downloads the data to local disk; the trained model is written to local disk first and then copied to HDFS. | ||
version: 1.0 | ||
taskRoles: | ||
worker: | ||
preCommands: | ||
- wget <% $data.uri[0] %> -P data_<% $data.name %> # If a local data cache is deployed, copy the data from the cache instead and only wget on a cache miss. | ||
- > | ||
git clone https://<% $script.contributor %>:<% $secrets.github_token %>@<% $script.uri %> script_<% $script.name %> && | ||
cd script_<% $script.name %> && git checkout <% $script.version %> && cd .. | ||
# The system then goes on to execute the worker's commands. | ||
ps_server: | ||
preCommands: | ||
- wget <% $data.uri[0] %> -P data_<% $data.name %> | ||
- > | ||
git clone https://<% $script.contributor %>:<% $secrets.github_token %>@<% $script.uri %> script_<% $script.name %> && | ||
cd script_<% $script.name %> && git checkout <% $script.version %> && cd .. | ||
# The system then goes on to execute ps_server's commands. | ||
postCommands: | ||
# The system runs these after ps_server's commands have finished. | ||
- hdfs dfs -cp output_<% $output.name %> <% $output.uri %> | ||
# This assumes the model is written locally; the command above copies the local output to HDFS. You can also write to HDFS directly. | ||
# In that case, you will have to change "--train_dir=$PAI_WORK_DIR/output_<% $output.name %>". | ||
|
||
defaults: | ||
deployment: prod # Use prod deployment in job submission. |
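The `<% $section.field %>` placeholders in the example above can be pictured as simple lookups into the job's own sections. The following is a minimal sketch under that assumption. `TOKEN`, `render`, and `context` are hypothetical names, and this is not how OpenPAI itself resolves the placeholders; it only illustrates what the rendered commands would look like with the values from this example.

```python
import re

# Matches placeholders such as <% $parameters.model %> or <% $data.uri[0] %>.
TOKEN = re.compile(r"<%\s*\$([A-Za-z_]\w*)\.([A-Za-z_]\w*)(?:\[(\d+)\])?\s*%>")


def render(command, context):
    """Replace each <% $section.field %> placeholder with its value from context."""

    def lookup(match):
        section, field, index = match.groups()
        value = context[section][field]
        if index is not None:
            # Optional list index, as in <% $data.uri[0] %>.
            value = value[int(index)]
        return str(value)

    return TOKEN.sub(lookup, command)


# Values copied from the YAML example; the nesting is an assumption.
context = {
    "parameters": {"model": "resnet20", "batchsize": 32},
    "data": {
        "name": "cifar10",
        "uri": ["https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"],
    },
    "output": {"name": "tensorflow_cifar10_model"},
}

print(render("--model=<% $parameters.model %> --batch_size=<% $parameters.batchsize %>", context))
print(render("wget <% $data.uri[0] %> -P data_<% $data.name %>", context))
```

A missing section or field raises `KeyError` here, which seems a reasonable choice for a sketch: a silently empty substitution in a command line would be harder to notice.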