This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Migrate
Migrate OpenPAI protocol from https://github.com/microsoft/pai.
abuccts committed Jan 2, 2020
1 parent 4d917e4 commit 8c21ddf
Showing 7 changed files with 348 additions and 64 deletions.
4 changes: 4 additions & 0 deletions .gitattributes
@@ -0,0 +1,4 @@
* eol=lf
*.md text
*.yaml text
*.yml text
23 changes: 23 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,23 @@
name: Lint

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  spelling:
    name: Spelling check
    runs-on: ubuntu-16.04
    steps:
      - name: Checkout
        uses: actions/checkout@v1
      - name: Install dependencies
        run: |
          curl -L https://git.io/misspell | sudo bash -s -- -b /bin
      - name: Check spelling
        run: |
          misspell -error .
18 changes: 9 additions & 9 deletions CODE_OF_CONDUCT.md
@@ -1,9 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
42 changes: 28 additions & 14 deletions README.md
@@ -1,14 +1,28 @@

OpenPAI Protocol
================

[OpenPAI](https://github.com/microsoft/pai) Protocol is a specification that includes:
- Resource requirements, including the Docker image used by the job container and the data used by the job.
- Requirements such as GPU/CPU usage, container roles, and the job completion policy used by [Framework Controller](https://github.com/microsoft/frameworkcontroller).
- Scheduling requirements used by [Framework Controller](https://github.com/microsoft/frameworkcontroller) and [HiveD Scheduler](https://github.com/microsoft/hivedscheduler).
- Runtime environment variables, if needed.

The OpenPAI protocol enables job sharing and portability: a job specified by the protocol can run on different OpenPAI deployments.
The protocol also lets users turn a job into a template, which makes sharing and collaboration easier for a team whose jobs in a given class have similar but slightly different configurations.
With the protocol, OpenPAI introduces a marketplace that hosts jobs and job templates shared with other people.
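
To make the template idea concrete, here is a minimal Python sketch of how placeholders such as `<% $parameters.model %>` (used in `examples/v2/example.yaml` in this commit) might be filled in. The `render` function, the `PLACEHOLDER` pattern, and the `context` dictionary are hypothetical names chosen for illustration; this is not OpenPAI's actual template engine, only one plausible way to substitute values under the assumption that every placeholder has the form `<% $section.key %>` or `<% $section.key[index] %>`.

```python
import re

# Assumed placeholder grammar: <% $section.key %> or <% $section.key[0] %>.
# This pattern is an illustration, not the official protocol grammar.
PLACEHOLDER = re.compile(r"<%\s*\$(\w+)\.(\w+)(?:\[(\d+)\])?\s*%>")


def render(text, context):
    """Replace each placeholder in `text` with the value looked up in `context`."""

    def substitute(match):
        section, key, index = match.group(1), match.group(2), match.group(3)
        value = context[section][key]
        if index is not None:
            # Optional list index, e.g. <% $data.uri[0] %>.
            value = value[int(index)]
        return str(value)

    return PLACEHOLDER.sub(substitute, text)


# Values copied from the example job; the dictionary layout is hypothetical.
context = {
    "parameters": {"model": "resnet20", "batchsize": 32},
    "data": {
        "name": "cifar10",
        "uri": ["https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"],
    },
}

command = "--model=<% $parameters.model %> --batch_size=<% $parameters.batchsize %>"
print(render(command, context))  # prints "--model=resnet20 --batch_size=32"
```

Changing only the `parameters` values would produce a different concrete job from the same template, which is the kind of reuse the paragraph above describes.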


Contributing
------------

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
82 changes: 41 additions & 41 deletions SECURITY.md
@@ -1,41 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.3 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).

If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->
146 changes: 146 additions & 0 deletions examples/v2/example.yaml
@@ -0,0 +1,146 @@
# OpenPAI Job Protocol YAML Example for a Distributed TensorFlow Job

protocolVersion: 2
name: tensorflow_cifar10
type: job
version: 1.0
contributor: Alice
description: image classification, cifar10 dataset, tensorflow, distributed training

prerequisites:
  - protocolVersion: 2
    name: tf_example
    type: dockerimage
    version: latest
    contributor: Alice
    description: python3.5, tensorflow
    auth:
      username: user
      password: <% $secrets.docker_password %>
      registryuri: openpai.azurecr.io
    uri: openpai/pai.example.tensorflow
  - protocolVersion: 2
    name: tensorflow_cifar10_model
    type: output
    version: latest
    contributor: Alice
    description: cifar10 data output
    uri: hdfs://10.151.40.179:9000/core/cifar10_model
  - protocolVersion: 2
    name: tensorflow_cnnbenchmarks
    type: script
    version: 84820935288cab696c9c2ac409cbd46a1f24723d
    contributor: MaggieQi
    description: tensorflow benchmarks
    uri: github.com/MaggieQi/benchmarks
  - protocolVersion: 2
    name: cifar10
    type: data
    version: latest
    contributor: Alice
    description: cifar10 dataset, image classification
    uri:
      - https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

parameters:
  model: resnet20
  batchsize: 32

secrets:
  docker_password: password
  github_token: cGFzc3dvcmQ=

jobRetryCount: 1
taskRoles:
  worker:
    instances: 1
    completion:
      minFailedInstances: 1
      minSucceededInstances: 1
    taskRetryCount: 0
    dockerImage: tf_example
    data: cifar10
    output: tensorflow_cifar10_model
    script: tensorflow_cnnbenchmarks
    extraContainerOptions:
      shmMB: 64
    resourcePerInstance:
      cpu: 2
      memoryMB: 16384
      gpu: 4
      ports:
        ssh: 1
        http: 1
    commands:
      - cd script_<% $script.name %>/scripts/tf_cnn_benchmarks
      - >
        python tf_cnn_benchmarks.py --job_name=worker
        --local_parameter_device=gpu
        --variable_update=parameter_server
        --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST
        --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST
        --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX
        --data_name=<% $data.name %>
        --data_dir=$PAI_WORK_DIR/data_<% $data.name %>
        --train_dir=$PAI_WORK_DIR/output_<% $output.name %>
        --model=<% $parameters.model %>
        --batch_size=<% $parameters.batchsize %>
  ps_server:
    instances: 1
    completion:
      minFailedInstances: 1
      minSucceededInstances: -1
    taskRetryCount: 0
    dockerImage: tf_example
    data: cifar10
    output: tensorflow_cifar10_model
    script: tensorflow_cnnbenchmarks
    extraContainerOptions:
      shmMB: 64
    resourcePerInstance:
      cpu: 2
      memoryMB: 8192
      gpu: 0
      ports:
        ssh: 1
        http: 1
    commands:
      - cd script_<% $script.name %>/scripts/tf_cnn_benchmarks
      - >
        python tf_cnn_benchmarks.py --job_name=ps
        --local_parameter_device=gpu
        --variable_update=parameter_server
        --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST
        --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST
        --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX
        --data_dir=$PAI_WORK_DIR/data_<% $data.name %>
        --data_name=<% $data.name %>
        --train_dir=$PAI_WORK_DIR/output_<% $output.name %>
        --model=<% $parameters.model %>
        --batch_size=<% $parameters.batchsize %>
deployments:
  - name: prod # This deployment downloads the data to local disk; the computed model is written to local disk first and then copied to HDFS.
    version: 1.0
    taskRoles:
      worker:
        preCommands:
          - wget <% $data.uri[0] %> -P data_<% $data.name %> # If a local data cache is deployed, copy the data from the cache and wget only on a cache miss.
          - >
            git clone https://<% $script.contributor %>:<% $secrets.github_token %>@<% $script.uri %> script_<% $script.name %> &&
            cd script_<% $script.name %> && git checkout <% $script.version %> && cd ..
          # The system then executes the worker's commands.
      ps_server:
        preCommands:
          - wget <% $data.uri[0] %> -P data_<% $data.name %>
          - >
            git clone https://<% $script.contributor %>:<% $secrets.github_token %>@<% $script.uri %> script_<% $script.name %> &&
            cd script_<% $script.name %> && git checkout <% $script.version %> && cd ..
          # The system then executes ps_server's commands.
        postCommands:
          # The system runs these after ps_server's commands finish.
          - hdfs dfs -cp output_<% $output.name %> <% $output.uri %>
          # This assumes the model is written locally; the command above copies the local output to HDFS.
          # To write to HDFS directly instead, change "--train_dir=$PAI_WORK_DIR/output_<% $output.name %>".

defaults:
  deployment: prod # Use the prod deployment in job submission.
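
The comments in the `deployments` section describe an execution order: a role's `preCommands` run first, then its `commands`, then its `postCommands`. The Python sketch below illustrates that ordering on a plain dictionary shaped like the example above. The function name `build_command_sequence` and the shortened command strings (`DATA_URI`, `OUTPUT_DIR`, `OUTPUT_URI`, `train.py`) are hypothetical stand-ins, and the lookup of the deployment through `defaults.deployment` is an assumption based on the example's comments rather than a statement of how OpenPAI implements it.

```python
def build_command_sequence(job, role_name):
    """Return preCommands + commands + postCommands for one task role.

    Assumes the deployment named in job["defaults"]["deployment"] is the one
    to apply; roles without pre/post commands just get their own commands.
    """
    deployment_name = job.get("defaults", {}).get("deployment")
    deployment = next(
        (d for d in job.get("deployments", []) if d["name"] == deployment_name),
        {},  # no matching deployment: nothing is wrapped around the commands
    )
    role_overrides = deployment.get("taskRoles", {}).get(role_name, {})
    return (
        role_overrides.get("preCommands", [])
        + job["taskRoles"][role_name]["commands"]
        + role_overrides.get("postCommands", [])
    )


# A trimmed-down job with placeholder command strings (not the real commands).
job = {
    "taskRoles": {
        "worker": {"commands": ["python train.py --job_name=worker"]},
        "ps_server": {"commands": ["python train.py --job_name=ps"]},
    },
    "deployments": [
        {
            "name": "prod",
            "taskRoles": {
                "worker": {"preCommands": ["wget DATA_URI"]},
                "ps_server": {
                    "preCommands": ["wget DATA_URI"],
                    "postCommands": ["hdfs dfs -cp OUTPUT_DIR OUTPUT_URI"],
                },
            },
        }
    ],
    "defaults": {"deployment": "prod"},
}

for role in ("worker", "ps_server"):
    print(role, build_command_sequence(job, role))
```

Keeping the deployment-specific steps separate from the role's own commands means the same job definition can be moved to another cluster by swapping only the deployment entry.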