@Yancey1989 wrote this job submission tool: https://github.com/Yancey1989/paddle-job
Currently, submitting a job looks like:
paddle.init(
    use_gpu=False,
    trainer_count=1,
    port=7164,
    ports_num=1,
    ports_num_for_sparse=1,
    num_gradient_servers=1,
    trainer_id=fetch_trainer_id(),
    pservers=fetch_pserver_ips())
job.dist_train(
    trainer=trainer,
    reader=paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=30,
    event_handler=event_handler,
    paddle_job=job.PaddleJob(
        pservers=3,
        base_image="yancey1989/paddle-cloud",
        input="/yanxu05",
        output="/yanxu05",
        job_name="paddle-cloud",
        namespace="yanxu",
        use_gpu=False,
        cpu_num=3,
        trainer_package_path="/example/word2vec",
        entry_point="python api_train_v2.py"))
We want to make it simpler, like this:
# init from ENV "PADDLE_*"; args passed below override the ENVs
paddle.init(use_gpu=False)
...
myjob = job.dist_train(
    trainer=trainer,
    reader=my_dist_reader("dataset-name"),
    num_passes=30,
    event_handler=event_handler,
    paddle_job=job.PaddleJob(
        [cluster configurations...]))
print "view job status at: ", myjob.status_url()
Required ENVs:
- "PADDLE_PSERVERS"
- "PADDLE_TRAINER_ID"
- "PADDLE_TRAINER_COUNT"
- "PADDLE_NUM_GRADIENT_SERVERS"
- "PADDLE_PORTS_NUM_FOR_SPARSE"
Optional ENVs:
- "PADDLE_PORT": default 7164
- "PADDLE_PORTS_NUM": default 1
- "PADDLE_USE_GPU": default False
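The ENV-driven initialization described above could be sketched roughly as follows. This is only an illustration, not Paddle's implementation: `resolve_init_args` and `parse_bool` are hypothetical helpers, the mapping from ENV names to `paddle.init` argument names is inferred from the call shown earlier, and the accepted boolean spellings are an assumption.

```python
import os

# Hypothetical helper (not part of Paddle): builds the keyword arguments for
# paddle.init() from PADDLE_* environment variables; explicit kwargs win.

def parse_bool(text):
    # Assumption: "1", "true" and "yes" (any case) mean True.
    return text.strip().lower() in ("1", "true", "yes")

# ENV name -> (paddle.init argument name, converter)
REQUIRED = {
    "PADDLE_PSERVERS": ("pservers", str),
    "PADDLE_TRAINER_ID": ("trainer_id", int),
    "PADDLE_TRAINER_COUNT": ("trainer_count", int),
    "PADDLE_NUM_GRADIENT_SERVERS": ("num_gradient_servers", int),
    "PADDLE_PORTS_NUM_FOR_SPARSE": ("ports_num_for_sparse", int),
}

# ENV name -> (paddle.init argument name, converter, default)
OPTIONAL = {
    "PADDLE_PORT": ("port", int, 7164),
    "PADDLE_PORTS_NUM": ("ports_num", int, 1),
    "PADDLE_USE_GPU": ("use_gpu", parse_bool, False),
}

def resolve_init_args(environ=None, **overrides):
    environ = os.environ if environ is None else environ
    args = {}
    missing = []
    for env_name, (arg_name, convert) in REQUIRED.items():
        if arg_name in overrides:
            continue  # an explicit argument replaces the required ENV
        if env_name not in environ:
            missing.append(env_name)
        else:
            args[arg_name] = convert(environ[env_name])
    if missing:
        raise KeyError("missing required ENVs: " + ", ".join(sorted(missing)))
    for env_name, (arg_name, convert, default) in OPTIONAL.items():
        if env_name in environ:
            args[arg_name] = convert(environ[env_name])
        else:
            args[arg_name] = default
    args.update(overrides)  # explicit arguments override the ENVs
    return args
```

With such a helper, `paddle.init(use_gpu=False)` would behave like `paddle.init(**resolve_init_args(use_gpu=False))`, so the trainer script itself carries no cluster-specific values.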
Cluster Job Configurations:
Job Resources
- parallism: the number of trainers; the number of pservers is calculated from this value.
- num_gpus: GPU resources needed. If num_gpus == 0 while ENV "PADDLE_USE_GPU" is set to True, or the opposite, Paddle will print a warning message when the job is submitted.
- num_cpus: CPU resources needed.
- entry_point: command to start your training program, e.g. python /data/cloud/storage/path/train.py
- NOTE: Paddle mounts your cloud storage volume at /data by default, so your training program can read data anywhere under /data
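The consistency check between num_gpus and "PADDLE_USE_GPU" could look something like the sketch below. `check_gpu_settings` is a hypothetical name and the message wording is invented for illustration; only the rule itself (warn when the two settings disagree) comes from the description above.

```python
import warnings

def check_gpu_settings(num_gpus, use_gpu):
    """Return True when num_gpus and PADDLE_USE_GPU agree; warn otherwise."""
    if num_gpus == 0 and use_gpu:
        # GPU mode requested, but no GPU resources were asked for.
        warnings.warn("PADDLE_USE_GPU is True but num_gpus is 0")
        return False
    if num_gpus > 0 and not use_gpu:
        # GPU resources were asked for, but the trainer runs in CPU mode.
        warnings.warn("num_gpus is %d but PADDLE_USE_GPU is False" % num_gpus)
        return False
    return True
```

Returning a flag instead of raising keeps the mismatch a warning, as proposed, so submission can still proceed.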
Advanced settings:
- pservers: if this is set, the number of pservers is set to this value instead of being calculated automatically from parallism.
- base_image: use your own image to run
- job_name: use your own job name
- NOTE: namespace is read from ENV: "USER_NAMESPACE"
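How these settings might be resolved at submission time is sketched below. Everything here is hypothetical: `resolve_job_settings` is not an existing function, and because the issue does not say how the pserver count is derived from parallism, the one-pserver-per-trainer default is only a placeholder that a caller can replace.

```python
import os

def resolve_job_settings(parallism, pservers=None, job_name=None,
                         base_image=None, environ=None,
                         auto_pservers=lambda parallism: parallism):
    """Merge user-supplied advanced settings with derived values.

    auto_pservers is a placeholder rule (assumption: one pserver per
    trainer); the real rule is not specified in the proposal.
    """
    environ = os.environ if environ is None else environ
    return {
        "parallism": parallism,
        # An explicit pservers value wins over the derived one.
        "pservers": pservers if pservers is not None else auto_pservers(parallism),
        # None means "let the platform pick its default".
        "job_name": job_name,
        "base_image": base_image,
        # The namespace is never passed by the user; it comes from the ENV.
        "namespace": environ["USER_NAMESPACE"],
    }
```

For example, `resolve_job_settings(4, pservers=3)` keeps 3 pservers regardless of the derived value, while omitting `pservers` falls back to the placeholder rule.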