Simpler cluster train job submit code #2047

Closed
@typhoonzero

Description

@Yancey1989 wrote this job submission tool at: https://github.com/Yancey1989/paddle-job

Currently, submitting a job looks like this:

paddle.init(
    use_gpu=False,
    trainer_count=1,
    port=7164,
    ports_num=1,
    ports_num_for_sparse=1,
    num_gradient_servers=1,
    trainer_id=fetch_trainer_id(),
    pservers=fetch_pserver_ips())
job.dist_train(
    trainer=trainer,
    reader=paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=30,
    event_handler=event_handler,
    paddle_job=job.PaddleJob(
        pservers=3,
        base_image="yancey1989/paddle-cloud",
        input="/yanxu05",
        output="/yanxu05",
        job_name="paddle-cloud",
        namespace="yanxu",
        use_gpu=False,
        cpu_num=3,
        trainer_package_path="/example/word2vec",
        entry_point="python api_train_v2.py"))

We want to make it simpler, like this:

# Initialize from the "PADDLE_*" ENVs; arguments passed below override the ENVs
paddle.init(use_gpu=False)
...
myjob = job.dist_train(
        trainer=trainer,
        reader=my_dist_reader("dataset-name"),
        num_passes=30,
        event_handler=event_handler,
        paddle_job=job.PaddleJob(
            [cluster configurations...]))
print "view job status at: ", myjob.status_url()

Required ENVs:

  • "PADDLE_PSERVERS"
  • "PADDLE_TRAINER_ID"
  • "PADDLE_TRAINER_COUNT"
  • "PADDLE_NUM_GRADIENT_SERVERS"
  • "PADDLE_PORTS_NUM_FOR_SPARSE"

Optional ENVs:

  • "PADDLE_PORT": default 7164
  • "PADDLE_PORTS_NUM": default 1
  • "PADDLE_USE_GPU": default False

Cluster Job Configurations:

Job Resources

  • parallism: the degree of parallelism, equal to the number of trainers. The number of pservers is calculated from parallism.
  • num_gpus: GPU resources needed. If num_gpus == 0 while the ENV "PADDLE_USE_GPU" is set to True, or the opposite, Paddle will emit a warning when the job is submitted.
  • num_cpus: CPU resources needed
  • entry_point: command to start your training program, e.g. python /data/cloud/storage/path/train.py
  • NOTE: By default, Paddle mounts your cloud storage volume at /data, so your training program can read data anywhere under /data
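
The num_gpus / "PADDLE_USE_GPU" consistency check described above might look something like this at submit time. The function name check_gpu_settings and the warning texts are hypothetical; only the rule itself (warn when one says GPU and the other does not) comes from the list above.

```python
import warnings


def check_gpu_settings(num_gpus, use_gpu):
    """Hypothetical submit-time check; returns True when settings agree."""
    if use_gpu and num_gpus == 0:
        # GPU training requested, but no GPU resources were asked for.
        warnings.warn("PADDLE_USE_GPU is True but num_gpus == 0")
        return False
    if not use_gpu and num_gpus > 0:
        # GPU resources requested, but training would run on CPU.
        warnings.warn("num_gpus > 0 but PADDLE_USE_GPU is False")
        return False
    return True
```

The sketch only warns and does not abort the submission, matching the "warning message" wording above.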

Advanced settings:

  • pservers: if set, the number of pservers is set to this value instead of being automatically calculated from parallism.
  • base_image: use your own image to run
  • job_name: use your own job name
  • NOTE: namespace is read from ENV: "USER_NAMESPACE"
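
How the advanced settings interact with the defaults could be sketched as below. Both function names are hypothetical, and the automatic rule used here (one pserver per trainer) is purely a placeholder assumption, since this issue does not say how the number of pservers is derived from parallism.

```python
import os


def resolve_pservers(parallism, pservers=None, auto_rule=None):
    """Hypothetical: an explicit pservers value wins over the automatic one."""
    if pservers is not None:
        return pservers
    # ASSUMPTION: placeholder rule (one pserver per trainer); the real
    # calculation from parallism is not specified in this issue.
    rule = auto_rule if auto_rule is not None else (lambda n: n)
    return rule(parallism)


def resolve_namespace(environ=None):
    """Hypothetical: namespace is not an argument, it comes from the ENV."""
    environ = os.environ if environ is None else environ
    return environ["USER_NAMESPACE"]
```

Passing auto_rule makes it easy to swap in the real calculation once it is decided, without touching callers.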
