How to do testing during training when using ParallelExecutor.  #9571

@qingqing01

  1. How do we run a parallel test?
    Reference code: https://github.com/dzhwinter/benchmark/pull/91/files
    ParallelExecutor used for training:
    exe = fluid.ParallelExecutor(loss_name=avg_cost.name, use_cuda=True)
    for i in xrange(iterations):
        loss = exe.run([avg_cost.name])
    How should it be used for testing? Something like the following?
    test_exe = fluid.ParallelExecutor(loss_name=avg_cost.name, use_cuda=True, main_program=test_program)
    for i in xrange(test_iterations):
        loss, top1, top5 = test_exe.run([avg_cost.name, top1.name, top5.name])
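
For orientation, here is a minimal pure-Python sketch of what a data-parallel test pass has to do with the fetched values: split each batch across devices, evaluate each shard, and combine the per-device metrics into one number. It does not call Paddle at all. `split_batch` and `merge_metrics` are hypothetical helper names, and the sample-weighted average is an assumption, since the issue does not say how ParallelExecutor merges fetched values across devices.

```python
from typing import Dict, List, Sequence, Tuple


def split_batch(batch: Sequence, num_devices: int) -> List[Sequence]:
    """Split one batch into near-equal contiguous shards, one per device."""
    base, extra = divmod(len(batch), num_devices)
    shards, start = [], 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional sample each.
        size = base + (1 if dev < extra else 0)
        shards.append(batch[start:start + size])
        start += size
    return shards


def merge_metrics(per_device: List[Tuple[int, Dict[str, float]]]) -> Dict[str, float]:
    """Average per-device metrics, weighting each device by its sample count."""
    total = sum(count for count, _ in per_device)
    if total == 0:
        raise ValueError("no samples were evaluated")
    merged: Dict[str, float] = {}
    for count, metrics in per_device:
        for name, value in metrics.items():
            merged[name] = merged.get(name, 0.0) + value * count / total
    return merged
```

Weighting by sample count matters when the shards are uneven, for example on the last batch of a test set; a plain mean over devices would over-weight the smaller shards.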

ParallelExecutor currently has the following problems when used for testing:


  1. The ParallelExecutor constructor always runs a startup_program.
    • Code: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/parallel_executor.cc#L56
    • Open question: what should the startup_program be for a parallel test? Should parameter initialization be removed from the startup_program?
    • Should this be moved into Python?
        1. Delete exe.Run(startup_program, scope, 0); from the C++ code.
        1. Let the Python ParallelExecutor constructor decide whether to run initialization?
        class ParallelExecutor(object):
          def __init__(self, loss_name, use_cuda, num_threads=None,
                       main_program=None, startup_program=None, run_startup=True):
            # ...
            startup = startup_program if startup_program else framework.default_startup_program()
            if run_startup:
              place = core.CUDAPlace(0) if use_cuda else core.CPUPlace()
              exe = executor.Executor(place)
              exe.run(startup)

  1. Regardless of which Program it is given, ParallelExecutor always creates grad vars and inserts the NCCLAllReduceOp used for gradient aggregation.

  1. ParallelExecutor does not support the Python data reader, so RecordIO is used.
    Reference code: https://github.com/dzhwinter/benchmark/pull/91/files
    The train Program is defined as follows. The training data path ./flowers.train.recordio and its related vars are part of the train Program.
    with fluid.program_guard(main, startup):
        reader = fluid.layers.open_recordio_file(
            filename='./flowers.train.recordio',
            shapes=[[-1, 3, 224, 224], [-1, 1]],
            lod_levels=[0, 0],
            dtypes=['float32', 'int64'])
        image, label = fluid.layers.read_file(reader)
        prediction, avg_cost, accuracy, accuracy5 = net_conf(image, label, class_dim)
    The problem: at test time the data path is './flowers.test.recordio', which differs from training. How do we obtain the test Program?

  1. fluid.ParallelExecutor takes a loss_name argument. How should it be specified for testing?
    Code: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/parallel_executor.py#L24
    Judging from https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/details/multi_devices_graph_builder.cc#L94, loss_name appears to be a helper variable used to separate forward ops from backward ops so that the grad aggregation op can be inserted.
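
To make the last point and the grad-aggregation point concrete, the following is a toy pure-Python model of a per-device graph builder. It is not Paddle's C++ builder. It only illustrates one possible behaviour under stated assumptions: ops are plain dicts, `build_device_ops` and `param_names` are hypothetical names, the op that writes `loss_name` marks the end of the forward pass, and an all-reduce is appended only after backward ops that write a parameter gradient (a variable named with fluid's `@GRAD` suffix). With `loss_name=None`, or with a forward-only program, nothing is inserted.

```python
from typing import Dict, List, Optional, Set

GRAD_SUFFIX = "@GRAD"  # suffix fluid uses for gradient variable names

Op = Dict[str, object]  # toy op: {"type": str, "outputs": [str, ...]}


def build_device_ops(ops: List[Op],
                     param_names: Set[str],
                     loss_name: Optional[str] = None) -> List[Op]:
    """Copy `ops`, appending an all_reduce after each backward op that
    writes the gradient of a parameter listed in `param_names`."""
    result: List[Op] = []
    is_forwarding = True
    for op in ops:
        result.append(op)
        outputs = op["outputs"]
        if is_forwarding:
            # The op producing the loss is the forward/backward boundary.
            if loss_name is not None and loss_name in outputs:
                is_forwarding = False
            continue
        for out in outputs:
            is_param_grad = (out.endswith(GRAD_SUFFIX)
                             and out[:-len(GRAD_SUFFIX)] in param_names)
            if is_param_grad:
                result.append({"type": "all_reduce", "outputs": [out]})
    return result


forward = [
    {"type": "mul", "outputs": ["fc_out"]},
    {"type": "mean", "outputs": ["avg_cost"]},
]
backward = [
    {"type": "mean_grad", "outputs": ["fc_out@GRAD"]},
    {"type": "mul_grad", "outputs": ["w@GRAD"]},
]

train_ops = build_device_ops(forward + backward, {"w"}, loss_name="avg_cost")
test_ops = build_device_ops(forward, {"w"}, loss_name=None)
```

In this model a test program needs no special flag: because it contains no backward ops, passing a loss name or omitting it gives the same result, which suggests `loss_name` could be made optional for testing.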
