How to do testing during training when using ParallelExecutor (#9571)

1. How do we run tests in parallel?
    Reference code: https://github.com/dzhwinter/benchmark/pull/91/files
    ParallelExecutor used for training:
    ```python
    # avg_cost and iterations come from the surrounding training script.
    exe = fluid.ParallelExecutor(loss_name=avg_cost.name, use_cuda=True)
    for i in range(iterations):
        loss = exe.run([avg_cost.name])
    ```
    How should it be used for testing? Something like the following?
    ```python
    test_exe = fluid.ParallelExecutor(loss_name=avg_cost.name, use_cuda=True, main_program=test_program)
    for i in range(test_iterations):
        # Fetch into separate names so top1/top5 keep referring to the variables.
        loss_v, top1_v, top5_v = test_exe.run([avg_cost.name, top1.name, top5.name])
    ```

**Using ParallelExecutor for testing currently has the following problems:**

----
2.  The ParallelExecutor constructor **always** runs a startup_program.
    - Code: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/parallel_executor.cc#L56
    - Problem: what should the startup_program be for parallel testing? Does parameter initialization need to be removed from it?
    - Should this step be moved into Python?
        - 1) Remove the C++ call [`exe.Run(startup_program, scope, 0);`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/parallel_executor.cc#L56)
        - 2) Let the Python ParallelExecutor constructor decide whether to run initialization?
          ```python
          class ParallelExecutor(object):
            def __init__(self, loss_name, use_cuda, num_threads=None,
                 main_program=None, startup_program=None, run_startup=True):
              # ...
              startup = startup_program if startup_program else framework.default_startup_program()
              if run_startup:
                place = core.CUDAPlace(0) if use_cuda else core.CPUPlace()
                exe = executor.Executor(place)
                exe.run(startup)
          ```
---
3. Whatever Program is passed in, ParallelExecutor **always** creates grad vars and inserts the `NCCLAllReduceOp` used for gradient aggregation.
    - Where gradient aggregation is inserted:
        - 1) `MultiDevSSAGraphBuilder` is created: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/parallel_executor.cc#L75
        - 2) Its constructor always creates the grad vars: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/details/multi_devices_graph_builder.cc#L50
        - 3) `NCCLAllReduceOp` is inserted: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/details/multi_devices_graph_builder.cc#L130
    - Problem: **parallel testing does not need any of these operations**
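
To make the point above concrete, here is a minimal pure-Python sketch of a graph builder that skips grad vars and all-reduce ops when building for test. It is a toy model, not Paddle code: `build_graph`, the `for_test` flag, and the string-based op/var names are all assumptions made for illustration and do not exist in `MultiDevSSAGraphBuilder`.

```python
# Toy model only: strings stand in for ops and vars. None of these names
# are Paddle APIs; `for_test` is a hypothetical flag.

def build_graph(ops, params, num_devices, for_test=False):
    """Replicate ops on every device; add grad vars and all-reduce ops only for training."""
    graph = {"vars": [], "ops": []}
    for dev in range(num_devices):
        # Every device gets its own copy of the forward ops.
        graph["ops"].extend("%s@dev%d" % (op, dev) for op in ops)
        if not for_test:
            # Per-device gradient variables are only needed for training.
            graph["vars"].extend("%s@GRAD@dev%d" % (p, dev) for p in params)
    if not for_test:
        # One aggregation op per parameter gradient, again training only.
        graph["ops"].extend("all_reduce(%s@GRAD)" % p for p in params)
    return graph


train_graph = build_graph(["fc", "loss"], ["fc.w", "fc.b"], num_devices=2)
test_graph = build_graph(["fc", "loss"], ["fc.w", "fc.b"], num_devices=2, for_test=True)
```

With this kind of switch the test graph contains only the replicated forward ops, which is the behavior item 3 is asking for.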

---
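4.  ParallelExecutor does not support Python data readers; data is read from RecordIO files.
    Reference code: https://github.com/dzhwinter/benchmark/pull/91/files
    The train Program is defined as follows. The training data path `./flowers.train.recordio` and the related vars are part of the train Program.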
    ```python
    with fluid.program_guard(main, startup):
        reader = fluid.layers.open_recordio_file(
            filename='./flowers.train.recordio',
            shapes=[[-1, 3, 224, 224], [-1, 1]],
            lod_levels=[0, 0],
            dtypes=['float32', 'int64'])
        image, label = fluid.layers.read_file(reader)
        prediction, avg_cost, accuracy, accuracy5 = net_conf(image, label, class_dim)
    ```
    The problem: for testing, the data path is `'./flowers.test.recordio'`, which differs from training. How do we obtain the test Program?
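
One possible direction, sketched in plain Python rather than Paddle: build the train and test programs from the same network function and pass the data path as an argument. Everything here is a stand-in (`build_program`, the toy `net_conf`, and the op strings are hypothetical), and the sketch says nothing about how parameters would be shared between the two programs in Paddle.

```python
# Toy model only: a "program" is a list of op descriptions. These helpers are
# hypothetical and merely mirror the names used in the snippet above.

def net_conf(image, label):
    """Stand-in for the real network definition."""
    return ["conv(%s)" % image, "cross_entropy(%s)" % label]


def build_program(recordio_path, is_test):
    """Build a program whose reader op points at the given RecordIO file."""
    ops = ["open_recordio_file(%r)" % recordio_path, "read_file"]
    ops += net_conf("image", "label")
    if not is_test:
        # Only the train program carries backward and optimizer ops.
        ops.append("backward_and_optimize")
    return ops


train_program = build_program("./flowers.train.recordio", is_test=False)
test_program = build_program("./flowers.test.recordio", is_test=True)
```

The two programs then differ only in the reader's file name and in the training-only ops.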

----
5.  `fluid.ParallelExecutor` takes a `loss_name` argument. What should be passed for testing?
    The code is at https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/parallel_executor.py#L24 .
    Judging from https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/details/multi_devices_graph_builder.cc#L94 , `loss_name` appears to be an auxiliary variable used to separate forward ops from backward ops so that the grad aggregation ops can be inserted.
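
A rough illustration of that reading of `loss_name`, in plain Python: ops up to and including the one that produces the loss count as forward, and everything after it as backward. The `split_at_loss` helper and the `(op_type, outputs)` tuples are assumptions for this sketch and are not taken from the Paddle source; treating a missing `loss_name` as "all forward" is likewise only one possible convention.

```python
# Toy model only: each op is a (type, output_names) tuple; not a Paddle API.

def split_at_loss(ops, loss_name):
    """Return (forward_ops, backward_ops), split after the op that outputs loss_name."""
    for i, (_, outputs) in enumerate(ops):
        if loss_name in outputs:
            return ops[:i + 1], ops[i + 1:]
    # No op produces loss_name (e.g. loss_name is None): treat all ops as forward.
    return ops, []


train_ops = [
    ("mul", ["fc_out"]),
    ("mean", ["avg_cost"]),
    ("mean_grad", ["fc_out@GRAD"]),
    ("mul_grad", ["w@GRAD"]),
]
forward, backward = split_at_loss(train_ops, "avg_cost")
test_forward, test_backward = split_at_loss(train_ops[:2], None)
```

Under this convention a test program could pass no `loss_name` at all, since there is no backward section to locate.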
    
