Fused ImageDecoderRandomCropResize #3644
Conversation
std::shared_ptr<BlockingCounter> done_counter(new BlockingCounter(workers_.size()));
std::shared_ptr<std::atomic<int>> task_counter(new std::atomic<int>(0));
std::shared_ptr<std::vector<Task>> tasks(new std::vector<Task>(batch_size));
These pointers could probably all be unique_ptr; it has lower overhead.
> These pointers could probably all be unique_ptr; it has lower overhead.

These objects are shared by multiple workers.
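To make the shared-ownership point concrete, here is a minimal sketch (not the PR's actual code) of how one Work object and the counters inside it end up referenced by every worker at once. The field names mirror the snippet above, while Channel's Send method, the Task type, and the exact Work layout are assumed for illustration.

```cpp
#include <atomic>
#include <memory>
#include <vector>

struct Work {
  std::shared_ptr<BlockingCounter> done_counter;   // the producer waits on this
  std::shared_ptr<std::atomic<int>> task_counter;  // workers claim task ids from this
  std::shared_ptr<std::vector<Task>> tasks;        // one entry per image in the batch
};

// Every worker queue receives a pointer to the same Work, i.e. genuinely shared
// ownership, which unique_ptr could not express without extra lifetime bookkeeping.
void Dispatch(std::vector<Channel<std::shared_ptr<Work>>>& worker_queues, int batch_size) {
  auto work = std::make_shared<Work>();
  work->done_counter = std::make_shared<BlockingCounter>(worker_queues.size());
  work->task_counter = std::make_shared<std::atomic<int>>(0);
  work->tasks = std::make_shared<std::vector<Task>>(batch_size);
  for (auto& queue : worker_queues) { queue.Send(work); }  // same Work handed to each worker
  // ... block on *work->done_counter until every worker has finished ...
}
```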
std::shared_ptr<std::atomic<int>> task_counter;
};

void GenerateRandomCropRoi(RandomCropGenerator* crop_generator, int width, int height, int* roi_x,
Shouldn't Roi be named Region instead? Because ROI means region of interest.
> Shouldn't Roi be named Region instead? Because ROI means region of interest.

ROI should be the standard term in image processing; see https://docs.nvidia.com/cuda/archive/10.2/nvjpeg/index.html#nvjpeg-decode-params-set-roi, which also calls it roi.
dev_allocator_.dev_malloc = &GpuDeviceMalloc;
dev_allocator_.dev_free = &GpuDeviceFree;
pinned_allocator_.pinned_malloc = &GpuPinnedMalloc;
pinned_allocator_.pinned_free = &GpuPinnedFree;
Is there room for optimization here? Could this later use the memory allocator that 成诚 wrote? If so, add a TODO(liujuncheng).
warmup_done_ = true;
}

void GpuDecodeHandle::Synchronize() { OF_CUDA_CHECK(cudaStreamSynchronize(cuda_stream_)); }
Could this hit the problem from late last year, where a different Linux kernel version made thread wake-up latency longer so that this ended up slower than a busy loop?
> Could this hit the problem from late last year, where a different Linux kernel version made thread wake-up latency longer so that this ended up slower than a busy loop?

That possibility can't be ruled out, but we haven't run into it so far.

The current version is a basic implementation; there is still a lot of room to optimize how streams, the allocator, and so on are used. This part may be refactored later as needed, and the way streams are used then will differ from the current approach.
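For reference, a hedged sketch of the busy-loop alternative being discussed (not part of this PR): spin on cudaStreamQuery instead of calling cudaStreamSynchronize, so the waiting thread is never put to sleep by the kernel scheduler. Whether this is actually faster depends on the wake-up behavior mentioned above.

```cpp
#include <cuda_runtime.h>

// Busy-wait until all work queued on `stream` has finished, instead of blocking
// in cudaStreamSynchronize. cudaStreamQuery returns cudaErrorNotReady while work
// is still pending and cudaSuccess once the stream is drained.
void BusyWaitStream(cudaStream_t stream) {
  while (true) {
    cudaError_t err = cudaStreamQuery(stream);
    if (err == cudaSuccess) { return; }                     // stream drained
    if (err != cudaErrorNotReady) { OF_CUDA_CHECK(err); }   // surface real errors
    // otherwise keep spinning (optionally yield or pause here)
  }
}
```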
* Add ImageDecoderRandomCropResizeOp::InferParallelSignature
* Add Scope::scope_proto
TaskType GetTaskType() const override { return TaskType::kDecodeH2D; }
CudaWorkType GetCudaWorkType() const override {
#ifdef WITH_CUDA
Should the CUDA version also be checked here?
> Should the CUDA version also be checked here?

No version check is needed here. DecodeH2D is for ops that can decode data directly from the CPU to the GPU, similar to CopyHD; but CopyHD is asynchronous while DecodeH2D may not be able to run asynchronously, so it was made a separate task type to avoid affecting CopyHD. Only nvjpeg has the CUDA version restriction.
If an independent thread is needed, this class needs to override TaskNode::IsIndependent().
> If an independent thread is needed, this class needs to override TaskNode::IsIndependent().

It uses a specific thread on the GPU side, not an independent thread; the GPU probably doesn't support independent threads at the moment anyway.
}

void DecodeH2DCompTaskNode::ProduceAllRegstsAndBindEdges() {
  std::shared_ptr<RegstDesc> out_regst = ProduceRegst("out", false, 2, 2);
I feel that once this DecodeH2D is in use, our old trick of forcing the CopyH2D regst num to 2 can be dropped. Could that also remove the 2x model GPU memory overhead introduced by boxing v2?
out->mut_shape() = Shape(out_dim_vec);
BlobDesc* tmp = GetBlobDesc4BnInOp("tmp");
tmp->set_data_type(DataType::kUInt8);
tmp->mut_shape() = Shape({conf.max_num_pixels() * 3 * conf.num_workers()});
This tmp just force-allocates a "large enough" buffer, right? Could the shape still be too small in extreme cases (one especially large image)? And in most cases it will waste a lot of memory.

DALI's approach is to allocate GPU memory dynamically (but very conservatively: it initially allocates 1 MB, which is enough in the vast majority of cases). If the memory is insufficient, it frees the buffer and allocates a larger one; if the following images are all small, the memory is freed again and a smaller buffer is allocated. (Although for deep-learning workloads, reallocating memory like this doesn't gain much.)

We can keep the static large-buffer scheme for now, but the threshold needs to be set sensibly, so that an oversized tmp doesn't keep us from using a large enough batch size.
Oh, my understanding here was off; the kernel does have an interface for allocating temporary memory. What is this tmp blob used for, then?
> This tmp just force-allocates a "large enough" buffer, right? […] We can keep the static large-buffer scheme for now, but the threshold needs to be set sensibly, so that an oversized tmp doesn't keep us from using a large enough batch size.

This buffer is the temporary memory between decode and resize; it is used entirely by ourselves, so we didn't use a dynamic scheme.

Unless there is a well-implemented memory pool, dynamically allocating memory here hurts performance a lot; the tmp buffer and the warmup both exist to avoid memory reallocation.

The current tmp size is based on the ImageNet dataset plus some headroom, which is enough for most scenarios.
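As a reading aid for the tmp shape above (max_num_pixels * 3 * num_workers), here is a small sketch of how the blob is presumably partitioned: each worker gets a private uint8 slice large enough for one decoded RGB image, so decode can write and resize can read without cross-worker synchronization. The helper name is hypothetical.

```cpp
#include <cstdint>

// Return the start of worker `worker_id`'s private slice inside the tmp blob.
// Each slice holds one decoded RGB image of at most max_num_pixels pixels.
unsigned char* WorkerWorkspace(unsigned char* tmp_ptr, int64_t max_num_pixels, int worker_id) {
  const int64_t slice_size = max_num_pixels * 3;  // uint8, 3 channels per pixel
  return tmp_ptr + worker_id * slice_size;
}
```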
What is the warmup logic here? On the first run, does it grab all the remaining memory for decoding?
void GpuDecodeHandle::WarmupOnce(int warmup_size, unsigned char* workspace, size_t workspace_size) {
  if (warmup_done_) { return; }
  warmup_size = std::min(static_cast<int>(std::sqrt(workspace_size / kNumChannels)), warmup_size);
  cv::Mat image = cv::Mat::zeros(cv::Size(warmup_size, warmup_size), CV_8UC3);
So the logic here is to create an all-zero image of size (warmup_size, warmup_size, 3) and then try to decode it?
Is this warmup size configured by the user? I see it is passed in from conf. How would a user know what value to pass?
Or can this size actually be set arbitrarily, and during training, whenever a larger image is encountered, the re-malloc logic is triggered?
I looked at the DALI docs; DALI defaults to 16 MB of device memory, which is usually enough:

device_memory_padding (int, optional, default = 16777216) –
`mixed` backend only. Padding for nvJPEG's device memory allocations
in bytes. This parameter helps to avoid reallocation in nvJPEG whenever
a bigger image is encountered and the internal buffer needs to be
reallocated to decode it.
> I looked at the DALI docs; DALI defaults to 16 MB of device memory, which is usually enough:
>
> device_memory_padding (int, optional, default = 16777216) – `mixed` backend only. Padding for nvJPEG's device memory allocations in bytes. This parameter helps to avoid reallocation in nvJPEG whenever a bigger image is encountered and the internal buffer needs to be reallocated to decode it.

That means device memory is requested in units of device_memory_padding when allocating, to avoid frequent reallocation.

When processing the ImageNet dataset, a single image can need up to roughly 120 MB of device memory.

> Or can this size actually be set arbitrarily, and during training, whenever a larger image is encountered, the re-malloc logic is triggered?

Yes.
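Putting the warmup discussion together, here is a hedged sketch of the idea (not the actual WarmupOnce body): synthesize a large all-zero image, JPEG-encode it on the host with OpenCV, and push it once through the regular decode path so nvjpeg/npp allocate their internal device buffers up front. The `decode` callback stands in for the handle's real decode entry point, whose signature is not shown in the diff.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>
#include <opencv2/opencv.hpp>

void WarmupSketch(int warmup_size, size_t workspace_size,
                  const std::function<void(const unsigned char*, size_t)>& decode) {
  // Cap the warmup image so that warmup_size * warmup_size * 3 bytes fit in the workspace.
  warmup_size = std::min(static_cast<int>(std::sqrt(workspace_size / 3)), warmup_size);
  cv::Mat image = cv::Mat::zeros(cv::Size(warmup_size, warmup_size), CV_8UC3);
  std::vector<unsigned char> jpeg;
  cv::imencode(".jpg", image, jpeg);  // host-side JPEG bytes of the synthetic image
  decode(jpeg.data(), jpeg.size());   // first decode triggers the large nvjpeg allocations
}
```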
random_aspect_ratio: Optional[Sequence[float]] = None,
num_workers: Optional[int] = None,
warmup_size: Optional[int] = None,
max_num_pixels: Optional[int] = None,
num_workers, warmup_size, and max_num_pixels are all very important parameters. API docstrings need to be added here explaining what these parameters do and what values are reasonable.

A brief description of this op's design and how it works is also needed.
If possible, add a unit test after the PR is merged so that CI can help check it.
Adding a unit test also lets others see how this op is used. If that's inconvenient, then add an example under /oneflow/oneflow/python/test/customized (just paste the usage code from the first comment of this PR), like the new_data_pre.py I added earlier.
> num_workers, warmup_size, and max_num_pixels are all very important parameters. API docstrings need to be added here explaining what these parameters do and what values are reasonable. A brief description of this op's design and how it works is also needed.

The GPU path supports warmup: on the first run it can have nvjpeg allocate as much memory as it is likely to need, avoiding memory allocation while running.

> What is the warmup logic here? On the first run, does it grab all the remaining memory for decoding?

It has nvjpeg allocate enough memory in one go, avoiding later memory reallocation.

This new op should also be added to the CNN scripts in benchmark afterwards, right? @liujuncheng @ouyangyu Or has that already been updated?
    .RemoteBlobList()[0]
)
Why do we need an OFRecordBytesDecoder? Decoding directly from the bytes_list inside the OFRecord should also work, since the OFRecord is already in memory.
> Why do we need an OFRecordBytesDecoder? Decoding directly from the bytes_list inside the OFRecord should also work, since the OFRecord is already in memory.

In the future there may be more dataset formats to support, e.g. reading the ImageNet files directly instead of converting them to OFRecord, or reading TFRecord. Then only the corresponding bytes decoder needs to be implemented, without modifying every DecodeOp.

The change this causes is small; it's fine to adjust it after the merge.
};
for (int64_t i = 0; i < batch_size; ++i) {
  random_crop_generators_.at(i).reset(
      new RandomCropGenerator(aspect_ratio_range, area_range, seeds.at(i), conf.num_attempts()));
random_crop_generators_ is sized to batch_size; couldn't it also be sized to num_workers?

Does the GPU decoder also need multi-threaded decoding?
Is this version of the CPU decoder also a bit faster than the original CPU decoder?
> random_crop_generators_ is sized to batch_size; couldn't it also be sized to num_workers?
> Does the GPU decoder also need multi-threaded decoding?

It's sized to batch_size because the order in which workers process images is nondeterministic. One generator per batch index gives a deterministic crop_window (deterministic meaning: with the same seed, a given image in a batch always gets the same crop_window, so runs are reproducible). It also keeps the behavior consistent with the existing RandomCrop.
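To illustrate the reproducibility argument, a schematic sketch: any worker can compute the ROI for a given batch index because the per-index generator alone (with its seed) determines the result. The roi_y/roi_w/roi_h outputs are assumed for illustration, since the diff truncates the full GenerateRandomCropRoi parameter list.

```cpp
// Same seed + same image size => same ROI for batch index i, no matter which
// worker calls this or in what order the workers run.
void RoiForBatchIndex(const std::vector<std::shared_ptr<RandomCropGenerator>>& generators,
                      int batch_index, int width, int height,
                      int* roi_x, int* roi_y, int* roi_w, int* roi_h) {
  GenerateRandomCropRoi(generators.at(batch_index).get(), width, height, roi_x, roi_y,
                        roi_w, roi_h);
}
```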
> Does the GPU decoder also need multi-threaded decoding?

The GPU decoder also puts a heavy load on the CPU, just not as heavy as the pure-CPU path.
Channel<std::shared_ptr<Work>> work_queue_;
std::thread worker_thread_;

void PollWork(const std::function<std::shared_ptr<DecodeHandle>()>& handle_factory,
This involves std::thread, Channel, and BlockingCounter. Isn't that exactly what MultiThreadLoop does?

Keep the overall state in the kernel object or in local variables of ForwardDataContent, and have MultiThreadLoop fetch each piece of state by i; wouldn't that simplify most of the work here?
> This involves std::thread, Channel, and BlockingCounter. Isn't that exactly what MultiThreadLoop does?
> Keep the overall state in the kernel object or in local variables of ForwardDataContent, and have MultiThreadLoop fetch each piece of state by i; wouldn't that simplify most of the work here?

MultiThreadLoop suits the case where every sub-task is roughly balanced. Here the cost of each sub-task is very uneven and cannot be estimated accurately in advance.

For example, the largest image may be 8000x6000, and decoding that one image can take longer than decoding all the others combined. Ideally one worker should handle just that image while the other workers handle the rest; that is the point of sorting by size in descending order and handing out tasks through an atomic int.
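A compact sketch of that scheduling scheme, with Task's size field assumed for illustration: tasks are sorted by decreasing size, and each worker repeatedly claims the next task id with fetch_add, so a single 8000x6000 image naturally occupies one worker while the others drain the rest.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Sort so the largest (slowest-to-decode) images are handed out first.
void SortTasksBySizeDesc(std::vector<Task>* tasks) {
  std::sort(tasks->begin(), tasks->end(),
            [](const Task& a, const Task& b) { return a.data_size > b.data_size; });
}

// Each worker thread runs this loop: claim a task id, process it, repeat until
// the counter runs past the end of the task list.
void WorkerDrain(std::atomic<int>* task_counter, const std::vector<Task>& tasks) {
  while (true) {
    const int task_id = task_counter->fetch_add(1, std::memory_order_relaxed);
    if (task_id >= static_cast<int>(tasks.size())) { break; }  // nothing left to claim
    // decode / random-crop / resize tasks[task_id] with this worker's private handle ...
  }
}
```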
If the concern is that sharing the common compute thread pool hurts performance, we could add a thread-pool parameter to MultiThreadLoop: create a thread pool at kernel initialization and pass it to the later MultiThreadLoop calls.
You seem to have pointed out a real shortcoming of the current MultiThreadLoop. MultiThreadLoop should also be optimized in this direction in the future.
CHECK_EQ(status, ChannelStatus::kChannelStatusSuccess);
handle->WarmupOnce(warmup_size, work->workspace, work->workspace_size);
while (true) {
  const int task_id = work->task_counter->fetch_add(1, std::memory_order_relaxed);
All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?

Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?
> All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?
> Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?

Essentially it's a thread pool with a shared task queue. If we implemented a thread pool based on a shared, lock-free task queue, the effect would be the same.
> All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?
> Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?

Another reason is that the nvjpeg handles are not thread-safe.
* ofrecord_bytes_decoder op: decodes the bytes_list in an OFRecord into a TensorBuffer.
* image_decoder_random_crop_resize op: implements decode + random_crop + resize; the input is a TensorBuffer and the output has a static shape; currently only NHWC RGB is supported.
* image_decoder_random_crop_resize supports GPU on CUDA 10.2 and above, using nvjpeg for decode and npp for resize.
* DecodeH2D task node type: similar to CopyH2D, it consumes host memory and produces GPU memory; the GPU image_decoder_random_crop_resize runs on the DecodeH2D thread.

Usage

Performance
ResNet50 FP16, 8 * V100, bs=256 (GPU decoder vs. CPU decoder throughput).

Accuracy
ResNet50 FP16: 50 epochs 75.8%+, 90 epochs 77%+.

Known issues
May conflict with the current conv algorithm search; enabling the train_config.cudnn_conv_heuristic_search_algo(True) option may improve performance.

TODO