Fused ImageDecoderRandomCropResize #3644
Conversation
std::shared_ptr<BlockingCounter> done_counter(new BlockingCounter(workers_.size()));
std::shared_ptr<std::atomic<int>> task_counter(new std::atomic<int>(0));
std::shared_ptr<std::vector<Task>> tasks(new std::vector<Task>(batch_size));
These pointers could probably all be unique_ptr; it has lower overhead.
> These pointers could probably all be unique_ptr; it has lower overhead.

These objects are shared by multiple workers.
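To make the shared-ownership point concrete, here is a minimal sketch (not the PR's actual code) of how one Work object and the counters inside it end up referenced by every worker at once. The field names mirror the snippet above, while Channel's Send method, the Task type, and the exact Work layout are assumed for illustration.

```cpp
#include <atomic>
#include <memory>
#include <vector>

struct Work {
  std::shared_ptr<BlockingCounter> done_counter;   // the producer waits on this
  std::shared_ptr<std::atomic<int>> task_counter;  // workers claim task ids from this
  std::shared_ptr<std::vector<Task>> tasks;        // one entry per image in the batch
};

// Every worker queue receives a pointer to the same Work, i.e. genuinely shared
// ownership, which unique_ptr could not express without extra lifetime bookkeeping.
void Dispatch(std::vector<Channel<std::shared_ptr<Work>>>& worker_queues, int batch_size) {
  auto work = std::make_shared<Work>();
  work->done_counter = std::make_shared<BlockingCounter>(worker_queues.size());
  work->task_counter = std::make_shared<std::atomic<int>>(0);
  work->tasks = std::make_shared<std::vector<Task>>(batch_size);
  for (auto& queue : worker_queues) { queue.Send(work); }  // same Work handed to each worker
  // ... block on *work->done_counter until every worker has finished ...
}
```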
std::shared_ptr<std::atomic<int>> task_counter;
};

void GenerateRandomCropRoi(RandomCropGenerator* crop_generator, int width, int height, int* roi_x,
Shouldn't Roi be named Region instead? Because ROI means region of interest.
> Shouldn't Roi be named Region instead? Because ROI means region of interest.

ROI should be the standard term in image processing; see https://docs.nvidia.com/cuda/archive/10.2/nvjpeg/index.html#nvjpeg-decode-params-set-roi, which also calls it roi.
dev_allocator_.dev_malloc = &GpuDeviceMalloc;
dev_allocator_.dev_free = &GpuDeviceFree;
pinned_allocator_.pinned_malloc = &GpuPinnedMalloc;
pinned_allocator_.pinned_free = &GpuPinnedFree;
Is there room for optimization here? Could this later use the memory allocator that 成诚 wrote? If so, add a TODO(liujuncheng).
warmup_done_ = true;
}

void GpuDecodeHandle::Synchronize() { OF_CUDA_CHECK(cudaStreamSynchronize(cuda_stream_)); }
Could this hit the problem from late last year, where a different Linux kernel version made thread wake-up latency longer so that this ended up slower than a busy loop?
> Could this hit the problem from late last year, where a different Linux kernel version made thread wake-up latency longer so that this ended up slower than a busy loop?

That possibility can't be ruled out, but we haven't run into it so far.

The current version is a basic implementation; there is still a lot of room to optimize how streams, the allocator, and so on are used. This part may be refactored later as needed, and the way streams are used then will differ from the current approach.
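For reference, a hedged sketch of the busy-loop alternative being discussed (not part of this PR): spin on cudaStreamQuery instead of calling cudaStreamSynchronize, so the waiting thread is never put to sleep by the kernel scheduler. Whether this is actually faster depends on the wake-up behavior mentioned above.

```cpp
#include <cuda_runtime.h>

// Busy-wait until all work queued on `stream` has finished, instead of blocking
// in cudaStreamSynchronize. cudaStreamQuery returns cudaErrorNotReady while work
// is still pending and cudaSuccess once the stream is drained.
void BusyWaitStream(cudaStream_t stream) {
  while (true) {
    cudaError_t err = cudaStreamQuery(stream);
    if (err == cudaSuccess) { return; }                     // stream drained
    if (err != cudaErrorNotReady) { OF_CUDA_CHECK(err); }   // surface real errors
    // otherwise keep spinning (optionally yield or pause here)
  }
}
```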
* Add ImageDecoderRandomCropResizeOp::InferParallelSignature
* Add Scope::scope_proto
TaskType GetTaskType() const override { return TaskType::kDecodeH2D; }
CudaWorkType GetCudaWorkType() const override {
#ifdef WITH_CUDA
Should the CUDA version also be checked here?
> Should the CUDA version also be checked here?

No version check is needed here. DecodeH2D is for ops that can decode data directly from the CPU to the GPU, similar to CopyHD; but CopyHD is asynchronous while DecodeH2D may not be able to run asynchronously, so it was made a separate task type to avoid affecting CopyHD. Only nvjpeg has the CUDA version restriction.
If an independent thread is needed, this class needs to override TaskNode::IsIndependent().
> If an independent thread is needed, this class needs to override TaskNode::IsIndependent().

It uses a specific thread on the GPU side, not an independent thread; the GPU probably doesn't support independent threads at the moment anyway.
}

void DecodeH2DCompTaskNode::ProduceAllRegstsAndBindEdges() {
  std::shared_ptr<RegstDesc> out_regst = ProduceRegst("out", false, 2, 2);
I feel that once this DecodeH2D is in use, our old trick of forcing the CopyH2D regst num to 2 can be dropped. Could that also remove the 2x model GPU memory overhead introduced by boxing v2?
out->mut_shape() = Shape(out_dim_vec);
BlobDesc* tmp = GetBlobDesc4BnInOp("tmp");
tmp->set_data_type(DataType::kUInt8);
tmp->mut_shape() = Shape({conf.max_num_pixels() * 3 * conf.num_workers()});
This tmp just force-allocates a "large enough" buffer, right? Could the shape still be too small in extreme cases (one especially large image)? And in most cases it will waste a lot of memory.

DALI's approach is to allocate GPU memory dynamically (but very conservatively: it initially allocates 1 MB, which is enough in the vast majority of cases). If the memory is insufficient, it frees the buffer and allocates a larger one; if the following images are all small, the memory is freed again and a smaller buffer is allocated. (Although for deep-learning workloads, reallocating memory like this doesn't gain much.)

We can keep the static large-buffer scheme for now, but the threshold needs to be set sensibly, so that an oversized tmp doesn't keep us from using a large enough batch size.
Oh, my understanding here was off; the kernel does have an interface for allocating temporary memory. What is this tmp blob used for, then?
> This tmp just force-allocates a "large enough" buffer, right? […] We can keep the static large-buffer scheme for now, but the threshold needs to be set sensibly, so that an oversized tmp doesn't keep us from using a large enough batch size.

This buffer is the temporary memory between decode and resize; it is used entirely by ourselves, so we didn't use a dynamic scheme.

Unless there is a well-implemented memory pool, dynamically allocating memory here hurts performance a lot; the tmp buffer and the warmup both exist to avoid memory reallocation.

The current tmp size is based on the ImageNet dataset plus some headroom, which is enough for most scenarios.
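As a reading aid for the tmp shape above (max_num_pixels * 3 * num_workers), here is a small sketch of how the blob is presumably partitioned: each worker gets a private uint8 slice large enough for one decoded RGB image, so decode can write and resize can read without cross-worker synchronization. The helper name is hypothetical.

```cpp
#include <cstdint>

// Return the start of worker `worker_id`'s private slice inside the tmp blob.
// Each slice holds one decoded RGB image of at most max_num_pixels pixels.
unsigned char* WorkerWorkspace(unsigned char* tmp_ptr, int64_t max_num_pixels, int worker_id) {
  const int64_t slice_size = max_num_pixels * 3;  // uint8, 3 channels per pixel
  return tmp_ptr + worker_id * slice_size;
}
```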
What is the warmup logic here? On the first run, does it grab all the remaining memory for decoding?
void GpuDecodeHandle::WarmupOnce(int warmup_size, unsigned char* workspace, size_t workspace_size) {
  if (warmup_done_) { return; }
  warmup_size = std::min(static_cast<int>(std::sqrt(workspace_size / kNumChannels)), warmup_size);
  cv::Mat image = cv::Mat::zeros(cv::Size(warmup_size, warmup_size), CV_8UC3);
So the logic here is to create an all-zero image of size (warmup_size, warmup_size, 3) and then try to decode it?
Is this warmup size configured by the user? I see it is passed in from conf. How would a user know what value to pass?
Or can this size actually be set arbitrarily, and during training, whenever a larger image is encountered, the re-malloc logic is triggered?
I looked at the DALI docs; DALI defaults to 16 MB of device memory, which is usually enough:

device_memory_padding (int, optional, default = 16777216) –
`mixed` backend only. Padding for nvJPEG's device memory allocations
in bytes. This parameter helps to avoid reallocation in nvJPEG whenever
a bigger image is encountered and the internal buffer needs to be
reallocated to decode it.
> I looked at the DALI docs; DALI defaults to 16 MB of device memory, which is usually enough:
>
> device_memory_padding (int, optional, default = 16777216) – `mixed` backend only. Padding for nvJPEG's device memory allocations in bytes. This parameter helps to avoid reallocation in nvJPEG whenever a bigger image is encountered and the internal buffer needs to be reallocated to decode it.

That means device memory is requested in units of device_memory_padding when allocating, to avoid frequent reallocation.

When processing the ImageNet dataset, a single image can need up to roughly 120 MB of device memory.

> Or can this size actually be set arbitrarily, and during training, whenever a larger image is encountered, the re-malloc logic is triggered?

Yes.
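Putting the warmup discussion together, here is a hedged sketch of the idea (not the actual WarmupOnce body): synthesize a large all-zero image, JPEG-encode it on the host with OpenCV, and push it once through the regular decode path so nvjpeg/npp allocate their internal device buffers up front. The `decode` callback stands in for the handle's real decode entry point, whose signature is not shown in the diff.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>
#include <opencv2/opencv.hpp>

void WarmupSketch(int warmup_size, size_t workspace_size,
                  const std::function<void(const unsigned char*, size_t)>& decode) {
  // Cap the warmup image so that warmup_size * warmup_size * 3 bytes fit in the workspace.
  warmup_size = std::min(static_cast<int>(std::sqrt(workspace_size / 3)), warmup_size);
  cv::Mat image = cv::Mat::zeros(cv::Size(warmup_size, warmup_size), CV_8UC3);
  std::vector<unsigned char> jpeg;
  cv::imencode(".jpg", image, jpeg);  // host-side JPEG bytes of the synthetic image
  decode(jpeg.data(), jpeg.size());   // first decode triggers the large nvjpeg allocations
}
```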
random_aspect_ratio: Optional[Sequence[float]] = None,
num_workers: Optional[int] = None,
warmup_size: Optional[int] = None,
max_num_pixels: Optional[int] = None,
num_workers, warmup_size, and max_num_pixels are all very important parameters. API docstrings need to be added here explaining what these parameters do and what values are reasonable.

A brief description of this op's design and how it works is also needed.
If possible, add a unit test after the PR is merged so that CI can help check it.
Adding a unit test also lets others see how this op is used. If that's inconvenient, then add an example under /oneflow/oneflow/python/test/customized (just paste the usage code from the first comment of this PR), like the new_data_pre.py I added earlier.
> num_workers, warmup_size, and max_num_pixels are all very important parameters. API docstrings need to be added here explaining what these parameters do and what values are reasonable. A brief description of this op's design and how it works is also needed.

The GPU path supports warmup: on the first run it can have nvjpeg allocate as much memory as it is likely to need, avoiding memory allocation while running.

> What is the warmup logic here? On the first run, does it grab all the remaining memory for decoding?

It has nvjpeg allocate enough memory in one go, avoiding later memory reallocation.

This new op should also be added to the CNN scripts in benchmark afterwards, right? @liujuncheng @ouyangyu Or has that already been updated?
    .RemoteBlobList()[0]
)
Why do we need an OFRecordBytesDecoder? Decoding directly from the bytes_list inside the OFRecord should also work, since the OFRecord is already in memory.
> Why do we need an OFRecordBytesDecoder? Decoding directly from the bytes_list inside the OFRecord should also work, since the OFRecord is already in memory.

In the future there may be more dataset formats to support, e.g. reading the ImageNet files directly instead of converting them to OFRecord, or reading TFRecord. Then only the corresponding bytes decoder needs to be implemented, without modifying every DecodeOp.

The change this causes is small; it's fine to adjust it after the merge.
};
for (int64_t i = 0; i < batch_size; ++i) {
  random_crop_generators_.at(i).reset(
      new RandomCropGenerator(aspect_ratio_range, area_range, seeds.at(i), conf.num_attempts()));
random_crop_generators_ is sized to batch_size; couldn't it also be sized to num_workers?

Does the GPU decoder also need multi-threaded decoding?
Is this version of the CPU decoder also a bit faster than the original CPU decoder?
> random_crop_generators_ is sized to batch_size; couldn't it also be sized to num_workers?
> Does the GPU decoder also need multi-threaded decoding?

It's sized to batch_size because the order in which workers process images is nondeterministic. One generator per batch index gives a deterministic crop_window (deterministic meaning: with the same seed, a given image in a batch always gets the same crop_window, so runs are reproducible). It also keeps the behavior consistent with the existing RandomCrop.
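To illustrate the reproducibility argument, a schematic sketch: any worker can compute the ROI for a given batch index because the per-index generator alone (with its seed) determines the result. The roi_y/roi_w/roi_h outputs are assumed for illustration, since the diff truncates the full GenerateRandomCropRoi parameter list.

```cpp
// Same seed + same image size => same ROI for batch index i, no matter which
// worker calls this or in what order the workers run.
void RoiForBatchIndex(const std::vector<std::shared_ptr<RandomCropGenerator>>& generators,
                      int batch_index, int width, int height,
                      int* roi_x, int* roi_y, int* roi_w, int* roi_h) {
  GenerateRandomCropRoi(generators.at(batch_index).get(), width, height, roi_x, roi_y,
                        roi_w, roi_h);
}
```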
> Does the GPU decoder also need multi-threaded decoding?

The GPU decoder also puts a heavy load on the CPU, just not as heavy as the pure-CPU path.
Channel<std::shared_ptr<Work>> work_queue_;
std::thread worker_thread_;

void PollWork(const std::function<std::shared_ptr<DecodeHandle>()>& handle_factory,
This involves std::thread, Channel, and BlockingCounter. Isn't that exactly what MultiThreadLoop does?

Keep the overall state in the kernel object or in local variables of ForwardDataContent, and have MultiThreadLoop fetch each piece of state by i; wouldn't that simplify most of the work here?
> This involves std::thread, Channel, and BlockingCounter. Isn't that exactly what MultiThreadLoop does?
> Keep the overall state in the kernel object or in local variables of ForwardDataContent, and have MultiThreadLoop fetch each piece of state by i; wouldn't that simplify most of the work here?

MultiThreadLoop suits the case where every sub-task is roughly balanced. Here the cost of each sub-task is very uneven and cannot be estimated accurately in advance.

For example, the largest image may be 8000x6000, and decoding that one image can take longer than decoding all the others combined. Ideally one worker should handle just that image while the other workers handle the rest; that is the point of sorting by size in descending order and handing out tasks through an atomic int.
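A compact sketch of that scheduling scheme, with Task's size field assumed for illustration: tasks are sorted by decreasing size, and each worker repeatedly claims the next task id with fetch_add, so a single 8000x6000 image naturally occupies one worker while the others drain the rest.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Sort so the largest (slowest-to-decode) images are handed out first.
void SortTasksBySizeDesc(std::vector<Task>* tasks) {
  std::sort(tasks->begin(), tasks->end(),
            [](const Task& a, const Task& b) { return a.data_size > b.data_size; });
}

// Each worker thread runs this loop: claim a task id, process it, repeat until
// the counter runs past the end of the task list.
void WorkerDrain(std::atomic<int>* task_counter, const std::vector<Task>& tasks) {
  while (true) {
    const int task_id = task_counter->fetch_add(1, std::memory_order_relaxed);
    if (task_id >= static_cast<int>(tasks.size())) { break; }  // nothing left to claim
    // decode / random-crop / resize tasks[task_id] with this worker's private handle ...
  }
}
```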
If the concern is that sharing the common compute thread pool hurts performance, we could add a thread-pool parameter to MultiThreadLoop: create a thread pool at kernel initialization and pass it to the later MultiThreadLoop calls.
You seem to have pointed out a real shortcoming of the current MultiThreadLoop. MultiThreadLoop should also be optimized in this direction in the future.
CHECK_EQ(status, ChannelStatus::kChannelStatusSuccess);
handle->WarmupOnce(warmup_size, work->workspace, work->workspace_size);
while (true) {
  const int task_id = work->task_counter->fetch_add(1, std::memory_order_relaxed);
All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?

Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?
> All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?
> Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?

Essentially it's a thread pool with a shared task queue. If we implemented a thread pool based on a shared, lock-free task queue, the effect would be the same.
> All workers compete for this atomic value here, and whoever grabs a task id handles it immediately, right?
> Isn't this just a single-producer, multi-consumer model where the total amount of work is known in advance?

Another reason is that the nvjpeg handles are not thread-safe.
* ofrecord_bytes_decoder op: decodes the bytes_list in an OFRecord into a TensorBuffer.
* image_decoder_random_crop_resize op: implements decode + random_crop + resize; the input is a TensorBuffer and the output has a static shape; currently only NHWC RGB is supported.
* image_decoder_random_crop_resize supports GPU on CUDA 10.2 and above, using nvjpeg for decode and npp for resize.
* DecodeH2D task node type: similar to CopyH2D, it consumes host memory and produces GPU memory; the GPU image_decoder_random_crop_resize runs on the DecodeH2D thread.

Usage

Performance
ResNet50 FP16, 8 * V100, bs=256 (GPU decoder vs. CPU decoder throughput).

Accuracy
ResNet50 FP16: 50 epochs 75.8%+, 90 epochs 77%+.

Known issues
May conflict with the current conv algorithm search; enabling the train_config.cudnn_conv_heuristic_search_algo(True) option may improve performance.

TODO