
Batch inference support #34

Open
jstumpin opened this issue Jun 23, 2020 · 26 comments

@jstumpin

How do we extend the inference function to support batchSize > 1? For batched inputs, I'm using OpenCV's blobFromImages. It seems to work just fine at batchSize = 1 (using a model/weights built with maxBatchSize of 2). But how do I parse the output? How do I get to the next batchId?

Thanks.

@CaoWGG
Owner

CaoWGG commented Jun 23, 2020

@jstumpin
You need to extend doNms and resizeAndNorm to support batch inference.

@jstumpin
Author

For post-processing I'm using OpenCV's NMS, and for pre-processing I'm using letterboxing from NVIDIA's original YOLO repo. I just couldn't figure out how to offset mCudaBuffers to get to the next batchId, since the number of detections is extracted from mCudaBuffers[1].

@CaoWGG
Owner

CaoWGG commented Jun 23, 2020

@jstumpin
You can refer to https://github.com/CaoWGG/TensorRT-YOLOv4/blob/4d7c2edce99e8794a4cb4ea3540d51ce91158a36/onnx-tensorrt/yolo.cu#L52

@jstumpin
Author

jstumpin commented Jun 23, 2020

If the yololayer already supports batching, then how do I get the number of detections (det) for each subsequent output in the batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and on the previous line, det is extracted from mCudaBuffers[1] with count = sizeof(int).

@yiwenwan2008

@jstumpin have you figured it out?

@yiwenwan2008

> If the yololayer already supports batching, then how do I get the number of detections (det) for each subsequent output in the batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and on the previous line, det is extracted from mCudaBuffers[1] with count = sizeof(int).

Did you set batchSize instead of 1 in mContext->execute(batchSize, &mCudaBuffers[0]);?
If batchSize doesn't equal 1, batchId is the key for grouping detections; refer to trt.cpp:L48: int batchId = temp[6];
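
In code, that grouping might look like this minimal sketch, assuming the layout described above (a detection count up front, then det records of 7 floats each, with the batch id in slot 6); the names outputData and batchSize come from the comments in this thread:

    #include <array>
    #include <vector>
    // Hedged sketch: split the flat detection list into per-image lists.
    // Record layout assumed: x, y, w, h, score, class, batch_id.
    std::vector<std::vector<std::array<float, 6>>>
    groupByBatch(const float* outputData, int batchSize)
    {
        // The repo copies the count out with count = sizeof(int); if the first
        // 4 bytes are raw int bits rather than a float, reinterpret instead.
        int det = static_cast<int>(outputData[0]);
        const float* temp = outputData + 1;
        std::vector<std::vector<std::array<float, 6>>> perImage(batchSize);
        for (int i = 0; i < det; ++i, temp += 7) {
            int batchId = static_cast<int>(temp[6]);      // trt.cpp:L48
            perImage[batchId].push_back({temp[0], temp[1], temp[2],
                                         temp[3], temp[4], temp[5]});
        }
        return perImage;
    }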

@yiwenwan2008

However, the results are not looking right even though I passed in exactly the same images.

@jstumpin
Author

@CaoWGG was right to point out data[6] = batch_id; that is literally the solution. There is no need to offset outputData to extract subsequent outputs, everything is already handled in the infer function. So here's what I did @yiwenwan2008:

  1. Convert the Darknet weights into TensorRT weights via buildEngine with maxBatchSize = 2;
  2. Clone an image and flip it vertically to emulate batchSize = 2;
  3. Letterbox the images via NVIDIA's original YOLO repo;
  4. Convert the vector of images into inputBlob via blobFromImages;
  5. Run the inference with batchSize = 2;
  6. Accumulate the output according to int(temp[6]), where float* temp = outputData.get() + 1 (see the sketch below).
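
In rough code, steps 4-6 might look like this; a sketch assuming this repo's mCudaBuffers/mContext names and a 416x416 input, not verbatim from my build:

    // Hedged sketch of the batch-2 flow described above.
    std::vector<cv::Mat> images = { img, flipped };            // step 2: two letterboxed images
    cv::Mat blob = cv::dnn::blobFromImages(images, 1.0 / 255.0,  // step 4
                                           cv::Size(416, 416),
                                           cv::Scalar(), true, false, CV_32F);
    CUDA_CHECK(cudaMemcpy(mCudaBuffers[0], blob.data,
                          blob.total() * sizeof(float),        // byte count, not element count
                          cudaMemcpyHostToDevice));
    mContext->execute(batchSize, &mCudaBuffers[0]);            // step 5: batched inference
    // step 6: copy mCudaBuffers[1] back to outputData, then group by int(temp[6])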

Thanks again @CaoWGG for this speedy wrapper (fastest so far on Windows)!

@yiwenwan2008

yiwenwan2008 commented Jul 13, 2020

@jstumpin Thank you for your solution. I will try your steps for batch > 1. Right now I am trying to make sure that at batch = 1 I am getting valid results. However, when I use blobFromImages I get different results than with resizeAndNorm(); is there anything that I overlooked? As you can see, the bounding box for the dog is not right and the confidence level also changed (the 'result' image is from blobFromImages()).

   cv::Mat blob = cv::dnn::blobFromImages(images, 1.0/255.0, cv::Size(inputDim.d[2], inputDim.d[1]), cv::Scalar(0, 0, 0), true, false, CV_32F);
    CUDA_CHECK(cudaMemcpy(mCudaBuffers[0], blob.data,
                          416*416*3*batchSize*sizeof(float),  // cudaMemcpy counts bytes, not floats
                          cudaMemcpyHostToDevice));

[image: result]
[image: valid_result]

@jstumpin
Author

jstumpin commented Jul 15, 2020

@yiwenwan2008 as mentioned previously, blobFromImages is used for converting the images into the input blob; that's what I'm using it for in pre-processing. Anyhow, here's the result:

[image: yoloeddog]

@yiwenwan2008

Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

@jstumpin
Author

@CaoWGG I had to reset mCudaBuffers whenever batchSize is switched (e.g. from 2 to 1) by re-running L161-L174. I don't have to re-initialize anything in the original NVIDIA YOLO repo. Although this re-initialization doesn't introduce any noticeable overhead, is there anything I can do to simplify things?

@yiwenwan2008

@jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm(), and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) each have a different resize effect on the input images.

@yiwenwan2008

@jstumpin Since I am also using Darknet to train models, I need to use Darknet's image preprocessing method. Thank you! And thank you for your great work and generous sharing, @CaoWGG.
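
For reference, an aspect-preserving letterbox in the spirit of Darknet/NVIDIA's YOLO preprocessing might look like this sketch; the 128 pad value and bilinear interpolation are assumptions to check against your training pipeline:

    #include <opencv2/opencv.hpp>
    // Hedged sketch of Darknet-style letterboxing: scale to fit, pad the rest.
    cv::Mat letterbox(const cv::Mat& img, int netW, int netH)
    {
        float scale = std::min(netW / (float)img.cols, netH / (float)img.rows);
        int newW = (int)(img.cols * scale);
        int newH = (int)(img.rows * scale);
        cv::Mat resized;
        cv::resize(img, resized, cv::Size(newW, newH), 0, 0, cv::INTER_LINEAR);
        // gray canvas, image centered; pad value assumed, match your training
        cv::Mat canvas(netH, netW, img.type(), cv::Scalar(128, 128, 128));
        resized.copyTo(canvas(cv::Rect((netW - newW) / 2, (netH - newH) / 2,
                                       newW, newH)));
        return canvas;
    }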

@jstumpin
Author

> @jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm(), and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) each have a different resize effect on the input images.

I'm sure it does; I never said it doesn't. Just clarifying that I didn't pass any additional parameters to blobFromImages to bypass its internal pre-processing steps.

> Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

I did a quick batchSize = 4 conversion and the results add up quite nicely.

@yiwenwan2008

@jstumpin It is so good to see batch inference working for you. Did you measure the speed gain from batch inference?

@yiwenwan2008

@jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4 while mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only get output from one binding instead of three...

@jstumpin
Author

> @jstumpin It is so good to see batch inference working for you. Did you measure the speed gain from batch inference?

I haven't done a full benchmark yet. I'm still planning to compare against https://github.com/enazoe/yolo-tensorrt and opencv/opencv#17795 (comment), with the latter looking more promising.

@jstumpin
Author

> @jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4 while mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only get output from one binding instead of three...

mCudaBuffers needs to be of size 4 because there is 1 input plus 3 YOLO output layers. As for why the output comes through a single binding rather than three (even the original NVIDIA repo does three D2H cudaMemcpy calls), I reckon the author would have the answer.
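
For illustration, a generic per-binding allocation (TensorRT 7-era API) that yields that 1-input + 3-output layout would look roughly like this sketch; it is not this repo's actual initEngine():

    // Hedged sketch: allocate one device buffer per engine binding.
    // Binding 0 is the input; bindings 1-3 would be the three YOLO heads.
    for (int i = 0; i < mEngine->getNbBindings(); ++i) {
        nvinfer1::Dims dims = mEngine->getBindingDimensions(i);
        size_t vol = maxBatchSize;                   // implicit-batch engine
        for (int d = 0; d < dims.nbDims; ++d) vol *= dims.d[d];
        void* devPtr = nullptr;
        CUDA_CHECK(cudaMalloc(&devPtr, vol * sizeof(float)));
        mCudaBuffers.push_back(devPtr);
    }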

@yiwenwan2008

@jstumpin I was also looking into the OpenCV DNN module, trying to make CUDA work; the FPS using the CPU is quite low, around 1 fps.

@spacewalk01

spacewalk01 commented Feb 15, 2021

> @jstumpin
> You need to extend doNms and resizeAndNorm to support batch inference.

@CaoWGG Hi, thank you for your wonderful implementation. I tried some preprocessing functions using OpenCV DNN, but I noticed that your resize-and-norm kernel implementation runs much faster than OpenCV dnn. As you know, there are two common ways to lay out the computation on the GPU, and I noticed that yours maps the 2D, 3-channel images onto a 1D grid, which works wonderfully. However, if I want to implement a preprocessing kernel (resizeAndNorm) for batched data, I wonder which grid, 1D or 2D, would be better. I would appreciate your suggestion, thank you.

    1D grid: idx = blockIdx.x * blockDim.x + threadIdx.x
    2D grid: idx = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x
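
For batched data, I'm considering something like the following 1D-grid sketch; the nearest-neighbor sampling and the HWC-input/NCHW-output layout are just assumptions to illustrate the indexing, not the repo's actual kernel:

    // Hedged sketch: batched resize + normalize with a 1D grid over every
    // output element. The batch index b falls out of the flat index, so
    // batching needs no extra grid dimension.
    __global__ void resizeAndNormBatch(const unsigned char* src, float* dst,
                                       int srcW, int srcH,
                                       int dstW, int dstH, int batch)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int total = batch * 3 * dstH * dstW;              // NCHW output
        if (idx >= total) return;
        int x = idx % dstW;
        int y = (idx / dstW) % dstH;
        int c = (idx / (dstW * dstH)) % 3;
        int b =  idx / (dstW * dstH * 3);
        int sx = min(int(x * (float)srcW / dstW), srcW - 1);
        int sy = min(int(y * (float)srcH / dstH), srcH - 1);
        const unsigned char* img = src + (size_t)b * srcH * srcW * 3;  // HWC input
        dst[idx] = img[(sy * srcW + sx) * 3 + c] / 255.0f;
    }
    // launch: one thread per output element
    // resizeAndNormBatch<<<(total + 255) / 256, 256>>>(...);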

Also, I noticed that doNms does not use a GPU kernel, and I would like to know the reason. Why not use the GPU for post-processing?

@jstumpin
Author

@yiwenwan2008 @batselem Perhaps consider NVIDIA's official support for YOLOv4: https://github.com/NVIDIA-AI-IOT/yolov4_deepstream (it includes GPU-based post-processing via the batchedNMSPlugin); a benchmark can be found here.

@spacewalk01

@jstumpin thank you for your suggestion. I will try it.

@spacewalk01

spacewalk01 commented Feb 18, 2021

@jstumpin I found out that in the implementation you suggested, the author uses CPU-side methods like cv::resize with cv::Size. I tried this approach before with both cv::cuda::resize and plain cv::resize, and they were both very slow.

    if (this->mParams.cocoTest)
    {
        for (int b = 0; b < inputB; ++b)
        {
            if (this->mImageIdx + b < this->mImageFiles.size())
            {
                cv::Mat test_img = cv::imread(this->mImageFiles[this->mImageIdx + b]);
                cv::Mat rgb_img;
                cv::cvtColor(test_img, rgb_img, cv::COLOR_BGR2RGB);
                cv::Mat pad_dst;
                cv::Scalar value(0, 0, 0);
                auto scaleSize = cv::Size(inputW, inputH);
                // ... (the CPU-side resize/pad of rgb_img into pad_dst continues here)
            }
        }
    }

@jstumpin
Author

@batselem For the given test.jpg (4134x1653) example here, cv::cuda::resize with the typical memory copies gives me 0.452 ms on average, versus cv::resize's 1.631 ms. The key to speed is minimizing overheads, namely H2D/D2H copies. Thus, even if GPU resize were merely on par with the CPU's, overall latency would normally favor the GPU as long as we keep processing persistently on one side of the hardware pipeline; e.g., for the said benchmark, cv::cudacodec::createVideoReader is used in lieu of cv::VideoCapture.
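
As a sketch of what "persistently on one side" means (file name and target size are placeholders):

    #include <opencv2/cudawarping.hpp>
    #include <opencv2/cudacodec.hpp>
    // Hedged sketch: decode and resize on the GPU so frames never round-trip
    // through host memory; only the detections come back D2H.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader(std::string("video.mp4")); // placeholder
    cv::cuda::GpuMat frame, resized;
    while (reader->nextFrame(frame)) {
        cv::cuda::resize(frame, resized, cv::Size(416, 416));       // on-device
        // hand `resized` straight to the TensorRT input buffer (no H2D copy)
    }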

@spacewalk01

Thanks, @jstumpin, I will consider your suggestion!
