
Batch inference support #34

Open
jstumpin opened this issue Jun 23, 2020 · 26 comments

@jstumpin

How do we extend the inference function to support batchSize > 1? For batched inputs, I'm using OpenCV's blobFromImages. It seems to work just fine at batchSize = 1 (using a model/weights built with maxBatchSize of 2). But how do I parse the output? How do I get to the next batchId?

Thanks.

@CaoWGG
Owner

CaoWGG commented Jun 23, 2020

@jstumpin
You need to extend doNms and resizeAndNorm to support batch inference.

@jstumpin
Author

For post-processing I'm using OpenCV's NMS, and for pre-processing I'm using letterboxing from NVIDIA's original YOLO repo. I just couldn't figure out how to offset mCudaBuffers to get to the next batchId, since the number of detections is extracted from mCudaBuffers[1].

@CaoWGG
Owner

CaoWGG commented Jun 23, 2020

@jstumpin
You can refer to https://github.com/CaoWGG/TensorRT-YOLOv4/blob/4d7c2edce99e8794a4cb4ea3540d51ce91158a36/onnx-tensorrt/yolo.cu#L52

@jstumpin
Author

jstumpin commented Jun 23, 2020

If the yololayer already supports batching, then how do I get the number of detections (det) for each subsequent output in the batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and on the previous line, det is extracted from mCudaBuffers[1] with count = sizeof(int).

@yiwenwan2008

@jstumpin have you figured it out?

@yiwenwan2008

> If the yololayer already supports batching, then how do I get the number of detections (det) for each subsequent output in the batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and on the previous line, det is extracted from mCudaBuffers[1] with count = sizeof(int).

Did you set batchSize instead of 1 in mContext->execute(batchSize, &mCudaBuffers[0]);?
If batchSize doesn't equal 1, batchId is the key for grouping detections; refer to trt.cpp:L48: int batchId = temp[6];
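
In code, that grouping might look like this minimal sketch, assuming the layout described above (a detection count up front, then det records of 7 floats each, with the batch id in slot 6); the names outputData and batchSize come from the comments in this thread:

    #include <array>
    #include <vector>
    // Hedged sketch: split the flat detection list into per-image lists.
    // Record layout assumed: x, y, w, h, score, class, batch_id.
    std::vector<std::vector<std::array<float, 6>>>
    groupByBatch(const float* outputData, int batchSize)
    {
        // The repo copies the count out with count = sizeof(int); if the first
        // 4 bytes are raw int bits rather than a float, reinterpret instead.
        int det = static_cast<int>(outputData[0]);
        const float* temp = outputData + 1;
        std::vector<std::vector<std::array<float, 6>>> perImage(batchSize);
        for (int i = 0; i < det; ++i, temp += 7) {
            int batchId = static_cast<int>(temp[6]);      // trt.cpp:L48
            perImage[batchId].push_back({temp[0], temp[1], temp[2],
                                         temp[3], temp[4], temp[5]});
        }
        return perImage;
    }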

@yiwenwan2008

However, the results are not looking right even though I passed in exactly the same images.

@jstumpin
Author

@CaoWGG was right to point out data[6] = batch_id; that is literally the solution. There is no need to offset outputData to extract subsequent outputs, everything is already handled in the infer function. So here's what I did @yiwenwan2008:

  1. Convert the Darknet weights into TensorRT weights via buildEngine with maxBatchSize = 2;
  2. Clone an image and flip it vertically to emulate batchSize = 2;
  3. Letterbox the images via NVIDIA's original YOLO repo;
  4. Convert the vector of images into inputBlob via blobFromImages;
  5. Run the inference with batchSize = 2;
  6. Accumulate the output according to int(temp[6]), where float* temp = outputData.get() + 1 (see the sketch below).
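
In rough code, steps 4-6 might look like this; a sketch assuming this repo's mCudaBuffers/mContext names and a 416x416 input, not verbatim from my build:

    // Hedged sketch of the batch-2 flow described above.
    std::vector<cv::Mat> images = { img, flipped };            // step 2: two letterboxed images
    cv::Mat blob = cv::dnn::blobFromImages(images, 1.0 / 255.0,  // step 4
                                           cv::Size(416, 416),
                                           cv::Scalar(), true, false, CV_32F);
    CUDA_CHECK(cudaMemcpy(mCudaBuffers[0], blob.data,
                          blob.total() * sizeof(float),        // byte count, not element count
                          cudaMemcpyHostToDevice));
    mContext->execute(batchSize, &mCudaBuffers[0]);            // step 5: batched inference
    // step 6: copy mCudaBuffers[1] back to outputData, then group by int(temp[6])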

Thanks again @CaoWGG for this speedy wrapper (fastest so far on Windows)!

@yiwenwan2008

yiwenwan2008 commented Jul 13, 2020

@jstumpin Thank you for your solution. I will try your steps for batch > 1. Right now I am trying to make sure that at batch = 1 I am getting valid results. However, when I use blobFromImages I get different results than with resizeAndNorm(); is there anything that I overlooked? As you can see, the bounding box for the dog is not right and the confidence level also changed (the 'result' image is from blobFromImages()).

   cv::Mat blob = cv::dnn::blobFromImages(images, 1.0/255.0, cv::Size(inputDim.d[2], inputDim.d[1]), cv::Scalar(0, 0, 0), true, false, CV_32F);
    CUDA_CHECK(cudaMemcpy(mCudaBuffers[0], blob.data,
                          416*416*3*batchSize*sizeof(float),  // cudaMemcpy counts bytes, not floats
                          cudaMemcpyHostToDevice));

[image: result]
[image: valid_result]

@jstumpin
Author

jstumpin commented Jul 15, 2020

@yiwenwan2008 as mentioned previously, blobFromImages is used for converting the images into the input blob; that's what I'm using it for in pre-processing. Anyhow, here's the result:

[image: yoloeddog]

@yiwenwan2008

Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

@jstumpin
Author

@CaoWGG I had to reset mCudaBuffers whenever batchSize is switched (e.g. from 2 to 1) by re-running L161-L174. I don't have to re-initialize anything in the original NVIDIA YOLO repo. Although this re-initialization doesn't introduce any noticeable overhead, is there anything I can do to simplify things?

@yiwenwan2008

@jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm(), and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) each have a different resize effect on the input images.

@yiwenwan2008

@jstumpin Since I am also using Darknet to train models, I need to use Darknet's image preprocessing method. Thank you! And thank you for your great work and generous sharing, @CaoWGG.
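
For reference, an aspect-preserving letterbox in the spirit of Darknet/NVIDIA's YOLO preprocessing might look like this sketch; the 128 pad value and bilinear interpolation are assumptions to check against your training pipeline:

    #include <opencv2/opencv.hpp>
    // Hedged sketch of Darknet-style letterboxing: scale to fit, pad the rest.
    cv::Mat letterbox(const cv::Mat& img, int netW, int netH)
    {
        float scale = std::min(netW / (float)img.cols, netH / (float)img.rows);
        int newW = (int)(img.cols * scale);
        int newH = (int)(img.rows * scale);
        cv::Mat resized;
        cv::resize(img, resized, cv::Size(newW, newH), 0, 0, cv::INTER_LINEAR);
        // gray canvas, image centered; pad value assumed, match your training
        cv::Mat canvas(netH, netW, img.type(), cv::Scalar(128, 128, 128));
        resized.copyTo(canvas(cv::Rect((netW - newW) / 2, (netH - newH) / 2,
                                       newW, newH)));
        return canvas;
    }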

@jstumpin
Author

> @jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm(), and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) each have a different resize effect on the input images.

I'm sure it does; I never said it doesn't. Just clarifying that I didn't pass any additional parameters to blobFromImages to bypass its internal pre-processing steps.

> Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

I did a quick batchSize = 4 conversion and the results add up quite nicely.

@yiwenwan2008

@jstumpin It is so good to see batch inference working for you. Did you measure the speed gain from batch inference?

@yiwenwan2008

@jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4 while mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only get output from one binding instead of three...

@jstumpin
Author

> @jstumpin It is so good to see batch inference working for you. Did you measure the speed gain from batch inference?

I haven't done a full benchmark yet. I'm still planning to compare against https://github.com/enazoe/yolo-tensorrt and opencv/opencv#17795 (comment), with the latter looking more promising.

@jstumpin
Author

> @jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4 while mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only get output from one binding instead of three...

mCudaBuffers needs to be of size 4 because there is 1 input plus 3 YOLO output layers. As for why the output comes through a single binding rather than three (even the original NVIDIA repo does three D2H cudaMemcpy calls), I reckon the author would have the answer.
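
For illustration, a generic per-binding allocation (TensorRT 7-era API) that yields that 1-input + 3-output layout would look roughly like this sketch; it is not this repo's actual initEngine():

    // Hedged sketch: allocate one device buffer per engine binding.
    // Binding 0 is the input; bindings 1-3 would be the three YOLO heads.
    for (int i = 0; i < mEngine->getNbBindings(); ++i) {
        nvinfer1::Dims dims = mEngine->getBindingDimensions(i);
        size_t vol = maxBatchSize;                   // implicit-batch engine
        for (int d = 0; d < dims.nbDims; ++d) vol *= dims.d[d];
        void* devPtr = nullptr;
        CUDA_CHECK(cudaMalloc(&devPtr, vol * sizeof(float)));
        mCudaBuffers.push_back(devPtr);
    }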

@yiwenwan2008

@jstumpin I was also looking into the OpenCV DNN module, trying to make CUDA work; the FPS using the CPU is quite low, around 1 fps.

@spacewalk01

spacewalk01 commented Feb 15, 2021

> @jstumpin
> You need to extend doNms and resizeAndNorm to support batch inference.

@CaoWGG Hi, thank you for your wonderful implementation. I tried some preprocessing functions using OpenCV DNN, but I noticed that your resize-and-norm kernel implementation runs much faster than OpenCV dnn. As you know, there are two common ways to lay out the computation on the GPU, and I noticed that yours maps the 2D, 3-channel images onto a 1D grid, which works wonderfully. However, if I want to implement a preprocessing kernel (resizeAndNorm) for batched data, I wonder which grid, 1D or 2D, would be better. I would appreciate your suggestion, thank you.

    1D grid: idx = blockIdx.x * blockDim.x + threadIdx.x
    2D grid: idx = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x
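
For batched data, I'm considering something like the following 1D-grid sketch; the nearest-neighbor sampling and the HWC-input/NCHW-output layout are just assumptions to illustrate the indexing, not the repo's actual kernel:

    // Hedged sketch: batched resize + normalize with a 1D grid over every
    // output element. The batch index b falls out of the flat index, so
    // batching needs no extra grid dimension.
    __global__ void resizeAndNormBatch(const unsigned char* src, float* dst,
                                       int srcW, int srcH,
                                       int dstW, int dstH, int batch)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int total = batch * 3 * dstH * dstW;              // NCHW output
        if (idx >= total) return;
        int x = idx % dstW;
        int y = (idx / dstW) % dstH;
        int c = (idx / (dstW * dstH)) % 3;
        int b =  idx / (dstW * dstH * 3);
        int sx = min(int(x * (float)srcW / dstW), srcW - 1);
        int sy = min(int(y * (float)srcH / dstH), srcH - 1);
        const unsigned char* img = src + (size_t)b * srcH * srcW * 3;  // HWC input
        dst[idx] = img[(sy * srcW + sx) * 3 + c] / 255.0f;
    }
    // launch: one thread per output element
    // resizeAndNormBatch<<<(total + 255) / 256, 256>>>(...);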

Also, I noticed that doNms does not use a GPU kernel, and I would like to know the reason. Why not use the GPU for post-processing?

@jstumpin
Author

@yiwenwan2008 @batselem Perhaps consider NVIDIA's official support for YOLOv4: https://github.com/NVIDIA-AI-IOT/yolov4_deepstream (it includes GPU-based post-processing via the batchedNMSPlugin); a benchmark can be found here.

@spacewalk01

@jstumpin thank you for your suggestion. I will try it.

@spacewalk01

spacewalk01 commented Feb 18, 2021

@jstumpin I found out that in the implementation you suggested, the author uses CPU-side methods like cv::resize with cv::Size. I tried this approach before with both cv::cuda::resize and plain cv::resize, and they were both very slow.

    if (this->mParams.cocoTest)
    {
        for (int b = 0; b < inputB; ++b)
        {
            if (this->mImageIdx + b < this->mImageFiles.size())
            {
                cv::Mat test_img = cv::imread(this->mImageFiles[this->mImageIdx + b]);
                cv::Mat rgb_img;
                cv::cvtColor(test_img, rgb_img, cv::COLOR_BGR2RGB);
                cv::Mat pad_dst;
                cv::Scalar value(0, 0, 0);
                auto scaleSize = cv::Size(inputW, inputH);
                // ... (the CPU-side resize/pad of rgb_img into pad_dst continues here)
            }
        }
    }

@jstumpin
Author

@batselem For the given test.jpg (4134x1653) example here, cv::cuda::resize with the typical memory copies gives me 0.452 ms on average, versus cv::resize's 1.631 ms. The key to speed is minimizing overheads, namely H2D/D2H copies. Thus, even if GPU resize were merely on par with the CPU's, overall latency would normally favor the GPU as long as we keep processing persistently on one side of the hardware pipeline; e.g., for the said benchmark, cv::cudacodec::createVideoReader is used in lieu of cv::VideoCapture.
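
As a sketch of what "persistently on one side" means (file name and target size are placeholders):

    #include <opencv2/cudawarping.hpp>
    #include <opencv2/cudacodec.hpp>
    // Hedged sketch: decode and resize on the GPU so frames never round-trip
    // through host memory; only the detections come back D2H.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader(std::string("video.mp4")); // placeholder
    cv::cuda::GpuMat frame, resized;
    while (reader->nextFrame(frame)) {
        cv::cuda::resize(frame, resized, cv::Size(416, 416));       // on-device
        // hand `resized` straight to the TensorRT input buffer (no H2D copy)
    }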

@spacewalk01

Thanks, @jstumpin, I will consider your suggestion!
