Batch inference support #34
@jstumpin
For post-processing I'm using OpenCV's NMS, and for pre-processing I'm using the letterboxing from NVIDIA's original YOLO repo. I just couldn't figure out how to offset mCudaBuffers to get to the next batchId, since the number of detections is extracted from mCudaBuffers[1].
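For readers following along, the letterbox pre-processing mentioned above can be sketched as below. This is a hypothetical helper, not the repo's or NVIDIA's actual code; `netW`/`netH` (the network input size) and the struct/function names are assumptions for illustration:

```cpp
#include <algorithm>

// Hypothetical letterbox helper: compute the aspect-preserving resize
// dimensions and the padding offsets that center an image inside a
// netW x netH network input (the scheme NVIDIA's YOLO samples use).
struct Letterbox {
    int resizeW, resizeH; // image size after aspect-preserving resize
    int padX, padY;       // left/top padding inside the network input
};

Letterbox computeLetterbox(int imgW, int imgH, int netW, int netH) {
    float scale = std::min(netW / static_cast<float>(imgW),
                           netH / static_cast<float>(imgH));
    Letterbox lb;
    lb.resizeW = static_cast<int>(imgW * scale);
    lb.resizeH = static_cast<int>(imgH * scale);
    lb.padX = (netW - lb.resizeW) / 2; // symmetric horizontal padding
    lb.padY = (netH - lb.resizeH) / 2; // symmetric vertical padding
    return lb;
}
```

The same offsets are needed again after inference to map detections back to original-image coordinates.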
@jstumpin
If yololayer readily supports batching, how do I get the number of detections (det) for each subsequent output in the batch? In yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and on the previous line det is extracted from mCudaBuffers[1] with count = sizeof(int).
@jstumpin have you figured it out?
Did you set batchSize instead of 1? mContext->execute(batchSize, &mCudaBuffers[0]);
However, the results don't look right even though I passed in exactly the same images.
@CaoWGG was right to point out data[6] = batch_id; that is literally the solution. There is no need to offset outputData to extract subsequent outputs; everything is already set up in the infer function. So here's what I did, @yiwenwan2008:
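The parsing step described here can be sketched as follows. The buffer layout (a leading detection count, then 7 floats per detection with batch_id at index 6) is inferred from this thread, so treat it as an assumption rather than the repo's exact format:

```cpp
#include <vector>

// Sketch of parsing a flat YOLO output buffer of the shape discussed in
// this thread (assumed layout):
//   out[0]               -> number of detections (stored as a float)
//   out[1 + i*7 .. +6]   -> detection i: x, y, w, h, conf, class, batch_id
struct Det { float x, y, w, h, conf; int cls, batchId; };

std::vector<std::vector<Det>> splitByBatch(const float* out, int batchSize) {
    std::vector<std::vector<Det>> perImage(batchSize);
    int det = static_cast<int>(out[0]);
    for (int i = 0; i < det; ++i) {
        const float* d = out + 1 + i * 7;
        Det box{d[0], d[1], d[2], d[3], d[4],
                static_cast<int>(d[5]), static_cast<int>(d[6])};
        if (box.batchId >= 0 && box.batchId < batchSize)
            perImage[box.batchId].push_back(box); // d[6] = batch_id
    }
    return perImage;
}
```

With this, each image in the batch gets its own detection list and no manual pointer offsetting into mCudaBuffers is needed.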
Thanks again @CaoWGG for this speedy wrapper (fastest so far on Windows)!
@jstumpin Thank you for your solution. I will try your steps for batch > 1; right now I am trying to make sure that when batch = 1 I get valid results. However, when I use blobFromImages I get different results than with resizeAndNorm(); is there anything I overlooked? As you can see, the bounding box for the dog is not right and the confidence level also changed (the result image above is from when blobFromImages() was used).
@yiwenwan2008 as mentioned previously, blobFromImages is used for converting the images into the input blob; I'm using it for pre-processing. Anyhow, here's the result:
Thanks! @jstumpin let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id is only ever 0 or 1 :(
@CaoWGG I had to reset mCudaBuffers whenever batchSize is switched (e.g. from 2 to 1) by re-running L161-L174. I don't have to re-initialize anything in the original NVIDIA YOLO repo. Although there isn't any noticeable overhead from this re-initialization, is there anything I can do to simplify things?
@jstumpin I do think the image resize method matters. We are getting different results because blobFromImage(), resizeAndNorm(), and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) have different resize effects on the input images.
I'm sure it does; I never said it doesn't. I was just clarifying that I didn't pass any additional parameters to blobFromImage to bypass its internal pre-processing steps.
I did a quick batchSize = 4 conversion and the results add up quite nicely.
@jstumpin it is so good to see batch inference working for you. Did you measure the speed gain from batch inference?
@jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4, and why mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only have output for one binding instead of three...
Haven't done the full benchmark. I still have to compare against https://github.com/enazoe/yolo-tensorrt and opencv/opencv#17795 (comment), with the latter seeming more promising.
mCudaBuffers needs to be of size 4 because there is 1 input plus 3 YOLO output layers. As for the output having a single binding instead of three (even the original NVIDIA repo does three D2H cudaMemcpy calls), I reckon the author would have the answer.
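To make the sizing concrete: in TensorRT's implicit-batch mode, each of those 4 bindings is typically allocated as maxBatchSize * volume(bindingDims) elements. The sketch below shows only that arithmetic; the example dims are illustrative (YOLOv3-416-style), not values read from an actual engine:

```cpp
#include <cstddef>
#include <vector>

// Sketch: with 1 input binding + 3 YOLO output bindings (nbBindings = 4),
// each device buffer is sized as maxBatchSize * volume(bindingDims) floats.
// The caller would obtain the real dims from the engine; these are assumed.
std::size_t bindingBytes(const std::vector<int>& dims, int maxBatch) {
    std::size_t vol = 1;
    for (int d : dims) vol *= static_cast<std::size_t>(d);
    return vol * static_cast<std::size_t>(maxBatch) * sizeof(float);
}
```

If buffers were allocated for maxBatchSize = 1 and the engine is later run with a larger batch, they would be too small, which is consistent with having to re-initialize the buffers when batchSize changes.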
@jstumpin I was also looking into the OpenCV DNN module, trying to make CUDA work; the FPS using the CPU is quite low, around 1 FPS.
@CaoWGG Hi, thank you for your wonderful implementation. I tried some preprocessing functions using OpenCV DNN, but I noticed that your resize-and-norm kernel runs much faster than OpenCV dnn. As you know, there are two ways to lay out the computation on the GPU. I noticed that yours flattens the 2D, 3-channel images into a 1D grid, which works wonderfully. However, if I want to implement a preprocessing kernel (resizeAndNorm) for batched data, which would be better: a 1D grid or a 2D grid? I would appreciate your suggestion, thank you.
Also, I noticed that doNms does not use a GPU kernel, and I would like to know why. Why not use the GPU for post-processing?
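On the 1D-vs-2D grid question: one common option is to keep a 1D grid and simply fold the batch index into the flat NCHW offset, so each thread decodes its own (batch, channel, y, x) from a single linear index. That index math is sketched below in plain C++ so it can be checked on its own; it is illustrative, not the repo's kernel:

```cpp
// Index math for a batched NCHW blob addressed by one linear index, as a
// 1D-grid batched kernel thread might do (illustrative sketch only).
struct Coord { int n, c, y, x; };

Coord decode(long idx, int C, int H, int W) {
    Coord p;
    p.x = static_cast<int>(idx % W);  idx /= W;  // fastest-varying: width
    p.y = static_cast<int>(idx % H);  idx /= H;  // then height
    p.c = static_cast<int>(idx % C);  idx /= C;  // then channel
    p.n = static_cast<int>(idx);                 // slowest-varying: batch
    return p;
}
```

With this scheme a single-image 1D kernel generalizes to batches by launching N * C * H * W threads instead of C * H * W.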
@yiwenwan2008 @batselem Perhaps consider NVIDIA's official support for YOLOv4: https://github.com/NVIDIA-AI-IOT/yolov4_deepstream (it includes GPU-based post-processing via the batchedNMSPlugin); a benchmark can be found here.
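For context on the post-processing question above, the CPU-side greedy NMS that a plugin like batchedNMSPlugin moves onto the GPU looks roughly like this generic sketch (not the repo's doNms; box format and thresholds are assumptions):

```cpp
#include <algorithm>
#include <vector>

// Minimal greedy NMS over [x1, y1, x2, y2, score] boxes (generic sketch).
struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box& a, const Box& b) {
    float ix = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float iy = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = ix * iy;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

std::vector<Box> nms(std::vector<Box> boxes, float iouThresh) {
    // Sort by descending confidence, then keep each box only if it does
    // not overlap an already-kept box beyond the IoU threshold.
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        bool keep = true;
        for (const Box& k : kept)
            if (iou(b, k) > iouThresh) { keep = false; break; }
        if (keep) kept.push_back(b);
    }
    return kept;
}
```

The pairwise IoU loop is what makes a GPU version attractive at large batch sizes, since the comparisons parallelize well.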
@jstumpin thank you for your suggestion. I will try it.
@jstumpin I found that in the implementation you suggested, the author uses methods like cv::resize (with cv::Size), which run on the CPU. I tried this approach before with both cv::cuda::resize and cv::resize, and they were both very slow.
@batselem For the given test.jpg (4134x1653) example here, cv::cuda::resize with the typical memory copy gives me 0.452 ms on average, versus 1.631 ms for cv::resize. The key to speed is minimizing overheads, namely H2D/D2H copies. Thus, even if GPU resize were only at par with the CPU's, overall latency would normally favor the GPU as long as we keep processing persistently on one side of the hardware pipeline; e.g., for the said benchmark, cv::cudacodec::createVideoReader is used in lieu of cv::VideoCapture.
Thanks, @jstumpin, I will consider your suggestion!
How do we extend the inference function to support batchSize > 1? For batched inputs I'm using OpenCV's blobFromImages. It seems to work just fine with batchSize = 1 (using model/weights with a maxBatchSize of 2). But how do I parse the output? How do I get to the subsequent batchId?
Thanks.