
@SrivastavaKshitij (Contributor)

Hi @jaybdub:

Hope you are doing well. This PR:

  1. fixes the documentation for the QAT section
  2. makes the output of test.py much more readable
  3. adds a pSNR (peak signal-to-noise ratio) test for all the converters

I have been using the pSNR test for quite some time to check the correctness of TRT conversions instead of max difference, because pSNR is a more robust test than max difference.

There is an issue with the interpolate layer in TRT >= 7 when align_corners=True. I added a unit test with a similar set of parameters, for which internal model metrics were showing a regression. The unit test passed the repo's max-difference test but failed the pSNR test. I will come back to this test case at the end.

Usually, when we calculate pSNR at FP32, if pSNR >= 100 dB it is safe to say that the conversion was fine. The best case is when pSNR = NaN, i.e., the MSE (mean squared error) was zero, which makes pSNR infinite and means that the two output tensors (PyTorch model output and TRT model output) were identical.
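
For reference, a minimal sketch of such a pSNR check (the helper name psnr_db and the choice of the reference tensor's peak value are my assumptions, not necessarily what the PR implements):

```python
import math
import torch

def psnr_db(y_ref: torch.Tensor, y_test: torch.Tensor) -> float:
    """Peak signal-to-noise ratio between a reference and a test tensor, in dB.

    Returns inf when the MSE is zero, i.e. the two outputs are identical.
    """
    mse = torch.mean((y_ref - y_test) ** 2).item()
    if mse == 0.0:
        return float("inf")
    peak = y_ref.abs().max().item()  # take the reference output's dynamic range as the peak
    return 10.0 * math.log10(peak ** 2 / mse)
```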

Although I haven't added a pSNR test at FP16 to the repo, I can summarize my findings here so that, if you like, we can add those later on.

Inspired by image-processing basics:

  1. If pSNR at FP32 >= 100 dB, we can safely say that the conversion is good.
  2. pSNR at FP16 ~= (pSNR at FP32) / 2 - x, where x ~= 10 dB.

Based on numerous experiments that I ran, x is around 10 dB and accounts for variance.

There is no mathematical explanation for this, only a toolchain one.
TRT fuses layers and uses different kernels to execute the same layers after fusion, so these optimizations introduce some bias (as one would call it). However, if you look at the model as a whole, the net effect is zero.
Since we are comparing an unoptimized network with an optimized one, value by value, we can see that difference in the pSNR test. However, it will not show up when we run a PR.

For example:

pSNR at FP32 = 120 dB
pSNR at FP16 = 120 / 2 - 10 = 50 dB

For FP32, if pSNR is less than 100 dB (regardless of the above example), then there is a conversion issue.
For FP16, if pSNR is less than 50 dB (in the above example), then there is a conversion issue.

pSNR will drop if the MSE is high.
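
To make the rule concrete, the thresholds above could be folded into a hypothetical helper like this (the function name, the Optional FP16 argument, and the 10 dB margin default are my own, following the heuristic described earlier):

```python
from typing import Optional

def conversion_ok(psnr_fp32_db: float, psnr_fp16_db: Optional[float] = None) -> bool:
    """Pass/fail based on the rule-of-thumb thresholds discussed above."""
    if psnr_fp32_db < 100.0:                  # FP32: below 100 dB means a conversion issue
        return False
    if psnr_fp16_db is not None:
        expected = psnr_fp32_db / 2.0 - 10.0  # pSNR(FP16) ~= pSNR(FP32) / 2 - 10 dB
        if psnr_fp16_db < expected:           # FP16 falls short of the expected floor
            return False
    return True
```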

Now, back to the interpolate conversion issue:

| Name | Data Type | Input Shapes | torch2trt kwargs | Max Error | pSNR (dB) | MSE | FPS (PyTorch) | FPS (TensorRT) | Latency (PyTorch, ms) | Latency (TensorRT, ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| torch2trt.converters.interpolate.test_bilinear_mode | float32 | [(1, 4, 12, 12)] | {} | 2.38E-07 | 160.7023 | 0.0000 | 4.08e+04 | 1.58e+04 | 0.0425 | 0.0799 |
| torch2trt.converters.interpolate.test_align_corner | float32 | [(1, 3, 12, 12)] | {} | 1.90E+00 | 14.5676 | 0.2077 | 4.21e+04 | 1.58e+04 | 0.0423 | 0.0792 |
| torch2trt.converters.interpolate.test_align_corner_functional | float32 | [(1, 3, 12, 12)] | {} | 2.01E+00 | 16.1512 | 0.1890 | 4.09e+04 | 1.54e+04 | 0.0433 | 0.0819 |
| torch2trt.converters.interpolate.test_bilinear_mode_odd_input_shape | float32 | [(1, 5, 13, 13)] | {} | 2.38E-07 | 155.6233 | 0.0000 | 4.2e+04 | 1.59e+04 | 0.0419 | 0.0798 |

As you can see, all four tests pass the max-difference test. However, the middle two tests are cases where align_corners=True and pSNR < 20 dB at FP32, which indicates a problem in the conversion and leads to degradation in model metrics. The first and fourth tests have align_corners=False, and their pSNR is well above 100 dB.
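
For context, a minimal reproduction of the failing configuration might look like the sketch below (my own illustration, not the exact unit test from the PR; it computes pSNR inline rather than reusing the psnr_db helper sketched above):

```python
import math
import torch
from torch2trt import torch2trt

# Bilinear resize with align_corners=True: the configuration that regresses on TRT >= 7
model = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True).cuda().eval()
x = torch.randn(1, 3, 12, 12).cuda()

model_trt = torch2trt(model, [x])

y, y_trt = model(x), model_trt(x)
mse = torch.mean((y - y_trt) ** 2).item()  # nonzero in the failing case
peak = y.abs().max().item()
print(10 * math.log10(peak ** 2 / mse))    # lands well below 100 dB on affected TRT versions
```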

I have already raised the issue with NVIDIA and it is being looked into. I think the pSNR test will make our unit tests more robust, and we should definitely add it.

Let me know if you have any questions.

Thanks

Kshitij Srivastava

@jaybdub (Contributor) commented Aug 9, 2021

Hi @SrivastavaKshitij ,

Thanks for sharing this. I just reviewed the PR, and it looks good to me.

I'm going to test it quickly and, assuming everything goes smoothly, it should be good to merge.

Best,
John

jaybdub merged commit 311f328 into NVIDIA-AI-IOT:master on Aug 9, 2021