@alexcovington
Contributor

This is a follow-up PR to #111505 that extends the vectorized division logic to also support byte, sbyte, short, ushort, and uint.

I'm seeing good performance gains in microbenchmarks, roughly 5x to 15x faster depending on the element type:

| Namespace                       | Type                     | Method                    | Job        | Toolchain                   | Mean       | Error     | StdDev    | Median     | Min        | Max        | Ratio | Allocated | Alloc Ratio |
|-------------------------------- |------------------------- |-------------------------- |----------- |---------------------------- |-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|----------:|------------:|
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Byte>   | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 16.0008 ns | 0.0905 ns | 0.0756 ns | 15.9842 ns | 15.9182 ns | 16.1866 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Byte>   | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.2485 ns | 0.0214 ns | 0.0200 ns |  2.2437 ns |  2.2218 ns |  2.2803 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Int16>  | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 11.1451 ns | 0.0566 ns | 0.0502 ns | 11.1348 ns | 11.0775 ns | 11.2137 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Int16>  | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.9565 ns | 0.0171 ns | 0.0160 ns |  0.9524 ns |  0.9344 ns |  0.9900 ns |  0.09 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<SByte>  | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.9213 ns | 0.0428 ns | 0.0379 ns | 15.9165 ns | 15.8495 ns | 15.9954 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<SByte>  | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.9481 ns | 0.0123 ns | 0.0103 ns |  2.9515 ns |  2.9259 ns |  2.9597 ns |  0.19 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<UInt32> | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 11.5403 ns | 0.0746 ns | 0.0697 ns | 11.5224 ns | 11.4555 ns | 11.6713 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<UInt32> | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.7485 ns | 0.0115 ns | 0.0096 ns |  0.7452 ns |  0.7407 ns |  0.7744 ns |  0.06 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Byte>      | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 33.6398 ns | 0.2243 ns | 0.1873 ns | 33.6597 ns | 33.2626 ns | 33.8734 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Byte>      | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  5.5416 ns | 0.0196 ns | 0.0184 ns |  5.5399 ns |  5.5022 ns |  5.5758 ns |  0.16 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Int16>     | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.9211 ns | 0.0556 ns | 0.0493 ns | 15.8951 ns | 15.8673 ns | 16.0357 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Int16>     | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.2388 ns | 0.0111 ns | 0.0104 ns |  2.2361 ns |  2.2223 ns |  2.2560 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<SByte>     | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 33.1180 ns | 0.1015 ns | 0.0899 ns | 33.1308 ns | 32.9875 ns | 33.2903 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<SByte>     | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  5.7022 ns | 0.0679 ns | 0.0635 ns |  5.6862 ns |  5.6420 ns |  5.8356 ns |  0.17 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<UInt16>    | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.8871 ns | 0.0765 ns | 0.0716 ns | 15.8697 ns | 15.7895 ns | 16.0494 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<UInt16>    | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.2054 ns | 0.0261 ns | 0.0244 ns |  2.1960 ns |  2.1805 ns |  2.2591 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<UInt32>    | DivisionOperatorBenchmark | Job-YVTVXH | \base\Core_Root\corerun.exe |  7.5162 ns | 0.0338 ns | 0.0300 ns |  7.5140 ns |  7.4812 ns |  7.5817 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<UInt32>    | DivisionOperatorBenchmark | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.7454 ns | 0.0043 ns | 0.0040 ns |  0.7456 ns |  0.7390 ns |  0.7511 ns |  0.10 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Byte>   | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.8781 ns | 0.0791 ns | 0.0701 ns | 15.8531 ns | 15.8000 ns | 16.0532 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Byte>   | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.2249 ns | 0.0141 ns | 0.0125 ns |  2.2210 ns |  2.2098 ns |  2.2495 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Int16>  | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 11.2452 ns | 0.0603 ns | 0.0534 ns | 11.2517 ns | 11.1700 ns | 11.3383 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Int16>  | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.9297 ns | 0.0126 ns | 0.0112 ns |  0.9264 ns |  0.9143 ns |  0.9549 ns |  0.08 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<SByte>  | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.9201 ns | 0.0356 ns | 0.0316 ns | 15.9198 ns | 15.8758 ns | 15.9736 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<SByte>  | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.6696 ns | 0.0028 ns | 0.0023 ns |  2.6688 ns |  2.6667 ns |  2.6738 ns |  0.17 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<UInt32> | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 11.4360 ns | 0.0445 ns | 0.0394 ns | 11.4266 ns | 11.3712 ns | 11.5177 ns |  1.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<UInt32> | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.7552 ns | 0.0091 ns | 0.0081 ns |  0.7511 ns |  0.7465 ns |  0.7703 ns |  0.07 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Byte>      | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 33.6008 ns | 0.1484 ns | 0.1315 ns | 33.6221 ns | 33.3495 ns | 33.8361 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Byte>      | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  5.5425 ns | 0.0301 ns | 0.0282 ns |  5.5451 ns |  5.5078 ns |  5.5862 ns |  0.16 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Int16>     | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 15.9299 ns | 0.0676 ns | 0.0633 ns | 15.9232 ns | 15.8214 ns | 16.0419 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Int16>     | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.2475 ns | 0.0146 ns | 0.0130 ns |  2.2469 ns |  2.2222 ns |  2.2770 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<SByte>     | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 33.1901 ns | 0.1294 ns | 0.1211 ns | 33.2244 ns | 33.0419 ns | 33.4277 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<SByte>     | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  5.6573 ns | 0.0304 ns | 0.0270 ns |  5.6426 ns |  5.6336 ns |  5.7201 ns |  0.17 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<UInt16>    | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe | 16.1111 ns | 0.0503 ns | 0.0470 ns | 16.1082 ns | 16.0469 ns | 16.2032 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<UInt16>    | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  2.1842 ns | 0.0143 ns | 0.0134 ns |  2.1828 ns |  2.1639 ns |  2.2116 ns |  0.14 |         - |          NA |
|                                 |                          |                           |            |                             |            |           |           |            |            |            |       |           |             |
| System.Numerics.Tests           | Perf_VectorOf<UInt32>    | DivideBenchmark           | Job-YVTVXH | \base\Core_Root\corerun.exe |  7.5017 ns | 0.0282 ns | 0.0250 ns |  7.5016 ns |  7.4692 ns |  7.5391 ns |  1.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<UInt32>    | DivideBenchmark           | Job-LKDQFQ | \diff\Core_Root\corerun.exe |  0.7460 ns | 0.0068 ns | 0.0057 ns |  0.7450 ns |  0.7355 ns |  0.7582 ns |  0.10 |         - |          NA |
Disasm

System.Numerics.Tests.Perf_VectorOf<Byte>.DivisionOperatorBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Byte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D14710]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D14714]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movzx     eax,byte ptr [rax+r10]
       lea       rdx,[rsp]
       movzx     r9d,byte ptr [rdx+r10]
       xor       edx,edx
       div       r9d
       lea       rdx,[rsp+40]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,20
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Byte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,48
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       vpmovzxbd zmm0,xmmword ptr [7FF99D580060]
       vmovaps   zmm1,zmm0
       vmovaps   zmm2,zmm1
       vpmovzxbd zmm3,xmmword ptr [7FF99D580070]
       vmovaps   zmm4,zmm3
       vmovaps   zmm5,zmm4
       vxorpd    ymm6,ymm6,ymm6
       vpcmpeqd  ymm6,ymm6,ymm5
       vptest    ymm6,ymm6
       jne       near ptr M00_L00
       vcvtudq2pd zmm6,ymm2
       vcvtudq2pd zmm7,ymm5
       vdivpd    zmm2,zmm6,zmm7
       vcvttpd2udq ymm2,zmm2
       vextracti32x8 ymm1,zmm1,1
       vextracti32x8 ymm4,zmm4,1
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm4
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vcvtudq2pd zmm5,ymm1
       vcvtudq2pd zmm6,ymm4
       vdivpd    zmm16,zmm5,zmm6
       vcvttpd2udq ymm16,zmm16
       vinserti32x8 zmm2,zmm2,ymm16,1
       vpmovdb   xmm2,zmm2
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm3
       vptest    ymm5,ymm5
       jne       short M00_L00
       vcvtudq2pd zmm5,ymm0
       vcvtudq2pd zmm6,ymm3
       vdivpd    zmm0,zmm5,zmm6
       vcvttpd2udq ymm0,zmm0
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm1
       vcvtudq2pd zmm5,ymm4
       vdivpd    zmm1,zmm3,zmm5
       vcvttpd2udq ymm1,zmm1
       vinserti32x8 zmm0,zmm0,ymm1,1
       vpmovdb   xmm0,zmm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3
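
For readers who do not want to trace the registers, here is a minimal scalar sketch in Python of what the Diff listing above appears to do for `byte`: zero-extend each lane to 32 bits (`vpmovzxbd`), throw if any divisor lane is zero (`vpcmpeqd` + `vptest` + `jne`), divide as doubles (`vcvtudq2pd` + `vdivpd`), truncate back to an integer (`vcvttpd2udq`), and narrow to 8 bits (`vpmovdb`). The function name `divide_bytes_via_double` is hypothetical and the loop only models the per-lane behavior; it is not the JIT's implementation.

```python
import math


def divide_bytes_via_double(dividends, divisors):
    """Scalar model of the widened byte division; one iteration per vector lane.

    Hypothetical helper for illustration only, not code from the PR.
    """
    # Zero check across all lanes first: any zero divisor throws.
    if any((b & 0xFF) == 0 for b in divisors):
        raise ZeroDivisionError("at least one divisor lane is zero")

    result = []
    for a, b in zip(dividends, divisors):
        wide_a = a & 0xFF                      # zero-extend to a 32-bit lane
        wide_b = b & 0xFF
        quotient = float(wide_a) / float(wide_b)   # divide as doubles
        truncated = math.trunc(quotient)       # convert back with truncation
        result.append(truncated & 0xFF)        # narrow to 8 bits
    return result


if __name__ == "__main__":
    print(divide_bytes_via_double([255, 200, 7, 0], [1, 3, 7, 9]))
```

Every 8-bit value is exactly representable as a double and the quotients are far from the next integer relative to double precision, so truncating the double quotient gives the same result as integer division for all byte inputs.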

System.Numerics.Tests.Perf_VectorOf<Byte>.DivideBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Byte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8CF4850]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8CF4854]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movzx     eax,byte ptr [rax+r10]
       lea       rdx,[rsp]
       movzx     r9d,byte ptr [rdx+r10]
       xor       edx,edx
       div       r9d
       lea       rdx,[rsp+40]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,20
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Byte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,48
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       vpmovzxbd zmm0,xmmword ptr [7FF99D574A20]
       vmovaps   zmm1,zmm0
       vmovaps   zmm2,zmm1
       vpmovzxbd zmm3,xmmword ptr [7FF99D574A30]
       vmovaps   zmm4,zmm3
       vmovaps   zmm5,zmm4
       vxorpd    ymm6,ymm6,ymm6
       vpcmpeqd  ymm6,ymm6,ymm5
       vptest    ymm6,ymm6
       jne       near ptr M00_L00
       vcvtudq2pd zmm6,ymm2
       vcvtudq2pd zmm7,ymm5
       vdivpd    zmm2,zmm6,zmm7
       vcvttpd2udq ymm2,zmm2
       vextracti32x8 ymm1,zmm1,1
       vextracti32x8 ymm4,zmm4,1
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm4
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vcvtudq2pd zmm5,ymm1
       vcvtudq2pd zmm6,ymm4
       vdivpd    zmm16,zmm5,zmm6
       vcvttpd2udq ymm16,zmm16
       vinserti32x8 zmm2,zmm2,ymm16,1
       vpmovdb   xmm2,zmm2
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm3
       vptest    ymm5,ymm5
       jne       short M00_L00
       vcvtudq2pd zmm5,ymm0
       vcvtudq2pd zmm6,ymm3
       vdivpd    zmm0,zmm5,zmm6
       vcvttpd2udq ymm0,zmm0
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm1
       vcvtudq2pd zmm5,ymm4
       vdivpd    zmm1,zmm3,zmm5
       vcvttpd2udq ymm1,zmm1
       vinserti32x8 zmm0,zmm0,ymm1,1
       vpmovdb   xmm0,zmm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<Int16>.DivisionOperatorBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Int16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D14710]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D14714]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movsx     rax,word ptr [rax+r10*2]
       lea       rdx,[rsp]
       movsx     r9,word ptr [rdx+r10*2]
       cdq
       idiv      r9d
       cwde
       lea       rdx,[rsp+40]
       mov       [rdx+r10*2],ax
       inc       r8d
       cmp       r8d,10
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Int16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpmovsxwd ymm0,xmmword ptr [7FF99D5545E0]
       vpmovsxwd ymm1,xmmword ptr [7FF99D5545F0]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       near ptr M00_L00
       vpcmpeqd  ymm2,ymm0,[7FF99D554620]
       vpcmpeqd  ymm3,ymm1,[7FF99D554600]
       vpand     ymm2,ymm2,ymm3
       vptest    ymm2,ymm2
       jne       near ptr M00_L01
       vcvtdq2pd zmm2,ymm0
       vcvtdq2pd zmm3,ymm1
       vdivpd    zmm4,zmm2,zmm3
       vcvttpd2dq ymm4,zmm4
       vpmovdw   xmm2,ymm4
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm1
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm0,[7FF99D554620]
       vpcmpeqd  ymm4,ymm1,[7FF99D554600]
       vpand     ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm0
       vcvtdq2pd zmm4,ymm1
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2dq ymm0,zmm0
       vpmovdw   xmm0,ymm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3
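
The signed Diff listing above adds a second guard before the divide. A rough Python model of the `short` path follows, under two assumptions that the listing itself does not confirm: the two memory constants used by the extra `vpcmpeqd` pair are taken to be the 32-bit minimum value and `-1` (the `MinValue / -1` overflow case), and the name `divide_int16_via_double` is hypothetical. What the listing does show is sign extension to 32-bit lanes (`vpmovsxwd`), a truncating conversion (`vcvttpd2dq`), and a non-saturating narrow (`vpmovdw`).

```python
import math

INT32_MIN = -2**31  # assumed value of one comparison constant (not shown in the listing)


def wrap_int16(value):
    """Keep the low 16 bits and reinterpret them as signed (non-saturating narrow)."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def divide_int16_via_double(dividends, divisors):
    """Scalar model of the widened short division; hypothetical, for illustration."""
    lanes = list(zip(dividends, divisors))
    # Guard 1: any zero divisor lane throws.
    if any(b == 0 for _, b in lanes):
        raise ZeroDivisionError("at least one divisor lane is zero")
    # Guard 2 (assumed): 32-bit MinValue / -1 overflows. Sign-extended 16-bit
    # inputs can never equal INT32_MIN, so this cannot fire for short lanes.
    if any(a == INT32_MIN and b == -1 for a, b in lanes):
        raise OverflowError("MinValue / -1")

    result = []
    for a, b in lanes:
        quotient = a / b                       # true division as doubles
        truncated = math.trunc(quotient)       # round toward zero, like idiv
        result.append(wrap_int16(truncated))   # narrow back to 16 bits
    return result


if __name__ == "__main__":
    print(divide_int16_via_double([-7, 7, -32768, 100], [2, -2, -1, 7]))
```

Note that truncation toward zero matters for negative quotients: `-7 / 2` yields `-3`, matching the `idiv` result in the Base listing, whereas Python's `//` operator would floor to `-4`.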

System.Numerics.Tests.Perf_VectorOf<Int16>.DivideBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Int16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D14850]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D14854]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movsx     rax,word ptr [rax+r10*2]
       lea       rdx,[rsp]
       movsx     r9,word ptr [rdx+r10*2]
       cdq
       idiv      r9d
       cwde
       lea       rdx,[rsp+40]
       mov       [rdx+r10*2],ax
       inc       r8d
       cmp       r8d,10
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.Int16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpmovsxwd ymm0,xmmword ptr [7FF99D574800]
       vpmovsxwd ymm1,xmmword ptr [7FF99D574810]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       near ptr M00_L00
       vpcmpeqd  ymm2,ymm0,[7FF99D574840]
       vpcmpeqd  ymm3,ymm1,[7FF99D574820]
       vpand     ymm2,ymm2,ymm3
       vptest    ymm2,ymm2
       jne       near ptr M00_L01
       vcvtdq2pd zmm2,ymm0
       vcvtdq2pd zmm3,ymm1
       vdivpd    zmm4,zmm2,zmm3
       vcvttpd2dq ymm4,zmm4
       vpmovdw   xmm2,ymm4
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm1
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm0,[7FF99D574840]
       vpcmpeqd  ymm4,ymm1,[7FF99D574820]
       vpand     ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm0
       vcvtdq2pd zmm4,ymm1
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2dq ymm0,zmm0
       vpmovdw   xmm0,ymm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<SByte>.DivisionOperatorBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.SByte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D04708]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D0470C]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movsx     rax,byte ptr [rax+r10]
       lea       rdx,[rsp]
       movsx     r9,byte ptr [rdx+r10]
       cdq
       idiv      r9d
       lea       rdx,[rsp+40]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,20
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.SByte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,48
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       vpmovsxbd zmm0,xmmword ptr [7FF99D564960]
       vmovaps   zmm1,zmm0
       vmovaps   zmm2,zmm1
       vpmovsxbd zmm3,xmmword ptr [7FF99D564970]
       vmovaps   zmm4,zmm3
       vmovaps   zmm5,zmm4
       vxorpd    ymm6,ymm6,ymm6
       vpcmpeqd  ymm6,ymm6,ymm5
       vptest    ymm6,ymm6
       jne       near ptr M00_L00
       vpcmpeqd  ymm6,ymm2,[7FF99D5649A0]
       vpcmpeqd  ymm7,ymm5,[7FF99D564980]
       vpand     ymm6,ymm6,ymm7
       vptest    ymm6,ymm6
       jne       near ptr M00_L01
       vcvtdq2pd zmm6,ymm2
       vcvtdq2pd zmm7,ymm5
       vdivpd    zmm2,zmm6,zmm7
       vcvttpd2dq ymm2,zmm2
       vextracti32x8 ymm1,zmm1,1
       vextracti32x8 ymm4,zmm4,1
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm4
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vpcmpeqd  ymm5,ymm1,[7FF99D5649A0]
       vpcmpeqd  ymm6,ymm4,[7FF99D564980]
       vpand     ymm5,ymm5,ymm6
       vptest    ymm5,ymm5
       jne       near ptr M00_L01
       vcvtdq2pd zmm5,ymm1
       vcvtdq2pd zmm6,ymm4
       vdivpd    zmm16,zmm5,zmm6
       vcvttpd2dq ymm16,zmm16
       vinserti32x8 zmm2,zmm2,ymm16,1
       vpmovdb   xmm2,zmm2
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm3
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vpcmpeqd  ymm5,ymm0,[7FF99D5649A0]
       vpcmpeqd  ymm6,ymm3,[7FF99D564980]
       vpand     ymm5,ymm5,ymm6
       vptest    ymm5,ymm5
       jne       near ptr M00_L01
       vcvtdq2pd zmm5,ymm0
       vcvtdq2pd zmm6,ymm3
       vdivpd    zmm0,zmm5,zmm6
       vcvttpd2dq ymm0,zmm0
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm1,[7FF99D5649A0]
       vpcmpeqd  ymm5,ymm4,[7FF99D564980]
       vpand     ymm3,ymm3,ymm5
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm1
       vcvtdq2pd zmm5,ymm4
       vdivpd    zmm1,zmm3,zmm5
       vcvttpd2dq ymm1,zmm1
       vinserti32x8 zmm0,zmm0,ymm1,1
       vpmovdb   xmm0,zmm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<SByte>.DivideBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.SByte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D14848]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D1484C]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movsx     rax,byte ptr [rax+r10]
       lea       rdx,[rsp]
       movsx     r9,byte ptr [rdx+r10]
       cdq
       idiv      r9d
       lea       rdx,[rsp+40]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,20
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.SByte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,48
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       vpmovsxbd zmm0,xmmword ptr [7FF99D574C40]
       vmovaps   zmm1,zmm0
       vmovaps   zmm2,zmm1
       vpmovsxbd zmm3,xmmword ptr [7FF99D574C50]
       vmovaps   zmm4,zmm3
       vmovaps   zmm5,zmm4
       vxorpd    ymm6,ymm6,ymm6
       vpcmpeqd  ymm6,ymm6,ymm5
       vptest    ymm6,ymm6
       jne       near ptr M00_L00
       vpcmpeqd  ymm6,ymm2,[7FF99D574C80]
       vpcmpeqd  ymm7,ymm5,[7FF99D574C60]
       vpand     ymm6,ymm6,ymm7
       vptest    ymm6,ymm6
       jne       near ptr M00_L01
       vcvtdq2pd zmm6,ymm2
       vcvtdq2pd zmm7,ymm5
       vdivpd    zmm2,zmm6,zmm7
       vcvttpd2dq ymm2,zmm2
       vextracti32x8 ymm1,zmm1,1
       vextracti32x8 ymm4,zmm4,1
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm4
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vpcmpeqd  ymm5,ymm1,[7FF99D574C80]
       vpcmpeqd  ymm6,ymm4,[7FF99D574C60]
       vpand     ymm5,ymm5,ymm6
       vptest    ymm5,ymm5
       jne       near ptr M00_L01
       vcvtdq2pd zmm5,ymm1
       vcvtdq2pd zmm6,ymm4
       vdivpd    zmm16,zmm5,zmm6
       vcvttpd2dq ymm16,zmm16
       vinserti32x8 zmm2,zmm2,ymm16,1
       vpmovdb   xmm2,zmm2
       vxorpd    ymm5,ymm5,ymm5
       vpcmpeqd  ymm5,ymm5,ymm3
       vptest    ymm5,ymm5
       jne       near ptr M00_L00
       vpcmpeqd  ymm5,ymm0,[7FF99D574C80]
       vpcmpeqd  ymm6,ymm3,[7FF99D574C60]
       vpand     ymm5,ymm5,ymm6
       vptest    ymm5,ymm5
       jne       near ptr M00_L01
       vcvtdq2pd zmm5,ymm0
       vcvtdq2pd zmm6,ymm3
       vdivpd    zmm0,zmm5,zmm6
       vcvttpd2dq ymm0,zmm0
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm1,[7FF99D574C80]
       vpcmpeqd  ymm5,ymm4,[7FF99D574C60]
       vpand     ymm3,ymm3,ymm5
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm1
       vcvtdq2pd zmm5,ymm4
       vdivpd    zmm1,zmm3,zmm5
       vcvttpd2dq ymm1,zmm1
       vinserti32x8 zmm0,zmm0,ymm1,1
       vpmovdb   xmm0,zmm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<UInt16>.DivisionOperatorBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8CF4710]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8CF4714]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movzx     eax,word ptr [rax+r10*2]
       lea       rdx,[rsp]
       movzx     r9d,word ptr [rdx+r10*2]
       xor       edx,edx
       div       r9d
       movzx     eax,ax
       lea       rdx,[rsp+40]
       mov       [rdx+r10*2],ax
       inc       r8d
       cmp       r8d,10
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpmovzxwd ymm0,xmmword ptr [7FF99D574500]
       vpmovzxwd ymm1,xmmword ptr [7FF99D574510]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vcvtudq2pd zmm2,ymm0
       vcvtudq2pd zmm3,ymm1
       vdivpd    zmm4,zmm2,zmm3
       vcvttpd2udq ymm4,zmm4
       vpmovdw   xmm2,ymm4
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm1
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm0
       vcvtudq2pd zmm4,ymm1
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2udq ymm0,zmm0
       vpmovdw   xmm0,ymm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3
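
For readers less used to AVX-512 mnemonics, the UInt16 Diff listing above can be read as the following per-lane scalar model. This is only a sketch: `div_u16_lanes` is a hypothetical name, not code from this PR, and it checks all divisors up front whereas the listing checks each 8-lane half just before dividing it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the UInt16 Diff listing; not the JIT's code.
 * Returns false where the listing would call CORINFO_HELP_THROWDIVZERO. */
bool div_u16_lanes(const uint16_t *num, const uint16_t *den,
                   uint16_t *out, size_t n)
{
    /* vpcmpeqd + vptest: reject the whole vector if any divisor is zero. */
    for (size_t i = 0; i < n; i++) {
        if (den[i] == 0) {
            return false;
        }
    }

    for (size_t i = 0; i < n; i++) {
        uint32_t a = num[i];              /* vpmovzxwd: zero-extend to 32 bits */
        uint32_t b = den[i];
        double q = (double)a / (double)b; /* vcvtudq2pd + vdivpd */
        uint32_t t = (uint32_t)q;         /* vcvttpd2udq: truncate toward zero */
        out[i] = (uint16_t)t;             /* vpmovdw: keep the low 16 bits */
    }
    return true;
}
```

Because every 32-bit integer is exactly representable as a double, the truncated quotient should match ordinary integer division for these element types.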

System.Numerics.Tests.Perf_VectorOf<UInt16>.DivideBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D14850]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D14854]
       vmovups   [rsp],ymm0
       xor       r8d,r8d
M00_L00:
       lea       rax,[rsp+20]
       movsxd    r10,r8d
       movzx     eax,word ptr [rax+r10*2]
       lea       rdx,[rsp]
       movzx     r9d,word ptr [rdx+r10*2]
       xor       edx,edx
       div       r9d
       movzx     eax,ax
       lea       rdx,[rsp+40]
       mov       [rdx+r10*2],ax
       inc       r8d
       cmp       r8d,10
       jl        short M00_L00
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpmovzxwd ymm0,xmmword ptr [7FF99D5746C0]
       vpmovzxwd ymm1,xmmword ptr [7FF99D5746D0]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vcvtudq2pd zmm2,ymm0
       vcvtudq2pd zmm3,ymm1
       vdivpd    zmm4,zmm2,zmm3
       vcvttpd2udq ymm4,zmm4
       vpmovdw   xmm2,ymm4
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm1
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm0
       vcvtudq2pd zmm4,ymm1
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2udq ymm0,zmm0
       vpmovdw   xmm0,ymm0
       vinserti128 ymm0,ymm2,xmm0,1
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<UInt32>.DivisionOperatorBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt32, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D24788]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D2478C]
       vmovups   [rsp],ymm0
       mov       eax,[rsp+20]
       xor       edx,edx
       div       dword ptr [rsp]
       mov       [rsp+40],eax
       mov       eax,[rsp+24]
       xor       edx,edx
       div       dword ptr [rsp+4]
       mov       [rsp+44],eax
       mov       eax,[rsp+28]
       xor       edx,edx
       div       dword ptr [rsp+8]
       mov       [rsp+48],eax
       mov       eax,[rsp+2C]
       xor       edx,edx
       div       dword ptr [rsp+0C]
       mov       [rsp+4C],eax
       mov       eax,[rsp+30]
       xor       edx,edx
       div       dword ptr [rsp+10]
       mov       [rsp+50],eax
       mov       eax,[rsp+34]
       xor       edx,edx
       div       dword ptr [rsp+14]
       mov       [rsp+54],eax
       mov       eax,[rsp+38]
       xor       edx,edx
       div       dword ptr [rsp+18]
       mov       [rsp+58],eax
       mov       eax,[rsp+3C]
       xor       edx,edx
       div       dword ptr [rsp+1C]
       mov       [rsp+5C],eax
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt32, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vbroadcastss ymm0,dword ptr [7FF99D5643C0]
       vbroadcastss ymm1,dword ptr [7FF99D5643C4]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vcvtudq2pd zmm2,ymm0
       vcvtudq2pd zmm3,ymm1
       vdivpd    zmm0,zmm2,zmm3
       vcvttpd2udq ymm0,zmm0
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Numerics.Tests.Perf_VectorOf<UInt32>.DivideBenchmark

Base:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt32, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,78
       mov       rcx,rdx
       vbroadcastss ymm0,dword ptr [7FF9F8D24908]
       vmovups   [rsp+20],ymm0
       vbroadcastss ymm0,dword ptr [7FF9F8D2490C]
       vmovups   [rsp],ymm0
       mov       eax,[rsp+20]
       xor       edx,edx
       div       dword ptr [rsp]
       mov       [rsp+40],eax
       mov       eax,[rsp+24]
       xor       edx,edx
       div       dword ptr [rsp+4]
       mov       [rsp+44],eax
       mov       eax,[rsp+28]
       xor       edx,edx
       div       dword ptr [rsp+8]
       mov       [rsp+48],eax
       mov       eax,[rsp+2C]
       xor       edx,edx
       div       dword ptr [rsp+0C]
       mov       [rsp+4C],eax
       mov       eax,[rsp+30]
       xor       edx,edx
       div       dword ptr [rsp+10]
       mov       [rsp+50],eax
       mov       eax,[rsp+34]
       xor       edx,edx
       div       dword ptr [rsp+14]
       mov       [rsp+54],eax
       mov       eax,[rsp+38]
       xor       edx,edx
       div       dword ptr [rsp+18]
       mov       [rsp+58],eax
       mov       eax,[rsp+3C]
       xor       edx,edx
       div       dword ptr [rsp+1C]
       mov       [rsp+5C],eax
       vmovups   ymm0,[rsp+40]
       vmovups   [rcx],ymm0
       mov       rax,rcx
       vzeroupper
       add       rsp,78
       ret

Diff:

; System.Numerics.Tests.Perf_VectorOf`1[[System.UInt32, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vbroadcastss ymm0,dword ptr [7FF99D574520]
       vbroadcastss ymm1,dword ptr [7FF99D574524]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vcvtudq2pd zmm2,ymm0
       vcvtudq2pd zmm3,ymm1
       vdivpd    zmm0,zmm2,zmm3
       vcvttpd2udq ymm0,zmm0
       vmovups   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.DivisionOperatorBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Byte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D24550]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       xor       r8d,r8d
       nop       word ptr [rax+rax]
M00_L00:
       lea       rax,[rsp+30]
       movsxd    r10,r8d
       movzx     eax,byte ptr [rax+r10]
       lea       rdx,[rsp+28]
       movzx     r9d,byte ptr [rdx+r10]
       xor       edx,edx
       div       r9d
       lea       rdx,[rsp+38]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,8
       jl        short M00_L00
       vpcmpeqd  xmm1,xmm1,xmm1
       vmovaps   [rsp+50],xmm1
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       xor       r10d,r10d
       nop       word ptr [rax+rax]
M00_L01:
       lea       rax,[rsp+18]
       movsxd    r9,r10d
       movzx     eax,byte ptr [rax+r9]
       lea       rdx,[rsp+10]
       movzx     r11d,byte ptr [rdx+r9]
       xor       edx,edx
       div       r11d
       lea       rdx,[rsp+20]
       mov       [rdx+r9],al
       inc       r10d
       cmp       r10d,8
       jl        short M00_L01
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Byte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovzxbd zmm0,xmm0
       vmovaps   zmm1,zmm0
       vpmovzxbd zmm2,xmmword ptr [7FF99D554540]
       vmovaps   zmm3,zmm2
       vxorpd    ymm4,ymm4,ymm4
       vpcmpeqd  ymm4,ymm4,ymm3
       vptest    ymm4,ymm4
       jne       short M00_L00
       vcvtudq2pd zmm4,ymm1
       vcvtudq2pd zmm5,ymm3
       vdivpd    zmm1,zmm4,zmm5
       vcvttpd2udq ymm1,zmm1
       vextracti32x8 ymm0,zmm0,1
       vextracti32x8 ymm2,zmm2,1
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm2
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm0
       vcvtudq2pd zmm4,ymm2
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2udq ymm0,zmm0
       vinserti32x8 zmm0,zmm1,ymm0,1
       vpmovdb   xmmword ptr [rdx],zmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.DivideBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Byte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D146F0]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       xor       r8d,r8d
       nop       word ptr [rax+rax]
M00_L00:
       lea       rax,[rsp+30]
       movsxd    r10,r8d
       movzx     eax,byte ptr [rax+r10]
       lea       rdx,[rsp+28]
       movzx     r9d,byte ptr [rdx+r10]
       xor       edx,edx
       div       r9d
       lea       rdx,[rsp+38]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,8
       jl        short M00_L00
       vpcmpeqd  xmm1,xmm1,xmm1
       vmovaps   [rsp+50],xmm1
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       xor       r10d,r10d
       nop       word ptr [rax+rax]
M00_L01:
       lea       rax,[rsp+18]
       movsxd    r9,r10d
       movzx     eax,byte ptr [rax+r9]
       lea       rdx,[rsp+10]
       movzx     r11d,byte ptr [rdx+r9]
       xor       edx,edx
       div       r11d
       lea       rdx,[rsp+20]
       mov       [rdx+r9],al
       inc       r10d
       cmp       r10d,8
       jl        short M00_L01
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Byte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovzxbd zmm0,xmm0
       vmovaps   zmm1,zmm0
       vpmovzxbd zmm2,xmmword ptr [7FF99D564700]
       vmovaps   zmm3,zmm2
       vxorpd    ymm4,ymm4,ymm4
       vpcmpeqd  ymm4,ymm4,ymm3
       vptest    ymm4,ymm4
       jne       short M00_L00
       vcvtudq2pd zmm4,ymm1
       vcvtudq2pd zmm5,ymm3
       vdivpd    zmm1,zmm4,zmm5
       vcvttpd2udq ymm1,zmm1
       vextracti32x8 ymm0,zmm0,1
       vextracti32x8 ymm2,zmm2,1
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm2
       vptest    ymm3,ymm3
       jne       short M00_L00
       vcvtudq2pd zmm3,ymm0
       vcvtudq2pd zmm4,ymm2
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2udq ymm0,zmm0
       vinserti32x8 zmm0,zmm1,ymm0,1
       vpmovdb   xmmword ptr [rdx],zmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.DivisionOperatorBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Int16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D14680]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       movsx     rax,word ptr [rsp+30]
       movsx     r8,word ptr [rsp+28]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+38],r8w
       movsx     rax,word ptr [rsp+32]
       movsx     r8,word ptr [rsp+2A]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3A],r8w
       movsx     rax,word ptr [rsp+34]
       movsx     r8,word ptr [rsp+2C]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3C],r8w
       movsx     rax,word ptr [rsp+36]
       movsx     r8,word ptr [rsp+2E]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3E],r8w
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+50],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D14680]
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       movsx     rax,word ptr [rsp+18]
       movsx     r10,word ptr [rsp+10]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+20],r10w
       movsx     rax,word ptr [rsp+1A]
       movsx     r10,word ptr [rsp+12]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+22],r10w
       movsx     rax,word ptr [rsp+1C]
       movsx     r10,word ptr [rsp+14]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+24],r10w
       movsx     rax,word ptr [rsp+1E]
       movsx     r10,word ptr [rsp+16]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+26],r10w
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Int16, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovsxwd ymm0,xmm0
       vpmovsxwd ymm1,xmmword ptr [7FF99D554460]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vpcmpeqd  ymm2,ymm0,[7FF99D5544A0]
       vpcmpeqd  ymm3,ymm1,[7FF99D554480]
       vpand     ymm2,ymm2,ymm3
       vptest    ymm2,ymm2
       jne       short M00_L01
       vcvtdq2pd zmm2,ymm0
       vcvtdq2pd zmm3,ymm1
       vdivpd    zmm0,zmm2,zmm3
       vcvttpd2dq ymm0,zmm0
       vpmovdw   xmmword ptr [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3
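
The signed listing above adds a second guard before the divide: a lane traps when the dividend equals one constant and the divisor equals another (the `vpcmpeqd`/`vpcmpeqd`/`vpand`/`vptest` sequence). The listing does not show the constants' values, so the sketch below assumes the divisor constant is -1 and takes the dividend constant as a parameter. `div_i16_lanes`, `div_status`, and `trap_dividend` are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DIV_OK, DIV_BY_ZERO, DIV_OVERFLOW } div_status;

/* Hypothetical scalar model of the Int16 Diff listing; not the JIT's code.
 * trap_dividend stands in for the first constant operand, whose value the
 * listing does not show. The divisor constant is ASSUMED to be -1. */
div_status div_i16_lanes(const int16_t *num, const int16_t *den,
                         int16_t *out, size_t n, int32_t trap_dividend)
{
    /* First guard: any zero divisor -> CORINFO_HELP_THROWDIVZERO. */
    for (size_t i = 0; i < n; i++) {
        if (den[i] == 0) {
            return DIV_BY_ZERO;
        }
    }
    /* Second guard: dividend and divisor both match -> CORINFO_HELP_OVERFLOW. */
    for (size_t i = 0; i < n; i++) {
        if ((int32_t)num[i] == trap_dividend && den[i] == -1) {
            return DIV_OVERFLOW;
        }
    }
    for (size_t i = 0; i < n; i++) {
        int32_t a = num[i];               /* vpmovsxwd: sign-extend to 32 bits */
        int32_t b = den[i];
        double q = (double)a / (double)b; /* vcvtdq2pd + vdivpd */
        int32_t t = (int32_t)q;           /* vcvttpd2dq: truncate toward zero */

        /* vpmovdw keeps the low 16 bits; memcpy avoids an
         * implementation-defined signed narrowing conversion. */
        uint16_t low = (uint16_t)(uint32_t)t;
        memcpy(&out[i], &low, sizeof low);
    }
    return DIV_OK;
}
```

Passing `INT16_MIN` as `trap_dividend` models trapping on the one quotient that does not fit in 16 bits; whether the JIT uses that value or the 32-bit minimum is not something the listing settles.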

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.DivideBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Int16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D04880]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       movsx     rax,word ptr [rsp+30]
       movsx     r8,word ptr [rsp+28]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+38],r8w
       movsx     rax,word ptr [rsp+32]
       movsx     r8,word ptr [rsp+2A]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3A],r8w
       movsx     rax,word ptr [rsp+34]
       movsx     r8,word ptr [rsp+2C]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3C],r8w
       movsx     rax,word ptr [rsp+36]
       movsx     r8,word ptr [rsp+2E]
       cdq
       idiv      r8d
       movsx     r8,ax
       mov       [rsp+3E],r8w
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+50],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D04880]
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       movsx     rax,word ptr [rsp+18]
       movsx     r10,word ptr [rsp+10]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+20],r10w
       movsx     rax,word ptr [rsp+1A]
       movsx     r10,word ptr [rsp+12]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+22],r10w
       movsx     rax,word ptr [rsp+1C]
       movsx     r10,word ptr [rsp+14]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+24],r10w
       movsx     rax,word ptr [rsp+1E]
       movsx     r10,word ptr [rsp+16]
       cdq
       idiv      r10d
       movsx     r10,ax
       mov       [rsp+26],r10w
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Int16, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovsxwd ymm0,xmm0
       vpmovsxwd ymm1,xmmword ptr [7FF99D554600]
       vxorpd    ymm2,ymm2,ymm2
       vpcmpeqd  ymm2,ymm2,ymm1
       vptest    ymm2,ymm2
       jne       short M00_L00
       vpcmpeqd  ymm2,ymm0,[7FF99D554640]
       vpcmpeqd  ymm3,ymm1,[7FF99D554620]
       vpand     ymm2,ymm2,ymm3
       vptest    ymm2,ymm2
       jne       short M00_L01
       vcvtdq2pd zmm2,ymm0
       vcvtdq2pd zmm3,ymm1
       vdivpd    zmm0,zmm2,zmm3
       vcvttpd2dq ymm0,zmm0
       vpmovdw   xmmword ptr [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.DivisionOperatorBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.SByte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D24550]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       xor       r8d,r8d
       nop       word ptr [rax+rax]
M00_L00:
       lea       rax,[rsp+30]
       movsxd    r10,r8d
       movsx     rax,byte ptr [rax+r10]
       lea       rdx,[rsp+28]
       movsx     r9,byte ptr [rdx+r10]
       cdq
       idiv      r9d
       lea       rdx,[rsp+38]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,8
       jl        short M00_L00
       vpcmpeqd  xmm1,xmm1,xmm1
       vmovaps   [rsp+50],xmm1
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       xor       r10d,r10d
       nop       dword ptr [rax]
M00_L01:
       lea       rax,[rsp+18]
       movsxd    r9,r10d
       movsx     rax,byte ptr [rax+r9]
       lea       rdx,[rsp+10]
       movsx     r11,byte ptr [rdx+r9]
       cdq
       idiv      r11d
       lea       rdx,[rsp+20]
       mov       [rdx+r9],al
       inc       r10d
       cmp       r10d,8
       jl        short M00_L01
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.SByte, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovsxbd zmm0,xmm0
       vmovaps   zmm1,zmm0
       vpmovsxbd zmm2,xmmword ptr [7FF99D564640]
       vmovaps   zmm3,zmm2
       vxorpd    ymm4,ymm4,ymm4
       vpcmpeqd  ymm4,ymm4,ymm3
       vptest    ymm4,ymm4
       jne       near ptr M00_L00
       vpcmpeqd  ymm4,ymm1,[7FF99D564680]
       vpcmpeqd  ymm5,ymm3,[7FF99D564660]
       vpand     ymm4,ymm4,ymm5
       vptest    ymm4,ymm4
       jne       near ptr M00_L01
       vcvtdq2pd zmm4,ymm1
       vcvtdq2pd zmm5,ymm3
       vdivpd    zmm1,zmm4,zmm5
       vcvttpd2dq ymm1,zmm1
       vextracti32x8 ymm0,zmm0,1
       vextracti32x8 ymm2,zmm2,1
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm2
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm0,[7FF99D564680]
       vpcmpeqd  ymm4,ymm2,[7FF99D564660]
       vpand     ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm0
       vcvtdq2pd zmm4,ymm2
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2dq ymm0,zmm0
       vinserti32x8 zmm0,zmm1,ymm0,1
       vpmovdb   xmmword ptr [rdx],zmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.DivideBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.SByte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D046F0]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       xor       r8d,r8d
       nop       word ptr [rax+rax]
M00_L00:
       lea       rax,[rsp+30]
       movsxd    r10,r8d
       movsx     rax,byte ptr [rax+r10]
       lea       rdx,[rsp+28]
       movsx     r9,byte ptr [rdx+r10]
       cdq
       idiv      r9d
       lea       rdx,[rsp+38]
       mov       [rdx+r10],al
       inc       r8d
       cmp       r8d,8
       jl        short M00_L00
       vpcmpeqd  xmm1,xmm1,xmm1
       vmovaps   [rsp+50],xmm1
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       xor       r10d,r10d
       nop       dword ptr [rax]
M00_L01:
       lea       rax,[rsp+18]
       movsxd    r9,r10d
       movsx     rax,byte ptr [rax+r9]
       lea       rdx,[rsp+10]
       movsx     r11,byte ptr [rdx+r9]
       cdq
       idiv      r11d
       lea       rdx,[rsp+20]
       mov       [rdx+r9],al
       inc       r10d
       cmp       r10d,8
       jl        short M00_L01
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.SByte, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vpmovsxbd zmm0,xmm0
       vmovaps   zmm1,zmm0
       vpmovsxbd zmm2,xmmword ptr [7FF99D554820]
       vmovaps   zmm3,zmm2
       vxorpd    ymm4,ymm4,ymm4
       vpcmpeqd  ymm4,ymm4,ymm3
       vptest    ymm4,ymm4
       jne       near ptr M00_L00
       vpcmpeqd  ymm4,ymm1,[7FF99D554860]
       vpcmpeqd  ymm5,ymm3,[7FF99D554840]
       vpand     ymm4,ymm4,ymm5
       vptest    ymm4,ymm4
       jne       near ptr M00_L01
       vcvtdq2pd zmm4,ymm1
       vcvtdq2pd zmm5,ymm3
       vdivpd    zmm1,zmm4,zmm5
       vcvttpd2dq ymm1,zmm1
       vextracti32x8 ymm0,zmm0,1
       vextracti32x8 ymm2,zmm2,1
       vxorpd    ymm3,ymm3,ymm3
       vpcmpeqd  ymm3,ymm3,ymm2
       vptest    ymm3,ymm3
       jne       short M00_L00
       vpcmpeqd  ymm3,ymm0,[7FF99D554860]
       vpcmpeqd  ymm4,ymm2,[7FF99D554840]
       vpand     ymm3,ymm3,ymm4
       vptest    ymm3,ymm3
       jne       short M00_L01
       vcvtdq2pd zmm3,ymm0
       vcvtdq2pd zmm4,ymm2
       vdivpd    zmm0,zmm3,zmm4
       vcvttpd2dq ymm0,zmm0
       vinserti32x8 zmm0,zmm1,ymm0,1
       vpmovdb   xmmword ptr [rdx],zmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
M00_L01:
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.DivisionOperatorBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.UInt32, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8CF44A8]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       mov       eax,[rsp+30]
       xor       edx,edx
       div       dword ptr [rsp+28]
       mov       [rsp+38],eax
       mov       eax,[rsp+34]
       xor       edx,edx
       div       dword ptr [rsp+2C]
       mov       [rsp+3C],eax
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+50],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8CF44A8]
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       mov       eax,[rsp+18]
       xor       edx,edx
       div       dword ptr [rsp+10]
       mov       [rsp+20],eax
       mov       eax,[rsp+1C]
       xor       edx,edx
       div       dword ptr [rsp+14]
       mov       [rsp+24],eax
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.UInt32, System.Private.CoreLib]].DivisionOperatorBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vbroadcastss xmm1,dword ptr [7FF99D584380]
       vxorpd    xmm2,xmm2,xmm2
       vpcmpeqd  xmm2,xmm2,xmm1
       vptest    xmm2,xmm2
       jne       short M00_L00
       vcvtudq2pd ymm2,xmm0
       vcvtudq2pd ymm3,xmm1
       vdivpd    ymm0,ymm2,ymm3
       vcvttpd2udq xmm0,ymm0
       vmovups   [rdx],xmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.DivideBenchmark

Base:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.UInt32, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,88
       mov       rcx,rdx
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+70],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D34608]
       vmovaps   [rsp+60],xmm0
       mov       rax,[rsp+70]
       mov       [rsp+30],rax
       mov       rax,[rsp+60]
       mov       [rsp+28],rax
       mov       eax,[rsp+30]
       xor       edx,edx
       div       dword ptr [rsp+28]
       mov       [rsp+38],eax
       mov       eax,[rsp+34]
       xor       edx,edx
       div       dword ptr [rsp+2C]
       mov       [rsp+3C],eax
       vpcmpeqd  xmm0,xmm0,xmm0
       vmovaps   [rsp+50],xmm0
       vbroadcastss xmm0,dword ptr [7FF9F8D34608]
       vmovaps   [rsp+40],xmm0
       mov       r8,[rsp+38]
       mov       rax,[rsp+58]
       mov       [rsp+18],rax
       mov       rax,[rsp+48]
       mov       [rsp+10],rax
       mov       eax,[rsp+18]
       xor       edx,edx
       div       dword ptr [rsp+10]
       mov       [rsp+20],eax
       mov       eax,[rsp+1C]
       xor       edx,edx
       div       dword ptr [rsp+14]
       mov       [rsp+24],eax
       mov       rax,[rsp+20]
       mov       [rsp],r8
       mov       [rsp+8],rax
       vmovaps   xmm0,[rsp]
       vmovups   [rcx],xmm0
       mov       rax,rcx
       add       rsp,88
       ret

Diff:

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.UInt32, System.Private.CoreLib]].DivideBenchmark()
       sub       rsp,28
       vpcmpeqd  xmm0,xmm0,xmm0
       vbroadcastss xmm1,dword ptr [7FF99D5844C0]
       vxorpd    xmm2,xmm2,xmm2
       vpcmpeqd  xmm2,xmm2,xmm1
       vptest    xmm2,xmm2
       jne       short M00_L00
       vcvtudq2pd ymm2,xmm0
       vcvtudq2pd ymm3,xmm1
       vdivpd    ymm0,ymm2,ymm3
       vcvttpd2udq xmm0,ymm0
       vmovups   [rdx],xmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_THROWDIVZERO
       int       3
       call      CORINFO_HELP_OVERFLOW
       int       3

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 23, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 23, 2025
@tannergooding
Member

Changes generally LGTM. There's a check that I think isn't quite doing what you intended, and I listed some ways we could accelerate this on more than just AVX512-capable hardware. I'll let you decide whether that's something you want to add support for or whether a tracking issue should be opened instead.

@tannergooding tannergooding requested a review from EgorBo July 24, 2025 17:08
@tannergooding tannergooding added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 24, 2025
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

Changes overall LGTM. A few suggestions on cleanup and simplification.

@alexcovington alexcovington force-pushed the vector-int-divide-short-byte branch from 9db3705 to 81dcd40 Compare September 2, 2025 18:19
Member

@tannergooding tannergooding left a comment

CC. @dotnet/jit-contrib for secondary review

Member

@EgorBo EgorBo left a comment

LGTM
@alexcovington does this have any test coverage on CI?

@EgorBo
Member

EgorBo commented Sep 15, 2025

@MihuBot

@tannergooding
Member

> does this have any test coverage on CI?

It does, which can be seen in some of the SPMI diffs. We have a divide test for each of the 10 core base types: https://github.com/dotnet/runtime/blob/main/src/tests/Common/GenerateHWIntrinsicTests/GenerateHWIntrinsicTests_General.cs#L850-L859
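
Separately from those generated tests, the core assumption behind the lowering (that truncating a double quotient matches integer division) is cheap to sanity-check exhaustively for the 8-bit types. The sketch below is not part of the PR or the linked generator, and both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: counts (a, b) byte pairs where truncating the double
 * quotient differs from unsigned integer division. */
unsigned count_u8_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned a = 0; a <= UINT8_MAX; a++) {
        for (unsigned b = 1; b <= UINT8_MAX; b++) {
            uint32_t via_double = (uint32_t)((double)a / (double)b);
            if (via_double != a / b) {
                bad++;
            }
        }
    }
    return bad;
}

/* Hypothetical helper: same check for sbyte-range operands. The division is
 * done in int, so -128 / -1 = 128 does not overflow here. */
unsigned count_i8_mismatches(void)
{
    unsigned bad = 0;
    for (int a = INT8_MIN; a <= INT8_MAX; a++) {
        for (int b = INT8_MIN; b <= INT8_MAX; b++) {
            if (b == 0) {
                continue; /* the zero-divisor case is handled by the throw path */
            }
            int32_t via_double = (int32_t)((double)a / (double)b);
            if (via_double != a / b) {
                bad++;
            }
        }
    }
    return bad;
}
```

Both counters should come back as zero. The same argument extends to 32-bit operands: the rounding error of the double quotient stays below 1/b, which is too small to carry a non-integer quotient across the next integer.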
