feat: layernorm rms replacement for T5 #107

Merged
merged 10 commits into from
Oct 27, 2022

Conversation

gaetansnl
Contributor

@gaetansnl gaetansnl commented Oct 18, 2022

This PR requires a full test run because it modifies the replacement code.

@github-actions github-actions bot added feature and removed feature labels Oct 18, 2022
@pommedeterresautee
Member

Can you check whether there is an error in the reference implementation?
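
For context on the question above, here is a minimal sketch of RMS normalization as it is commonly defined for T5: the input is scaled by the reciprocal root mean square, with no mean subtraction and no bias. The reference implementation under review is not shown in this thread, so the function names, the pure-Python list representation, and the `eps` default below are assumptions for illustration only.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMS-normalize a vector (hypothetical helper, not the PR's code).

    Unlike standard layer norm, the mean is not subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def layer_norm(x, weight, bias, eps=1e-6):
    """Standard layer norm over a vector, shown only for comparison."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * inv_std + b for v, w, b in zip(x, weight, bias)]


if __name__ == "__main__":
    x = [3.0, 4.0]
    print(rms_norm(x, [1.0, 1.0]))
    print(layer_norm(x, [1.0, 1.0], [0.0, 0.0]))
```

On zero-mean input with a zero bias the two functions agree, which gives a quick sanity check when comparing a replacement kernel against a reference.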

@gaetansnl gaetansnl marked this pull request as ready for review October 25, 2022 09:05
@gaetansnl
Contributor Author

IMO this needs a full test run and a benchmark comparison on a 3090. I will post the A10G results.

@gaetansnl gaetansnl changed the title feat: rms replacement base feat: layernorm rms replacement for T5 Oct 25, 2022
@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@gaetansnl
Contributor Author

A10G, without the RMS replacement

test/test_torchdynamo.py ................................                                                                                                                                                                         [100%]
shape=(1, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-t5-small]                      10.8974 (1.0)    11.1234 (1.0)  10.8347 (1.0)  12.2346 (1.0)  12.8946 (1.0)  13.2912 (1.0)  11.5108 (1.0)  15.1023 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small]  1.7125 (6.36)    1.7125 (6.5)   1.7101 (6.34)  1.7148 (7.13)  1.7771 (7.26)  1.7833 (7.45)  1.7644 (6.52)  1.8999 (7.95)

shape=(1, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-t5-small]                      10.2884 (1.0)    10.3485 (1.0)  10.1981 (1.0)  10.8974 (1.0)  10.7813 (1.0)  11.0701 (1.0)  10.7284 (1.0)  12.5391 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small]  1.2064 (8.53)    1.2064 (8.58)  1.2045 (8.47)  1.2083 (9.02)  1.266 (8.52)   1.2699 (8.72)  1.2596 (8.52)  1.3924 (9.01)

shape=(1, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-t5-small]                      10.2764 (1.0)    10.6193 (1.0)  10.2149 (1.0)  11.9115 (1.0)  10.8126 (1.0)  10.9122 (1.0)  10.6578 (1.0)  11.9531 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small]  2.434 (4.22)     2.4342 (4.36)  2.4318 (4.2)   2.4367 (4.89)  2.4996 (4.33)  2.5131 (4.34)  2.4898 (4.28)  2.6649 (4.49)

shape=(1, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-t5-small]                      9.8122 (1.0)     9.8136 (1.0)   9.7389 (1.0)  9.8744 (1.0)   10.3948 (1.0)  10.4958 (1.0)  10.3239 (1.0)  11.3421 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small]  1.2999 (7.55)    1.2999 (7.55)  1.2981 (7.5)  1.3025 (7.58)  1.3595 (7.65)  1.3627 (7.7)   1.3517 (7.64)  1.4769 (7.68)

shape=(1, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-t5-small]                      10.4739 (1.0)    10.5035 (1.0)  10.3692 (1.0)  10.6571 (1.0)  10.9789 (1.0)  11.1108 (1.0)  10.9487 (1.0)  11.9647 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small]  3.0922 (3.39)    3.0926 (3.4)   3.0873 (3.36)  3.0972 (3.44)  3.1538 (3.48)  3.1633 (3.51)  3.1445 (3.48)  3.2996 (3.63)

shape=(1, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-t5-small]                      10.4758 (1.0)    10.4756 (1.0)  10.3837 (1.0)  10.602 (1.0)   10.9657 (1.0)  11.1505 (1.0)  10.9168 (1.0)  12.5448 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small]  3.8435 (2.73)    3.8435 (2.73)  3.8371 (2.71)  3.8477 (2.76)  3.9102 (2.8)   3.9153 (2.85)  3.8989 (2.8)   4.0407 (3.1)

shape=(32, 128) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-t5-small]                      24.2754 (1.0)    24.274 (1.0)    24.2648 (1.0)   24.2801 (1.0)   24.3513 (1.0)   24.4583 (1.0)   24.3388 (1.0)   24.7558 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small]  17.7034 (1.37)   17.7038 (1.37)  17.6949 (1.37)  17.7172 (1.37)  17.7806 (1.37)  17.8018 (1.37)  17.7695 (1.37)  17.8853 (1.38)

shape=(32, 16) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-t5-small]                      11.1029 (1.0)    11.3133 (1.0)  10.9093 (1.0)  12.162 (1.0)   11.3779 (1.0)  11.5764 (1.0)  11.3415 (1.0)  12.5898 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small]  2.127 (5.22)     2.1272 (5.32)  2.1229 (5.14)  2.1313 (5.71)  2.1855 (5.21)  2.1901 (5.29)  2.1795 (5.2)   2.3096 (5.45)

shape=(32, 256) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-t5-small]                      55.8947 (1.0)    55.8947 (1.0)   55.8947 (1.0)   55.8947 (1.0)   56.7266 (1.0)   56.7266 (1.0)   56.7266 (1.0)   56.7266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small]  39.1213 (1.43)   39.1244 (1.43)  39.1213 (1.43)  39.1274 (1.43)  39.3141 (1.44)  39.3228 (1.44)  39.3141 (1.44)  39.3315 (1.44)

shape=(32, 33) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-t5-small]                      10.3974 (1.0)    10.5949 (1.0)  10.301 (1.0)   12.04 (1.0)    10.8182 (1.0)  10.921 (1.0)  10.7937 (1.0)  11.504 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small]  3.6937 (2.81)    3.6939 (2.87)  3.6882 (2.79)  3.6981 (3.26)  3.7561 (2.88)  3.7643 (2.9)  3.7509 (2.88)  3.8989 (2.95)

shape=(8, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-t5-small]                      10.3795 (1.0)    10.4239 (1.0)  10.2703 (1.0)  10.8493 (1.0)  11.1293 (1.0)  11.2902 (1.0)  10.9973 (1.0)  12.1218 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small]  4.1269 (2.52)    4.1267 (2.53)  4.1203 (2.49)  4.1326 (2.63)  4.1976 (2.65)  4.2113 (2.68)  4.1896 (2.62)  4.3273 (2.8)

shape=(8, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16-t5-small]                      11.3719 (1.0)    11.4684 (1.0)  11.3335 (1.0)  11.8419 (1.0)  12.0279 (1.0)  12.2447 (1.0)  11.9642 (1.0)  13.3369 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small]  1.4376 (7.91)    1.4375 (7.98)  1.4347 (7.9)   1.4404 (8.22)  1.497 (8.03)   1.5005 (8.16)  1.491 (8.02)   1.6273 (8.2)

shape=(8, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -----------
test_benchmark_implementations[baseline-8x256-t5-small]                      14.5639 (1.0)    14.5639 (1.0)  14.5597 (1.0)  14.567 (1.0)   14.9526 (1.0)  15.0748 (1.0)  14.9329 (1.0)  15.54 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small]  9.7372 (1.5)     9.7379 (1.5)   9.728 (1.5)    9.7488 (1.49)  9.8034 (1.53)  9.8208 (1.53)  9.7916 (1.53)  9.96 (1.56)

shape=(8, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-t5-small]                      11.1895 (1.0)    11.2395 (1.0)  11.082 (1.0)   11.604 (1.0)  11.7356 (1.0)  11.8823 (1.0)  11.6374 (1.0)  12.6045 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small]  1.9298 (5.8)     1.9299 (5.82)  1.9281 (5.75)  1.932 (6.01)  1.9875 (5.9)   1.9925 (5.96)  1.9826 (5.87)  2.1161 (5.96)

shape=(8, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-8x384-t5-small]                      27.929 (1.0)     27.919 (1.0)    27.8952 (1.0)   27.9327 (1.0)   28.0908 (1.0)   28.2274 (1.0)   28.0879 (1.0)   28.5035 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small]  17.6612 (1.58)   17.6622 (1.58)  17.6372 (1.58)  17.6796 (1.58)  17.7526 (1.58)  17.7631 (1.59)  17.7249 (1.58)  17.8466 (1.6)

shape=(8, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-t5-small]                      47.5101 (1.0)    47.6515 (1.0)   47.5101 (1.0)   47.7929 (1.0)   47.9705 (1.0)   48.2465 (1.0)  47.9705 (1.0)   48.5224 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small]  26.7525 (1.78)   26.7271 (1.78)  26.6523 (1.78)  26.7764 (1.78)  26.8328 (1.79)  26.84 (1.8)    26.8281 (1.79)  26.8589 (1.81)
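
A note on reading these tables: in the rows shown, the number in parentheses matches the baseline value divided by that row's value, so a larger number means a larger speedup over the baseline. The sketch below checks this on the `1x128` median from the first table. The helper names and the cell-parsing regex are hypothetical and not part of the benchmark tooling.

```python
import re

# Matches a table cell such as "1.7125 (6.36)": a timing followed by a ratio.
CELL = re.compile(r"([0-9.]+)\s*\(([0-9.]+)\)")


def parse_cell(cell):
    """Split a cell into (timing, ratio). Hypothetical helper."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def speedup(baseline, optimized):
    """Baseline timing divided by optimized timing."""
    return baseline / optimized


if __name__ == "__main__":
    # Median (CUDA) cells for shape=(1, 128), copied from the table above.
    base, _ = parse_cell("10.8974 (1.0)")
    opt, reported = parse_cell("1.7125 (6.36)")
    print(f"computed {speedup(base, opt):.2f}x, reported {reported}x")
```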

@gaetansnl
Contributor Author

A10G, with the RMS replacement

shape=(1, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-t5-small]                      11.029 (1.0)     11.0453 (1.0)  10.9686 (1.0)  11.1269 (1.0)  11.5238 (1.0)  12.3192 (1.0)  11.4882 (1.0)  15.8571 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small]  1.4235 (7.75)    1.4236 (7.76)  1.4214 (7.72)  1.426 (7.8)    1.4896 (7.74)  1.5029 (8.2)   1.4765 (7.78)  1.5986 (9.92)

shape=(1, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)     Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  -------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x16-t5-small]                      10.5959 (1.0)    10.5788 (1.0)   10.3835 (1.0)   10.946 (1.0)   11.0055 (1.0)   11.1147 (1.0)   10.9274 (1.0)   11.774 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small]  0.9577 (11.06)   0.9577 (11.05)  0.9548 (10.88)  0.9598 (11.4)  1.0146 (10.85)  1.0181 (10.92)  1.0086 (10.83)  1.1328 (10.39)

shape=(1, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-t5-small]                      10.7452 (1.0)    10.8366 (1.0)  10.3772 (1.0)  12.067 (1.0)   11.1614 (1.0)  11.5054 (1.0)  10.9191 (1.0)  13.2872 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small]  2.1578 (4.98)    2.1577 (5.02)  2.1557 (4.81)  2.1605 (5.59)  2.2151 (5.04)  2.2199 (5.18)  2.2112 (4.94)  2.3451 (5.67)

shape=(1, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-t5-small]                      10.0617 (1.0)    10.0719 (1.0)  10.0033 (1.0)  10.1398 (1.0)  10.5688 (1.0)  10.7208 (1.0)  10.507 (1.0)   11.4625 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small]  1.0553 (9.53)    1.0554 (9.54)  1.0521 (9.51)  1.0588 (9.58)  1.1139 (9.49)  1.1188 (9.58)  1.1042 (9.52)  1.2549 (9.13)

shape=(1, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-t5-small]                      10.6152 (1.0)    10.6547 (1.0)  10.5708 (1.0)  10.8305 (1.0)  11.201 (1.0)   11.3768 (1.0)  11.142 (1.0)   12.0004 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small]  2.778 (3.82)     2.7777 (3.84)  2.7732 (3.81)  2.7818 (3.89)  2.8404 (3.94)  2.8441 (4.0)   2.8275 (3.94)  2.9572 (4.06)

shape=(1, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-t5-small]                      10.6596 (1.0)    10.6931 (1.0)  10.5691 (1.0)  10.908 (1.0)   11.2008 (1.0)  11.4652 (1.0)  11.0623 (1.0)  12.7114 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small]  3.4962 (3.05)    3.4965 (3.06)  3.4911 (3.03)  3.5024 (3.11)  3.5558 (3.15)  3.5601 (3.22)  3.5503 (3.12)  3.6653 (3.47)

shape=(32, 128) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-t5-small]                      24.2781 (1.0)    24.2795 (1.0)   24.2635 (1.0)   24.29 (1.0)     24.3616 (1.0)   24.4732 (1.0)   24.3556 (1.0)   24.8117 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small]  13.2036 (1.84)   13.2028 (1.84)  13.1913 (1.84)  13.2106 (1.84)  13.2764 (1.83)  13.2893 (1.84)  13.2639 (1.84)  13.3791 (1.85)

shape=(32, 16) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-t5-small]                      11.0888 (1.0)    11.1226 (1.0)  11.0654 (1.0)  11.2716 (1.0)  11.8778 (1.0)  12.2726 (1.0)  11.6101 (1.0)  15.0215 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small]  1.7764 (6.24)    1.7763 (6.26)  1.7724 (6.24)  1.7792 (6.34)  1.8328 (6.48)  1.8397 (6.67)  1.8274 (6.35)  1.9587 (7.67)

shape=(32, 256) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-t5-small]                      55.9114 (1.0)    55.9114 (1.0)   55.9114 (1.0)   55.9114 (1.0)   56.553 (1.0)    56.553 (1.0)    56.553 (1.0)    56.553 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small]  30.487 (1.83)    30.4806 (1.83)  30.4509 (1.84)  30.5037 (1.83)  30.5607 (1.85)  30.5848 (1.85)  30.5551 (1.85)  30.6387 (1.85)

shape=(32, 33) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-t5-small]                      10.6136 (1.0)    10.6388 (1.0)  10.5168 (1.0)  10.8556 (1.0)  11.1354 (1.0)  11.1911 (1.0)  11.0141 (1.0)  11.8491 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small]  3.1233 (3.4)     3.1233 (3.41)  3.1189 (3.37)  3.128 (3.47)   3.1838 (3.5)   3.1875 (3.51)  3.1768 (3.47)  3.301 (3.59)

shape=(8, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-t5-small]                      10.5112 (1.0)    10.5188 (1.0)  10.364 (1.0)   10.7777 (1.0)  10.9929 (1.0)  11.1348 (1.0)  10.9541 (1.0)  12.1555 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small]  3.5589 (2.95)    3.5586 (2.96)  3.5511 (2.92)  3.5644 (3.02)  3.6237 (3.03)  3.6302 (3.07)  3.612 (3.03)   3.7298 (3.26)

shape=(8, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min            Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  -------------  -------------
test_benchmark_implementations[baseline-8x16-t5-small]                      11.6463 (1.0)    11.6482 (1.0)   11.5398 (1.0)   11.8025 (1.0)   12.0958 (1.0)   12.2655 (1.0)   11.9916 (1.0)  13.0905 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small]  1.1495 (10.13)   1.1495 (10.13)  1.1468 (10.06)  1.1515 (10.25)  1.2064 (10.03)  1.2097 (10.14)  1.1996 (10.0)  1.3237 (9.89)

shape=(8, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-t5-small]                      14.5856 (1.0)    14.586 (1.0)   14.5731 (1.0)  14.601 (1.0)   15.1165 (1.0)  15.168 (1.0)  14.9603 (1.0)  15.5511 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small]  7.9708 (1.83)    7.9715 (1.83)  7.9562 (1.83)  7.9966 (1.83)  8.0344 (1.88)  8.045 (1.89)  8.0172 (1.87)  8.1483 (1.91)

shape=(8, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-t5-small]                      11.4275 (1.0)    11.5003 (1.0)  11.3355 (1.0)  11.7752 (1.0)  11.9373 (1.0)  12.1501 (1.0)  11.8968 (1.0)  13.5944 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small]  1.6373 (6.98)    1.6375 (7.02)  1.6356 (6.93)  1.6396 (7.18)  1.6983 (7.03)  1.7026 (7.14)  1.6896 (7.04)  1.8334 (7.41)

shape=(8, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-t5-small]                      27.9311 (1.0)    27.9374 (1.0)   27.9285 (1.0)   27.9526 (1.0)   28.3195 (1.0)   28.3491 (1.0)   28.1371 (1.0)   28.5907 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small]  14.2856 (1.96)   14.2857 (1.96)  14.2685 (1.96)  14.3028 (1.95)  14.3354 (1.98)  14.3622 (1.97)  14.3136 (1.97)  14.4246 (1.98)

shape=(8, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-t5-small]                      47.4965 (1.0)    47.6494 (1.0)   47.4965 (1.0)   47.8023 (1.0)   48.1978 (1.0)   48.4258 (1.0)   48.1978 (1.0)   48.6537 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small]  22.015 (2.16)    22.0358 (2.16)  21.9808 (2.16)  22.0826 (2.16)  22.0265 (2.19)  22.0977 (2.19)  22.0088 (2.19)  22.3052 (2.18)
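
To compare the two A10G runs directly, the optimized medians can be divided shape by shape. The sketch below does this for three shapes, using `Median (CUDA)` values of the `dynamo_optimized_cuda_graphs` rows copied from the tables above. The dictionary layout and the function name are illustrative assumptions, and the timings are left in whatever unit the tables use.

```python
# shape -> (median without replacement, median with RMS replacement)
MEDIAN_CUDA = {
    "1x16": (1.2064, 0.9577),
    "32x128": (17.7034, 13.2036),
    "8x512": (26.7525, 22.015),
}


def gain(without_rms, with_rms):
    """How many times faster the run with the RMS replacement is."""
    return without_rms / with_rms


if __name__ == "__main__":
    for shape, (before, after) in MEDIAN_CUDA.items():
        print(f"{shape}: {before:.4f} -> {after:.4f} ({gain(before, after):.2f}x)")
```

The baseline rows are close to identical between the two runs, so the difference in the optimized rows is a reasonable first estimate of what the replacement contributes on this GPU.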


@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@gaetansnl
Contributor Author

gaetansnl commented Oct 25, 2022

A10G, BERT on the feat/rms-replacement branch, to check for regressions

shape=(1, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased]                      6.3768 (1.0)     6.9185 (1.0)   6.2511 (1.0)   8.3863 (1.0)   6.8144 (1.0)   6.9155 (1.0)   6.7677 (1.0)   8.009 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased]  1.6157 (3.95)    1.6156 (4.28)  1.6137 (3.87)  1.6178 (5.18)  1.6695 (4.08)  1.6734 (4.13)  1.6628 (4.07)  1.7979 (4.45)

shape=(1, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased]                      6.1886 (1.0)     6.2148 (1.0)   6.1249 (1.0)   6.4814 (1.0)   6.6933 (1.0)   6.7711 (1.0)   6.6501 (1.0)   7.4266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased]  0.7631 (8.11)    0.7632 (8.14)  0.7608 (8.05)  0.7665 (8.46)  0.8148 (8.21)  0.8175 (8.28)  0.8093 (8.22)  0.9219 (8.06)

shape=(1, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased]                      6.2659 (1.0)     6.2773 (1.0)   6.1705 (1.0)   6.4937 (1.0)  6.812 (1.0)    7.0869 (1.0)   6.7092 (1.0)   8.4995 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased]  1.9217 (3.26)    1.9217 (3.27)  1.9181 (3.22)  1.925 (3.37)  1.9767 (3.45)  1.9811 (3.58)  1.9697 (3.41)  2.0793 (4.09)

shape=(1, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased]                      6.3605 (1.0)     6.3749 (1.0)   6.333 (1.0)    6.4586 (1.0)   6.95 (1.0)     7.1235 (1.0)  6.87 (1.0)     8.7455 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased]  0.8979 (7.08)    0.8979 (7.1)   0.8956 (7.07)  0.9007 (7.17)  0.9495 (7.32)  0.954 (7.47)  0.9436 (7.28)  1.0593 (8.26)

shape=(1, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased]                      6.3904 (1.0)     6.4845 (1.0)   6.3237 (1.0)   7.2649 (1.0)   6.8681 (1.0)   6.9416 (1.0)   6.8218 (1.0)   7.7629 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased]  2.1564 (2.96)    2.1567 (3.01)  2.1528 (2.94)  2.1602 (3.36)  2.2105 (3.11)  2.2162 (3.13)  2.2063 (3.09)  2.3393 (3.32)

shape=(1, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased]                      6.032 (1.0)      6.0298 (1.0)   5.9541 (1.0)   6.133 (1.0)    6.601 (1.0)    6.8261 (1.0)   6.5124 (1.0)   8.4258 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased]  3.0147 (2.0)     3.0151 (2.0)   3.0112 (1.98)  3.0203 (2.03)  3.0723 (2.15)  3.0772 (2.22)  3.0663 (2.12)  3.1782 (2.65)

shape=(32, 128) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased]                      26.7683 (1.0)    26.7678 (1.0)   26.7457 (1.0)   26.7895 (1.0)   26.9032 (1.0)   26.9335 (1.0)   26.8283 (1.0)   27.0688 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased]  14.2598 (1.88)   14.2607 (1.88)  14.2554 (1.88)  14.2687 (1.88)  14.3311 (1.88)  14.3507 (1.88)  14.3261 (1.87)  14.4651 (1.87)

shape=(32, 16) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased]                      6.4059 (1.0)     6.4324 (1.0)   6.3587 (1.0)   6.6748 (1.0)  6.9709 (1.0)   7.2847 (1.0)   6.8709 (1.0)   9.3627 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased]  2.6115 (2.45)    2.6115 (2.46)  2.6085 (2.44)  2.616 (2.55)  2.6687 (2.61)  2.6731 (2.73)  2.6593 (2.58)  2.7939 (3.35)

shape=(32, 256) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased]                      59.766 (1.0)     59.766 (1.0)   59.766 (1.0)   59.766 (1.0)  60.678 (1.0)    60.678 (1.0)    60.678 (1.0)    60.678 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased]  28.4519 (2.1)    28.4496 (2.1)  28.4429 (2.1)  28.454 (2.1)  28.3177 (2.14)  28.3557 (2.14)  28.3151 (2.14)  28.4343 (2.13)

shape=(32, 33) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median        Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  ------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased]                      7.4187 (1.0)     7.4177 (1.0)   7.4107 (1.0)   7.4234 (1.0)   7.7245 (1.0)  7.7755 (1.0)   7.7064 (1.0)   8.3104 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased]  4.9207 (1.51)    4.921 (1.51)   4.9152 (1.51)  4.9261 (1.51)  4.98 (1.55)   4.9858 (1.56)  4.9735 (1.55)  5.0863 (1.63)

shape=(8, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased]                      7.8429 (1.0)     7.8438 (1.0)   7.839 (1.0)   7.8521 (1.0)   8.1116 (1.0)   8.1607 (1.0)   8.1048 (1.0)   8.5234 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased]  4.807 (1.63)     4.8067 (1.63)  4.803 (1.63)  4.8089 (1.63)  4.8633 (1.67)  4.8688 (1.68)  4.8541 (1.67)  4.9637 (1.72)

shape=(8, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased]                      7.2084 (1.0)     7.7711 (1.0)   6.6948 (1.0)   9.7749 (1.0)   7.1956 (1.0)   7.2967 (1.0)   7.1063 (1.0)   8.5743 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased]  1.5118 (4.77)    1.5117 (5.14)  1.5083 (4.44)  1.5145 (6.45)  1.5616 (4.61)  1.5637 (4.67)  1.5583 (4.56)  1.669 (5.14)

shape=(8, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased]                      17.603 (1.0)     17.6065 (1.0)  17.5904 (1.0)  17.6345 (1.0)  17.7363 (1.0)  17.7816 (1.0)  17.7234 (1.0)  17.9846 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased]  7.954 (2.21)     7.9557 (2.21)  7.9492 (2.21)  7.9723 (2.21)  8.0174 (2.21)  8.0355 (2.21)  8.0111 (2.21)  8.1668 (2.2)

shape=(8, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median        Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  ------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased]                      6.4695 (1.0)     6.4796 (1.0)   6.3939 (1.0)   6.6359 (1.0)   7.0264 (1.0)  7.0725 (1.0)   6.9219 (1.0)   7.6823 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased]  1.8465 (3.5)     1.8467 (3.51)  1.8424 (3.47)  1.8515 (3.58)  1.8987 (3.7)  1.9013 (3.72)  1.8919 (3.66)  2.0079 (3.83)

shape=(8, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased]                      26.6721 (1.0)    26.6725 (1.0)   26.6649 (1.0)   26.6805 (1.0)   26.748 (1.0)    26.8092 (1.0)  26.6933 (1.0)   26.9863 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased]  11.3699 (2.35)   11.3739 (2.35)  11.3636 (2.35)  11.4051 (2.34)  11.4345 (2.34)  11.464 (2.34)  11.4248 (2.34)  11.6134 (2.32)

shape=(8, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased]                      40.1401 (1.0)    40.1565 (1.0)   40.1401 (1.0)   40.1729 (1.0)   40.2056 (1.0)   40.3196 (1.0)   40.2056 (1.0)   40.4336 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased]  15.1022 (2.66)   15.1223 (2.66)  15.0967 (2.66)  15.2206 (2.64)  15.1669 (2.65)  15.2034 (2.65)  15.1637 (2.65)  15.3519 (2.63)
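
A note for reading the tables above: the value in parentheses appears to be the baseline timing divided by the row's timing, i.e. the speedup factor. Here is a quick check in Python against the 1x256 and 8x512 rows. The helper name `speedup` is made up for illustration and is not part of kernl.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Baseline time divided by optimized time, rounded to 2 decimals like the tables."""
    return round(baseline / optimized, 2)


# Median (CUDA) values copied from the 1x256 and 8x512 tables above.
ratio_1x256 = speedup(6.2659, 1.9217)
ratio_8x512 = speedup(40.1401, 15.1022)

print(ratio_1x256, ratio_8x512)
```

Both results match the parenthesized figures in the corresponding `dynamo_optimized_cuda_graphs` rows.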

@gaetansnl
Contributor Author

gaetansnl commented Oct 25, 2022

A10G, BERT, current main branch (for regression comparison)

shape=(1, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median        Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  ------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased]                      6.4745 (1.0)     6.5347 (1.0)   6.4249 (1.0)  6.8636 (1.0)   7.0102 (1.0)  7.073 (1.0)    6.9157 (1.0)   7.9081 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased]  1.6136 (4.01)    1.6137 (4.05)  1.611 (3.99)  1.6166 (4.25)  1.6692 (4.2)  1.6725 (4.23)  1.6633 (4.16)  1.771 (4.47)

shape=(1, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased]                      6.3166 (1.0)     6.3661 (1.0)   6.2621 (1.0)   6.6915 (1.0)   6.9763 (1.0)   7.0185 (1.0)   6.777 (1.0)    7.5827 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased]  0.763 (8.28)     0.763 (8.34)   0.7612 (8.23)  0.7649 (8.75)  0.8172 (8.54)  0.8222 (8.54)  0.8113 (8.35)  0.9263 (8.19)

shape=(1, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased]                      6.4301 (1.0)     6.4301 (1.0)   6.3726 (1.0)   6.5064 (1.0)   6.8834 (1.0)   6.9537 (1.0)   6.8297 (1.0)   7.7801 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased]  1.9245 (3.34)    1.9244 (3.34)  1.9198 (3.32)  1.9274 (3.38)  1.9832 (3.47)  1.9931 (3.49)  1.9739 (3.46)  2.0981 (3.71)

shape=(1, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)      Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  --------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased]                      7.3055 (1.0)     7.5118 (1.0)   6.5675 (1.0)   10.5772 (1.0)   7.0373 (1.0)   7.0921 (1.0)   6.9857 (1.0)   7.7935 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased]  0.8978 (8.14)    0.8977 (8.37)  0.8951 (7.34)  0.9001 (11.75)  0.9564 (7.36)  0.9592 (7.39)  0.9462 (7.38)  1.06 (7.35)

shape=(1, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased]                      6.5506 (1.0)     6.7648 (1.0)   6.4632 (1.0)   8.3688 (1.0)   7.0997 (1.0)   7.2304 (1.0)  7.0263 (1.0)   7.9431 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased]  2.1527 (3.04)    2.1526 (3.14)  2.1493 (3.01)  2.1551 (3.88)  2.2093 (3.21)  2.217 (3.26)  2.2028 (3.19)  2.3563 (3.37)

shape=(1, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased]                      6.1592 (1.0)     6.183 (1.0)    6.1239 (1.0)   6.2917 (1.0)  6.7093 (1.0)   6.9006 (1.0)   6.6471 (1.0)   8.5638 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased]  3.0143 (2.04)    3.0142 (2.05)  3.0116 (2.03)  3.018 (2.08)  3.0832 (2.18)  3.0863 (2.24)  3.0631 (2.17)  3.1934 (2.68)

shape=(32, 128) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
-------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased]                      26.787 (1.0)     26.788 (1.0)    26.7816 (1.0)   26.7953 (1.0)   26.9505 (1.0)   26.9582 (1.0)  26.8213 (1.0)   27.1029 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased]  14.2549 (1.88)   14.2576 (1.88)  14.2537 (1.88)  14.2668 (1.88)  14.3316 (1.88)  14.351 (1.88)  14.3231 (1.87)  14.4744 (1.87)

shape=(32, 16) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased]                      6.5405 (1.0)     6.5564 (1.0)   6.4965 (1.0)   6.6395 (1.0)   7.0734 (1.0)   7.2476 (1.0)   7.0176 (1.0)   9.3778 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased]  2.6129 (2.5)     2.6132 (2.51)  2.6103 (2.49)  2.6183 (2.54)  2.6704 (2.65)  2.6739 (2.71)  2.6638 (2.63)  2.7846 (3.37)

shape=(32, 256) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased]                      59.7668 (1.0)    59.7668 (1.0)  59.7668 (1.0)  59.7668 (1.0)  60.705 (1.0)    60.705 (1.0)    60.705 (1.0)    60.705 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased]  28.4928 (2.1)    28.4276 (2.1)  28.264 (2.11)  28.5261 (2.1)  28.4379 (2.13)  28.4149 (2.14)  28.2548 (2.15)  28.5519 (2.13)

shape=(32, 33) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased]                      7.4211 (1.0)     7.4207 (1.0)   7.4165 (1.0)  7.4247 (1.0)  7.7488 (1.0)   7.7872 (1.0)   7.7255 (1.0)   8.3273 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased]  4.951 (1.5)      4.9513 (1.5)   4.9477 (1.5)  4.9562 (1.5)  5.0112 (1.55)  5.0182 (1.55)  5.0035 (1.54)  5.1176 (1.63)

shape=(8, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased]                      7.8344 (1.0)     7.8348 (1.0)   7.829 (1.0)    7.8419 (1.0)   8.1249 (1.0)   8.1592 (1.0)   8.1012 (1.0)   8.5067 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased]  4.8204 (1.63)    4.8202 (1.63)  4.8149 (1.63)  4.8251 (1.63)  4.8849 (1.66)  4.9037 (1.66)  4.8748 (1.66)  4.9854 (1.71)

shape=(8, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min           Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased]                      6.8668 (1.0)     6.9322 (1.0)   6.7848 (1.0)   7.2879 (1.0)  7.4401 (1.0)   7.5709 (1.0)   7.3452 (1.0)  8.9539 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased]  1.5163 (4.53)    1.5162 (4.57)  1.5135 (4.48)  1.5192 (4.8)  1.5694 (4.74)  1.5724 (4.81)  1.5638 (4.7)  1.6774 (5.34)

shape=(8, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased]                      17.682 (1.0)     17.681 (1.0)   17.6727 (1.0)  17.6869 (1.0)  17.8151 (1.0)  17.863 (1.0)   17.7961 (1.0)  18.0392 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased]  7.9689 (2.22)    7.9689 (2.22)  7.9604 (2.22)  7.9799 (2.22)  8.0328 (2.22)  8.0464 (2.22)  8.0255 (2.22)  8.1786 (2.21)

shape=(8, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min           Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased]                      6.5898 (1.0)     6.6062 (1.0)   6.5476 (1.0)   6.6883 (1.0)   7.1271 (1.0)   7.2005 (1.0)   7.0563 (1.0)  7.7295 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased]  1.8494 (3.56)    1.8495 (3.57)  1.8473 (3.54)  1.8528 (3.61)  1.9043 (3.74)  1.9088 (3.77)  1.898 (3.72)  2.0123 (3.84)

shape=(8, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median         Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  -------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased]                      26.6523 (1.0)    26.6516 (1.0)   26.6452 (1.0)   26.6573 (1.0)   26.7573 (1.0)  26.8159 (1.0)   26.7286 (1.0)   26.9619 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased]  11.3882 (2.34)   11.3907 (2.34)  11.3846 (2.34)  11.3986 (2.34)  11.455 (2.34)  11.4695 (2.34)  11.4457 (2.34)  11.5579 (2.33)

shape=(8, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median         Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  -------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased]                      40.1021 (1.0)    40.1257 (1.0)   40.1021 (1.0)   40.1493 (1.0)   40.1615 (1.0)  40.2904 (1.0)   40.1615 (1.0)   40.4193 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased]  15.1029 (2.66)   15.1062 (2.66)  15.0979 (2.66)  15.1165 (2.66)  15.188 (2.64)  15.2321 (2.65)  15.1684 (2.65)  15.3538 (2.63)
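
A rough way to compare the two BERT runs, assuming the earlier set of BERT tables in this thread was measured with this PR's changes and the tables in this comment on main. The sketch takes the `dynamo_optimized_cuda_graphs` Median (CUDA) values for a few shapes and prints the relative change. The 1% threshold is an arbitrary choice for illustration, not a project rule.

```python
# Median (CUDA) timings for dynamo_optimized_cuda_graphs, copied from the tables.
# Assumption: first value = earlier run in this thread, second value = main branch run.
medians = {
    "1x256": (1.9217, 1.9245),
    "1x512": (3.0147, 3.0143),
    "32x128": (14.2598, 14.2549),
    "32x33": (4.9207, 4.951),
    "8x512": (15.1022, 15.1029),
}


def relative_change(candidate: float, reference: float) -> float:
    """Signed relative difference of candidate against reference."""
    return (candidate - reference) / reference


changes = {shape: relative_change(c, ref) for shape, (c, ref) in medians.items()}
worst = max(abs(v) for v in changes.values())

for shape, change in changes.items():
    print(f"{shape}: {change:+.2%}")

# Arbitrary illustrative threshold: flag anything that moves by 1% or more.
no_regression = worst < 0.01
```

For these five shapes the largest difference stays under 1%.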

@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@pommedeterresautee
Member

Tests pass:

=========================================================================================================== warnings summary ===========================================================================================================
conftest.py:41
  /mnt/workspace/kernl/conftest.py:41: PytestDeprecationWarning: The hookimpl pytest_configure uses old-style configuration options (marks or attributes).
  Please use the pytest.hookimpl(trylast=True) decorator instead
   to configure the hooks.
   See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
    @pytest.mark.trylast

test/test_debugger.py::test_matmul
  /mnt/workspace/kernl/test/test_debugger.py:172: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
    group_id = pid // num_pid_in_group

test/test_debugger.py::test_matmul
  /mnt/workspace/kernl/test/test_debugger.py:176: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
    pid_n = (pid % num_pid_in_group) // group_size_m

test/test_torchdynamo.py::test_t5
  /home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5_fast.py:156: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
  For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
  - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
  - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
  - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================== 2461 passed, 356 skipped, 4 warnings in 6722.78s (1:52:02) ======================================================================================
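
On the two `__floordiv__` warnings in the log above: the difference between `trunc` and `floor` rounding only shows up for negative operands. The sketch below uses plain Python rather than torch, and the helper names are made up for illustration. Assuming `pid` and `num_pid_in_group` in `test_debugger.py` are never negative, both modes would give the same result there.

```python
import math


def trunc_div(a: int, b: int) -> int:
    """Divide and round toward zero (the behaviour the warning calls 'trunc')."""
    return math.trunc(a / b)


def floor_div(a: int, b: int) -> int:
    """Divide and round toward negative infinity (the behaviour called 'floor')."""
    return a // b


# Non-negative operands: both modes agree.
same_for_positive = trunc_div(7, 2) == floor_div(7, 2)

# Negative operand: the two modes differ by one.
neg_trunc = trunc_div(-7, 2)
neg_floor = floor_div(-7, 2)

print(same_for_positive, neg_trunc, neg_floor)
```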

Member

@pommedeterresautee pommedeterresautee left a comment

Please fix the imports.

@@ -15,6 +15,8 @@

import torch

from src.kernl.implementations.layer_norm import _layer_norm_fwd_fused_single_pass
Member

Remove the `src.` prefix from this import.

@@ -15,6 +15,8 @@

import torch

from src.kernl.optimizer.layer_norm import replace_layer_norm_rms
Member

Remove the `src.` prefix from this import.

Member

@pommedeterresautee pommedeterresautee left a comment

LGTM.
I checked and see a good end-to-end speedup.

@gaetansnl gaetansnl merged commit 1463e39 into main Oct 27, 2022
@gaetansnl gaetansnl deleted the feat/rms-replacement branch October 27, 2022 16:06