chore: bump crate-ci/typos from 1.24.5 to 1.24.6
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crate-ci/typos/compare/v1.24.5...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot[bot] authored and avik-pal committed Sep 23, 2024
1 parent 99fc6ac commit 350b7c7
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
   - name: Checkout Actions Repository
     uses: actions/checkout@v4
   - name: Check spelling
-    uses: crate-ci/typos@v1.24.5
+    uses: crate-ci/typos@v1.24.6

1 comment on commit 350b7c7

@github-actions

LuxLib Benchmarks
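
The Ratio column in the table below appears to be the Current time divided by the Previous time, rounded to two decimals, so a value above 1 means the current commit measured slower. The following is a minimal sketch of that arithmetic, not the benchmark action's actual code; the helper name `ratio` is hypothetical, and returning `None` for rows without a Previous value (the oneAPI rows) is an assumption about how such rows are handled.

```python
from typing import Optional


def ratio(current_ns: float, previous_ns: Optional[float]) -> Optional[float]:
    """Return current/previous rounded to two decimals, or None without a baseline."""
    if previous_ns is None:
        # Assumption: rows with no Previous measurement get no ratio.
        return None
    return round(current_ns / previous_ns, 2)


# Timings (in ns) copied from rows of the table.
print(ratio(7625, 7000))     # 1.09
print(ratio(1500, 3000))     # 0.5
print(ratio(2389684, None))  # None
```

Python prints `0.5` where the table shows `0.50`; only the display formatting differs.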

Benchmark suite Current: 350b7c7 Previous: 99fc6ac Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7625 ns 7000 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7333 ns 5874.5 ns 1.25
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7437 ns 8250 ns 0.90
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5500 ns 5625 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 88183 ns 88896 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2389684 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 405334 ns 400425 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9916.5 ns 9958 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 9708 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9792 ns 9875 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10000 ns 9979.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 383362 ns 370778 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17679354 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 677366 ns 665927 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 2334 ns 1249.5 ns 1.87
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1500 ns 3000 ns 0.50
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1688 ns 1959 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1729.5 ns 1687.5 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 14281 ns 13908 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1297688 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 30200 ns 30060 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4271 ns 3959 ns 1.08
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4458 ns 4291 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 3750 ns 3875 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3917 ns 4375 ns 0.90
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 106099.5 ns 104640 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9298154.5 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 144956.5 ns 145602 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57333 ns 58042 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46750 ns 39708.5 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46250 ns 40084 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83708 ns 82708 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 30588.5 ns 30831 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 572856.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77970 ns 79190 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2018916 ns 2061042 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2087937.5 ns 2079750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087229 ns 2084916 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997063 ns 2001229 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 182309 ns 181552 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7656207 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1482305 ns 1440455 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146584 ns 148042 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 174667 ns 148000 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149333.5 ns 155708 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 178791.5 ns 176313 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167232 ns 168318 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 9038666 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 197432 ns 203247.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1107750.5 ns 1122729.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1114208 ns 1119625 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1117604.5 ns 1125833 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1114000.5 ns 1123854.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 537253 ns 539424 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35616369 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1026475 ns 912000 ns 1.13
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5291 ns 4625 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4645.5 ns 5084 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5541 ns 6125 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4166 ns 4125 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 60281 ns 60787 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5328970.5 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 70560 ns 67560 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8562.5 ns 8500 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8459 ns 8584 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9166.5 ns 8667 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8750 ns 8417 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 414715.5 ns 418528 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 33923657 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387834 ns 384969 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17792 ns 17542 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17708 ns 17542 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21500 ns 20458 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17208.5 ns 18770.5 ns 0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 60282.5 ns 59728.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3008486 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75721 ns 76240 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212333 ns 224208 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212458 ns 219500 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213521 ns 221312.5 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222750 ns 213000 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 291687 ns 293183.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 14295306 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 471954.5 ns 463935 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 667 ns 0.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 916 ns 0.86
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583.5 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 13225 ns 13248 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1210151 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30961 ns 30930 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1541 ns 1459 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1542 ns 1417 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1542 ns 1417 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1416.5 ns 1417 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 92964 ns 92361 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 9171879 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 134891 ns 136232 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7417 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 5333 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 5416 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10375 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 18616 ns 18749 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1243379.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46921 ns 48581 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 263062 ns 231083 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240459 ns 237166.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228792 ns 241042 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 237750 ns 255583 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 154023 ns 154979 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32407548 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 637591 ns 646107 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4083 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4084 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 20561 ns 19985 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2115667 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46550 ns 46780 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17125 ns 16458 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16750 ns 16500 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16958 ns 16625 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16416 ns 16791 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 174545.5 ns 176107 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10156857.5 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 173982 ns 175202 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 509375 ns 511792 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405541 ns 331959 ns 1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 404292 ns 332000 ns 1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 864750 ns 865083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 117562 ns 116899.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 397557 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 240702 ns 241233 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2318458 ns 2275354 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2034500 ns 1753833 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2032084 ns 1758916 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3191167 ns 3193500 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 202548 ns 203284.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11415659 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 739097 ns 738868 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5979.5 ns 7459 ns 0.80
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6312.5 ns 6854.5 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8542 ns 6895.5 ns 1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6542 ns 6459 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 84957.5 ns 84654 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5409712 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 66831 ns 65201 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11937.5 ns 11604 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11541.5 ns 11125 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11604 ns 12083 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10583 ns 12021 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 561493 ns 566453.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 37617116 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 405534 ns 408354 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 20286 ns 20386 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2161771 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 51190 ns 47011 ns 1.09
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2166 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 223022 ns 228468 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10990252.5 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 182361 ns 179272 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8625 ns 8250 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9520.5 ns 8833 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9334 ns 9292 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7917 ns 8875 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 108611 ns 107454 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3137439.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 74611 ns 74891 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18395.5 ns 16812.5 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16917 ns 17750 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18854 ns 19271 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18396 ns 17791.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 518312 ns 534728 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 16860013 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 380934 ns 378084 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 27063 ns 27220 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1178178 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46160 ns 48461 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8500 ns 10021 ns 0.85
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9020.5 ns 9125 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9208.5 ns 9584 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8937.5 ns 9729 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 166677 ns 168737.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18801518 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 371663 ns 367733.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397208 ns 399000 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288208.5 ns 215542 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288000 ns 215541 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756583 ns 756208 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 110755 ns 110802 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 333813 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 75971 ns 76450 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1448374.5 ns 1398875 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133083 ns 858375 ns 1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1131833 ns 861479 ns 1.31
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2357875 ns 2355542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 177520.5 ns 178308 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10029153 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322173 ns 321323 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7291.5 ns 7354 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 7042 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8666 ns 8666.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7208 ns 7563 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 110478 ns 114410.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5505252 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 65640 ns 65791 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 12145.5 ns 13354.5 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14167 ns 13542 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13792 ns 15667 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14729 ns 14979 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 664318.5 ns 689799.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42216111.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 426745 ns 423374 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24770.5 ns 25770.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 28375 ns 25875 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30459 ns 29083 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25729.5 ns 27854 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 167386 ns 168075.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7615563 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 113401 ns 114031 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 151292 ns 118417 ns 1.28
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 151187.5 ns 119041 ns 1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 153583 ns 141458.5 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 143875 ns 155166 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 857621 ns 861211 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44631154 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 587816 ns 582431 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 79833 ns 74666 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 85583.5 ns 75750 ns 1.13
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80437 ns 84875 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73583 ns 77084 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 168427.5 ns 169153 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7736056 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 129412 ns 126942 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 285333 ns 278291 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 300667 ns 305021 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 300791.5 ns 305833 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222625 ns 287270.5 ns 0.77
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 971830 ns 972909 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 41332252 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 696216 ns 695847 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 17000 ns 16917 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 16833 ns 17000 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 17125 ns 18354.5 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16542 ns 16458 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 112981 ns 113778 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5793916 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 231572 ns 231482 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 28083.5 ns 27604.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26500 ns 25875 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28083.5 ns 26958.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27187.5 ns 28166.5 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 696173.5 ns 702837 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41169551 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 689617 ns 696858 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10292 ns 10375 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11333.5 ns 10875 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11750 ns 13625 ns 0.86
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10250 ns 10625 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 111360 ns 112473.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3372766 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235923 ns 236187.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23687.5 ns 21583 ns 1.10
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21375 ns 22396 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22583 ns 22250 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22375 ns 22041 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 554045 ns 556668 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 22400526.5 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 674936 ns 670387 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 63875 ns 65542 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 65292 ns 64437.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 66458 ns 66333 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62667 ns 66167 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96846 ns 96734 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3400257 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 235422 ns 232362 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 437167 ns 437459 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 485500 ns 479417 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 486250 ns 438167 ns 1.11
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442291 ns 498625 ns 0.89
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 440935 ns 442769 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20393573 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 716017 ns 712032 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7208 ns 7562.5 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7250 ns 7625 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8646 ns 8125 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6812.5 ns 7250 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 113059.5 ns 113892.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5983032 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 64461 ns 69331 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 11875 ns 14334 ns 0.83
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13583 ns 14500 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14854.5 ns 16562 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14750 ns 11709 ns 1.26
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 670072 ns 675585.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 40018921 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400084 ns 399579 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6149145.5 ns 6158208 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6373791 ns 3224959 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6369958 ns 3225125 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11914917 ns 11921125 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 348199 ns 347611.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/oneAPI 55221895 ns
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 318854 ns 322793 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19112395.5 ns 19113166.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19954875 ns 11081437.5 ns 1.80
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19933333 ns 11182250 ns 1.78
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36546937.5 ns 36513062 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1032394 ns 1026355 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI 78448314.5 ns
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1157393 ns 1162657.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 1041 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 20220 ns 20341 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2011379 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 207432 ns 206602 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3708 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3667 ns 3666 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3625 ns 3709 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 242662 ns 243936 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11613706.5 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625907 ns 622497 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7229.5 ns 8125 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8208 ns 8145.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9063 ns 10209 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7833 ns 7645.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 110132.5 ns 110001.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3376276 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 72491 ns 64821 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11792 ns 11417 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11708 ns 12146 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12833 ns 12625 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12042 ns 12083 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 533463.5 ns 533401.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22224767.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 357164 ns 351113 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 20014 ns 20031 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2044805 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46611 ns 47010 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3167 ns 2875 ns 1.10
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3083 ns 3125 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2834 ns 3042 ns 0.93
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 168487.5 ns 139419 ns 1.21
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9185467 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 163482 ns 160172 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11500 ns 11708 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11292 ns 11208 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13562.5 ns 12917 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9687.5 ns 11708 ns 0.83
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 110742.5 ns 52993 ns 2.09
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3318937 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 234383 ns 232812 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22041.5 ns 20666.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21312.5 ns 20208 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 22292 ns 22458 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21375.5 ns 21187.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 445412.5 ns 249123.5 ns 1.79
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20307385 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 648033 ns 648996.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4417 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4458 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 21103 ns 20585 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2254531 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 47271 ns 48820 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16542 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16458 ns 16250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16667 ns 16458 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16542 ns 16208 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 292441 ns 169722 ns 1.72
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12584045 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 206702.5 ns 209702 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2041 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 1916 ns 2042 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 27885 ns 28203 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1248055 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 203262 ns 202342 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 16833 ns 17125 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 18250 ns 16791.5 ns 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 18125 ns 17542 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17667 ns 17209 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178504 ns 147741 ns 1.21
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21525405 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 684992 ns 682312 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59104 ns 59062 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65041 ns 62416 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66583.5 ns 61312.5 ns 1.09
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51125 ns 53875 ns 0.95
batchedmm(16, Bsize=512)/forward/GPU/CUDA 71334 ns 71192 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/oneAPI 89279199 ns
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 118362 ns 116711 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 163062.5 ns 202750.5 ns 0.80
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 151271 ns 98750 ns 1.53
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 157250 ns 118104 ns 1.33
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 313146 ns 297958 ns 1.05
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 195169 ns 170047 ns 1.15
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI 151578490.5 ns
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 624817 ns 616606 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82145.5 ns 84208 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82749.5 ns 83646 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86667 ns 85166 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85000 ns 128334 ns 0.66
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 186525 ns 184384 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5756836 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 205352 ns 203702 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1808020.5 ns 1889375 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1915916.5 ns 1916750 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1905270.5 ns 1919083 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1911375 ns 1899041 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 475978 ns 379904 ns 1.25
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27045542 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1069182 ns 1068311 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 18638 ns 18502 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2108817.5 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 42830 ns 41550.5 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1750 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1791 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1791 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 225657 ns 145894.5 ns 1.55
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9833710 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 182527.5 ns 181622 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8375 ns 8458 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9125 ns 8937.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11083 ns 11208.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8041 ns 8875 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 108185.5 ns 51415 ns 2.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3365841 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 232582 ns 232043 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 9125 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 8667 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10458.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9291 ns 9583 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 420282.5 ns 241818.5 ns 1.74
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20467429 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 629687 ns 623402 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57916 ns 58604.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46583 ns 39333 ns 1.18
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46458 ns 39792 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83583 ns 83417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 32500 ns 32658 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1374457 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 72281 ns 79585.5 ns 0.91
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1911000 ns 1931459 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970187.5 ns 1973750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1937771 ns 1980958.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1899667 ns 1884875 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 176646 ns 152863 ns 1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33503348 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1152023 ns 1040311 ns 1.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418084 ns 418333 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 417375 ns 418709 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 427542 ns 422000 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 420250 ns 418583.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 173254.5 ns 94366 ns 1.84
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7736703 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 280773 ns 281763 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 671833.5 ns 673562.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 766666.5 ns 753812.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 684542 ns 769958 ns 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 731041.5 ns 751938 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 887279 ns 470483 ns 1.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46741128 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 905534.5 ns 903129 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3464375 ns 3419645.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3437833 ns 3437875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3397500 ns 3451375 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3449958 ns 3429042 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148014 ns 140481 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8945738 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 441160 ns 441684 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6193666.5 ns 6220250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6178645.5 ns 6224937 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6207958 ns 6214292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6230917 ns 6141041.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 821729 ns 620637 ns 1.32
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51511265 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1636158 ns 1629761.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 473083.5 ns 474958 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 342041.5 ns 253000 ns 1.35
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 341500 ns 253292 ns 1.35
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 902375 ns 901709 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 42882 ns 43146 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 400566 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 241152 ns 241942.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2324750 ns 2271000 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2038541.5 ns 1763792 ns 1.16
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2032354 ns 1760167 ns 1.15
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3197000 ns 3188958 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 202331 ns 200260 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12642725 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 763338 ns 764328 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57520.5 ns 58125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46395.5 ns 39334 ns 1.18
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45959 ns 39750 ns 1.16
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83250 ns 83375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 23227 ns 23268 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1432334 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75651 ns 74721 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029625 ns 2035750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2079979 ns 2088417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2070791 ns 2090333 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000354 ns 1963541 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191580 ns 155158 ns 1.23
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35959863 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1041881.5 ns 1195637.5 ns 0.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57291 ns 58625 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46645.5 ns 39834 ns 1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46625 ns 40083 ns 1.16
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83334 ns 83042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40746 ns 41354 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 810264 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80396 ns 77975.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1890166 ns 1927125 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976042 ns 1971541.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1971667 ns 1976833 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895583 ns 1885312.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198522 ns 164726 ns 1.21
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17337732 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 936080 ns 1051246 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns 291 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 416 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 25307.5 ns 26436 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1259521 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 46650 ns 46511 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6562.5 ns 7333 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6917 ns 6500 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7292 ns 6917 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6834 ns 7834 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 168328.5 ns 132779 ns 1.27
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20601648 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 371864 ns 364088.5 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 30302 ns 30026 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1177600.5 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 37815.5 ns 40500 ns 0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 3542 ns 3250 ns 1.09
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2958 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3250 ns 3042 ns 1.07
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2959 ns 2792 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 169119 ns 139460 ns 1.21
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7614831 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 152811 ns 156362 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 450021 ns 453562 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 441041 ns 426854 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 425041.5 ns 424771 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 422292 ns 454396.5 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 130746.5 ns 128743 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6115924 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 366698.5 ns 374513 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3801375 ns 3812646 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3799958 ns 3818687.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3805000 ns 3824687.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3829062.5 ns 3809020.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 640512 ns 467612 ns 1.37
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35444962 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1468321 ns 1414714 ns 1.04
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49831750 ns 49937813 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35529708 ns 25988125 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35490875 ns 26009646 ns 1.36
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97095125 ns 97113375 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1612269 ns 1610536 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/oneAPI 56680008 ns
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1041171 ns 1049471 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154466500.5 ns 154792729.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112376375 ns 89048958.5 ns 1.26
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112311958 ns 89207416 ns 1.26
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295244375 ns 294786708.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6476168 ns 6494841 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI 174388525 ns
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5549710 ns 5562936 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 16979 ns 18916.5 ns 0.90
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 19562.5 ns 15584 ns 1.26
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17188 ns 14667 ns 1.17
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15020.5 ns 15896 ns 0.94
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 14071 ns 13971 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1254861 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25910 ns 27630 ns 0.94
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10520.5 ns 11291 ns 0.93
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 8709 ns 7458.5 ns 1.17
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 8917 ns 7750 ns 1.15
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17479 ns 17520.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 209068 ns 101782 ns 2.05
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10230351.5 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 148622 ns 148192 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7750 ns 9541.5 ns 0.81
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7854.5 ns 9125.5 ns 0.86
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10334 ns 10333 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7583 ns 8542 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 111568.5 ns 53666.5 ns 2.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3718095.5 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 237553 ns 235372 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11541.5 ns 9541 ns 1.21
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9687.5 ns 10209 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10708 ns 10458 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10709 ns 10250 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 501739 ns 269358 ns 1.86
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23065545 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 655677 ns 652326 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8770.5 ns 9812.5 ns 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9750 ns 9250 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10583 ns 10812.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8750 ns 9562.5 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 53968 ns 53391 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3498205 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72631 ns 71711 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13459 ns 14333 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15479 ns 14083 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 19209 ns 15167 ns 1.27
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 16625 ns 0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 250540 ns 251184.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20620278 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 346043 ns 344093 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 459 ns 458 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 458 ns 583 ns 0.79
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 26861 ns 27208 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1254571 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 204762 ns 203792 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208.5 ns 8625 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9000 ns 8125 ns 1.11
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9125 ns 8604.5 ns 1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8166 ns 8416.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 147122.5 ns 147255 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22634021 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 659287 ns 656126 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15416 ns 16625 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 16625 ns 14500 ns 1.15
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14917 ns 13354 ns 1.12
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 11291 ns 10229 ns 1.10
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 13973 ns 13896.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1108916 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 186562 ns 186472 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32000 ns 31750 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32000 ns 32000 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 31958 ns 32042 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32167 ns 31833 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 109160 ns 110682.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11487029 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 588817 ns 592116 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 492875 ns 450209 ns 1.09
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 442125 ns 445500 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 444958 ns 444167 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 440604 ns 462958 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 188096.5 ns 188096.5 ns 1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5891615 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 369779 ns 367068.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3834584 ns 3834209 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3827292 ns 3836666 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3817250 ns 3847459 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3836104.5 ns 3828250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 382999 ns 383846 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28452071 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1355634 ns 1358354 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 831622791.5 ns 784152667 ns 1.06
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 544951167 ns 416079687.5 ns 1.31
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 544430500 ns 422584917 ns 1.29
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1552948271 ns 1509956229 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22763244.5 ns 22771101.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/oneAPI 185795205 ns
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 15420059 ns 14743999 ns 1.05
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3888050458 ns 2524849666 ns 1.54
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 3211667750 ns 1511960000 ns 2.12
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1819585250 ns 1536159417 ns 1.18
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4769468292 ns 4778947333 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118595684 ns 119521542 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI 1039230192 ns
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88183228 ns 87915389 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75333.5 ns 78208.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77458 ns 80271 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78584 ns 82708 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76292 ns 77334 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 93335 ns 93705 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 6083372 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 120232 ns 118801 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 279333 ns 291334 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 194937.5 ns 210333 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 234771 ns 261874.5 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 194125 ns 202208.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 451188 ns 458544 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46239896 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 657366.5 ns 662017 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199509499.5 ns 200217604 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139162834 ns 103846750 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 138977625 ns 104247042 ns 1.33
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388989959 ns 389363833 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5833602 ns 5840254.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI 79568180.5 ns
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3573358 ns 3591326 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619161479.5 ns 620550500 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 440796833 ns 352840416.5 ns 1.25
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439294646 ns 353679646 ns 1.24
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1189363000 ns 1181355417 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26219564 ns 26562043 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI 283162239 ns
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21927537.5 ns 22008202.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7167 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083.5 ns 5292 ns 1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6208 ns 5458 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 21654 ns 20844 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1302067 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48161 ns 48671 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212583 ns 245770.5 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228396 ns 243083 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222250 ns 221208 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213166.5 ns 207979 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 136607.5 ns 137816.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29564519.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 524845 ns 523805 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9833.5 ns 8334 ns 1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7979 ns 8166.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10250 ns 11041 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7978.5 ns 9020.5 ns 0.88
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 51011 ns 50777 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3317085 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 69811 ns 69381 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7333.5 ns 8875 ns 0.83
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9625 ns 8583 ns 1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 13562.5 ns 8166 ns 1.66
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 10854.5 ns 0.76
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 242632 ns 245858 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19151322 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 316738.5 ns 312998.5 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 416 ns 500 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 708 ns 500 ns 1.42
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 19752 ns 19411 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1203125.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46481 ns 48630 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8625 ns 10333 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10249.5 ns 11375 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9770.5 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10292 ns 9708 ns 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 119755.5 ns 120697 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 25677237 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 388684 ns 388289 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 105959 ns 105500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 98500 ns 85875 ns 1.15
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 101021 ns 87000 ns 1.16
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146271 ns 146333.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 16996 ns 16870 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 756914 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190327 ns 190057 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 478333 ns 478500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 509583 ns 485458 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 478459 ns 481521 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478458.5 ns 478833 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 113991 ns 117100 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12514796 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 604977 ns 608201.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5375 ns 5959 ns 0.90
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5333 ns 6625 ns 0.80
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7208 ns 7479.5 ns 0.96
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6729 ns 6229.5 ns 1.08
batchedmm(16, Bsize=32)/forward/GPU/CUDA 15434 ns 14736 ns 1.05
batchedmm(16, Bsize=32)/forward/GPU/oneAPI 73679048 ns
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79381 ns 79970 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12375 ns 13500 ns 0.92
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11000 ns 9750 ns 1.13
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 10875 ns 10167 ns 1.07
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16625 ns 17125 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 108121 ns 109548 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI 100453387 ns
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 364504 ns 366884 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39375 ns 40458 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51917 ns 50417 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52770.5 ns 51354 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13604 ns 13667 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20011 ns 20278.5 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/oneAPI 79258230 ns
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 85481 ns 85591 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36271 ns 37250 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 35313 ns 29541 ns 1.20
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31291.5 ns 29875 ns 1.05
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57750 ns 57562.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 121997.5 ns 119274.5 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI 113144013 ns
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 410244.5 ns 395964 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1833 ns 0.86
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1750 ns 1667 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2291 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1687.5 ns 2041.5 ns 0.83
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 13818 ns 13524 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1224877 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 32640 ns 32690 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2166 ns 2167 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2292 ns 2145.5 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2417 ns 2395.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2250 ns 2312.5 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 89827 ns 89460.5 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9149897 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136461 ns 136351 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5666.5 ns 6104 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4896 ns 4708.5 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6333.5 ns 6187.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 5874.5 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 59437.5 ns 58659.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5810721.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 68755.5 ns 67281 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8167 ns 9083.5 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8542 ns 9000 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8709 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9042 ns 8750 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 383098.5 ns 386636 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 38586019 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387674 ns 384884 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56708 ns 56916 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57666 ns 56833 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57625 ns 56958 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58250 ns 58291 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 30235 ns 29539 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1254024 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 204092 ns 203102.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448000 ns 453791.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 472083.5 ns 466875 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 465125 ns 465666.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 436541.5 ns 436208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170026 ns 167893 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28109365.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 826388 ns 823238 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3312500 ns 3327646 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2340084 ns 1773958 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2339583.5 ns 1770208 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6318792 ns 6318167 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204725 ns 203665 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/oneAPI 83409682 ns
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 240632 ns 213597.5 ns 1.13
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11441604 ns 11522375 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8301208 ns 6550792 ns 1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8329792 ns 6579708.5 ns 1.27
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21184729.5 ns 21256687.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 760406.5 ns 761872 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI 125395684.5 ns
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1063686 ns 1057191 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5666 ns 6667 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5604.5 ns 4917 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6438 ns 7000 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6312.5 ns 5166 ns 1.22
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 57453 ns 57961.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5296827 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56241 ns 56041 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 11458 ns 0.62
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 8750 ns 0.84
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7250 ns 7541 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 8625 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 367190 ns 382208 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 35508394 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 362159 ns 361754 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 140708 ns 126917 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 123917 ns 102541 ns 1.21
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 100667 ns 101792 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 104958 ns 98333 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127546.5 ns 127201 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6179687.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 206197 ns 206327 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1992625 ns 2039750.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2016083.5 ns 2028645.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2019875 ns 2040937.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026687.5 ns 1948458 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 432468 ns 443232 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33529611.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1184812.5 ns 1211817 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32208.5 ns 33542 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 37167 ns 34416 ns 1.08
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35833 ns 34583 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchedmm(2, Bsize=4)/forward/GPU/CUDA 13995 ns 13510 ns 1.04
batchedmm(2, Bsize=4)/forward/GPU/oneAPI 74212471 ns
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 79370 ns 79871 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2645.5 ns 3750 ns 0.71
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2750 ns 3209 ns 0.86
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3020.5 ns 3041 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2333 ns 2333 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 92140 ns 89708.5 ns 1.03
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI 94219800 ns
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 341683 ns 340203 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7209 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5292 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 5417 ns 1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29283 ns 29375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1222925.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48131 ns 49300 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 248271 ns 222374.5 ns 1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221125 ns 221270.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221042 ns 221458 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216625 ns 206500 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 158765.5 ns 159760 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26714332 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 569975.5 ns 572920.5 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3959 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 4000 ns 3917 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 18821 ns 18490 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2189549 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 41970 ns 43450 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14958 ns 14667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15125 ns 14666 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14917 ns 14709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14666 ns 14708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 163909.5 ns 165588 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11534435 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 192582 ns 197842 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146416 ns 130708 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 103750 ns 101313 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 103791 ns 105000.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 100208 ns 106666.5 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127104 ns 125911 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6092161.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 207422.5 ns 204662 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1791000 ns 1925042 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1909958 ns 1928041 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1910875 ns 1930583 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922250 ns 1855291 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 418877 ns 429902 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29586225 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1089381 ns 1148786.5 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17291 ns 18166 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22583 ns 18979 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21062.5 ns 22458 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17417 ns 18125 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 61422.5 ns 63187.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3492174 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 80420 ns 79155.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216354.5 ns 252792 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 256145.5 ns 261875 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216500 ns 219958 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219146 ns 217125 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 272581 ns 279978 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19535498.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477435 ns 475684 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 26813 ns 24729.5 ns 1.08
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 31333 ns 28125 ns 1.11
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28812.5 ns 27000 ns 1.07
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1312 ns 1375 ns 0.95
batchedmm(16, Bsize=4)/forward/GPU/CUDA 14764 ns 13843 ns 1.07
batchedmm(16, Bsize=4)/forward/GPU/oneAPI 75193108 ns
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 81295.5 ns 81051 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5000 ns 5479.5 ns 0.91
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4833.5 ns 5167 ns 0.94
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5083.5 ns 5270.5 ns 0.96
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4854 ns 4708 ns 1.03
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 110219 ns 110586.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI 96235287 ns
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 380423.5 ns 379244 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 304917 ns 308792 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 306417 ns 305625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 307500 ns 307291 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306312 ns 306834 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96559 ns 102299 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8040746 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 273553 ns 272803 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 534959 ns 544417 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 578875 ns 575000 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 532250 ns 545958.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 532292 ns 538167 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 478700.5 ns 500049 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45273096.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 854594 ns 849309 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18833 ns 22000 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21500 ns 21083 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21500 ns 22042 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18729 ns 19667 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 61059 ns 64471.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3648054 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79701 ns 78011 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225292 ns 226000 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215459 ns 245604 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214416 ns 215584 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215625 ns 212791 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 315781.5 ns 344357 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25685453.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 536640.5 ns 535535 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 7542 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6729 ns 5791.5 ns 1.16
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7875.5 ns 8416 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6187 ns 7167 ns 0.86
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 59473 ns 63232 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5742399 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65660 ns 65391 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9875 ns 13667 ns 0.72
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10541.5 ns 11916 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10542 ns 10125 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11395.5 ns 10041 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 375474.5 ns 396144.5 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37560344 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 385404 ns 386814 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4958 ns 6541.5 ns 0.76
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5792 ns 4666 ns 1.24
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6937 ns 6500 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4813 ns 7042 ns 0.68
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 59336 ns 64824 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5881412.5 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 66901 ns 68750 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7333 ns 8083 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 8166 ns 0.89
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7708 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7917 ns 7583 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 400389 ns 423945 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 41438719.5 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 390804 ns 394914 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14514708 ns 14516708 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10142334 ns 7713187.5 ns 1.31
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10128041 ns 7704854 ns 1.31
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27891250 ns 27801334 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532579.5 ns 531151.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI 99192089 ns
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 394344 ns 393889 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46256625 ns 46558771.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33475978.5 ns 26529584 ns 1.26
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33502666 ns 26598312 ns 1.26
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85530791 ns 85686792 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 3411776.5 ns 3208907 ns 1.06
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI 197868624 ns
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3281874 ns 3300533 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66792 ns 67833 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67791 ns 65625 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 69583 ns 69333.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66542 ns 67292 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 63585 ns 68650 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3635639 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 238943 ns 232393 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 482792 ns 450333 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 490208.5 ns 453834 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 443416 ns 446417 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 443250 ns 441584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 333625.5 ns 394734 ns 0.85
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27824814.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 796928.5 ns 788457.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 26261 ns 26112 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1201042 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 46591 ns 47140 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9624.5 ns 10542 ns 0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 9583 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9042 ns 9250 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 15042 ns 10708 ns 1.40
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 154087.5 ns 152524.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 22365155 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 376374 ns 373324 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9792 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9875 ns 9792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9834 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9792 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 21245 ns 20835 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2109275 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 207407 ns 208092 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45834 ns 46333 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46083 ns 45833 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 48125 ns 46000 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46000 ns 45959 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 181985 ns 189222 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12501764 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 599026 ns 603691 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56292 ns 56334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57208 ns 56375 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57083 ns 56458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57875 ns 57875 ns 1
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 22599 ns 21828 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1231700.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 210062.5 ns 202032 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 491041.5 ns 464834 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 503250 ns 474250.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 465875 ns 465771 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 440959 ns 434770.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 153666 ns 162400 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33436886 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 880644 ns 877129 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 646396 ns 651104.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 656479 ns 683542 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 592854.5 ns 656292 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 616145.5 ns 616541.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128259.5 ns 140209 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8403444.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 302363 ns 305778 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2232145.5 ns 2262562.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2230708 ns 2231521 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2231875 ns 2245125 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2259375 ns 2244604.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 617840 ns 644538 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50658009 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1318863 ns 1307248 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22542 ns 21625 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19458 ns 20833 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22583 ns 23208 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19458 ns 20125 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 64266 ns 69407.5 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3671624.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79151 ns 78811 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224000 ns 233042 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 254083 ns 233125 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221708 ns 221333 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220750 ns 224875 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347077 ns 410361 ns 0.85
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25817148 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 554175.5 ns 557581 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 18626 ns 18190 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1230354 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 48171 ns 47870 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8917 ns 9812.5 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9875 ns 9250 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9812.5 ns 10042 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9270.5 ns 10000 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 136039.5 ns 136633 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 29131754 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 399044 ns 397114 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8042 ns 8958 ns 0.90
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9312.5 ns 8438 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11292 ns 10750 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8334 ns 8084 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 57821.5 ns 64696.5 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3436196.5 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69710.5 ns 71891 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 7666 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7791 ns 7250 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 8417 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 7708 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 255977.5 ns 292900 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 18497059 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 319743 ns 318078 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1375 ns 1542 ns 0.89
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1645.5 ns 1458 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1917 ns 2208 ns 0.87
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1500 ns 1708 ns 0.88
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 13693.5 ns 13397 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1186814 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 188882 ns 188372 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3291 ns 3312.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3479.5 ns 3375 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3625 ns 3667 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3291 ns 3375 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 104102.5 ns 117821 ns 0.88
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10382640 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 575736 ns 578906 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 146979 ns 147437.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 129042 ns 106312.5 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129875 ns 107750 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 226000 ns 226021 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 17312 ns 16777 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1216751.5 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 39935.5 ns 40540 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 159771 ns 163417 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 110521 ns 106833 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 136250 ns 98125 ns 1.39
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 251666.5 ns 251458 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 118480.5 ns 141681 ns 0.84
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10669966 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 265838 ns 266553 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7292 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 5333 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5375 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10209 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26774 ns 26669.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1208039.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48681 ns 48681 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219937.5 ns 256208 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227375 ns 258709 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228667 ns 231395.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212729.5 ns 224896 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 177762.5 ns 185868.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28372083 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 589856 ns 589590.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15958 ns 16125 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 16208.5 ns 14750 ns 1.10
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16687.5 ns 17000 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14792 ns 15375 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 63622 ns 76403.5 ns 0.83
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5760147.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 227543 ns 230202 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23916 ns 24416 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24500 ns 23708 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23458 ns 23792 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23000 ns 23417 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 431176 ns 496390.5 ns 0.87
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 42796325.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 675657 ns 676296.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9167 ns 10334 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9834 ns 9375 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11021 ns 11666.5 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8729.5 ns 9292 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 64004 ns 81566 ns 0.78
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3525023 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73491 ns 72771 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14292 ns 14333 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13729 ns 13666.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14208 ns 14729.5 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13459 ns 14750 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 323073 ns 412717 ns 0.78
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21480471 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 371404 ns 362433 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8083 ns 8917 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10416.5 ns 9750 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10937.5 ns 11896 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9333 ns 9542 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 66250 ns 84716 ns 0.78
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3712952 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 74871 ns 71721 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12708 ns 13250 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13020.5 ns 12521 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13333.5 ns 13542 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12417 ns 12875 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 286792 ns 346105 ns 0.83
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19725639 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 340593.5 ns 338603.5 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29166 ns 31041.5 ns 0.94
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34604 ns 32438 ns 1.07
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 32229.5 ns 29625 ns 1.09
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1750 ns 2167 ns 0.81
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15001 ns 14504 ns 1.03
batchedmm(2, Bsize=128)/forward/GPU/oneAPI 78965877 ns
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 86890 ns 80601 ns 1.08
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5125 ns 5250 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5062.5 ns 4750 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5167 ns 5208 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6292 ns 6541 ns 0.96
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 99425.5 ns 107471 ns 0.93
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI 110379934 ns
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 383544 ns 370164 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 291 ns 1.15
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 19905 ns 18911 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1150337 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48921 ns 46920 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6292 ns 6542 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6292 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6958.5 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6208 ns 6708 ns 0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 127212 ns 135126 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 23911059 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 394834 ns 386254 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 2000 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2041 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2042 ns 2083 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 2042 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 20510 ns 20048 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1241051 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 210527 ns 204122 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16896 ns 16937.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17125 ns 17042 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 17000 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15875 ns 15875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 143434 ns 151188.5 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25814251 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 704697 ns 698796.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 174541 ns 150292 ns 1.16
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147000 ns 188375 ns 0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 152688 ns 152834 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150916 ns 152750 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165022.5 ns 169794 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7825451.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 226202.5 ns 225092 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1318770.5 ns 1328166 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1320292 ns 1339625 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1329500 ns 1339979 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1333791.5 ns 1321375 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 605687 ns 738732.5 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46439481 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1061786 ns 1067311 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25084 ns 26042 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25042 ns 25313 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27854 ns 28208 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24749.5 ns 25750 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 119756.5 ns 179072 ns 0.67
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7727314 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 116991 ns 113981 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 131479 ns 181083.5 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 171708 ns 169917 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 127521 ns 118875 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117479 ns 125563 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 551436.5 ns 736737.5 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45901726 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 610436 ns 606996 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 17730.5 ns 17782 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1203450 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48751 ns 47020 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6416.5 ns 6917 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6500 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 7270.5 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns 6958 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 136341 ns 149426 ns 0.91
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25006525.5 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393949 ns 389994 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6666 ns 6209 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6625 ns 5708 ns 1.16
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6917 ns 7666 ns 0.90
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5666 ns 5958 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 72501 ns 100369 ns 0.72
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5996889 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233483 ns 231643 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 10083 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10062.5 ns 9666.5 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10333 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9875 ns 10125 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 493431 ns 656519 ns 0.75
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41270237 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 675326 ns 676037 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 666 ns 708 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 20029 ns 20098 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2098576 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 207872.5 ns 205502 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4542 ns 4667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4791 ns 4875 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4709 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 167220.5 ns 183686.5 ns 0.91
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9409031.5 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 577916 ns 577406 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7854 ns 8062 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8875 ns 8083 ns 1.10
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9750 ns 10062 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7334 ns 7979.5 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 72767.5 ns 112521 ns 0.65
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3713250 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 77435.5 ns 75781 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8167 ns 8625 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8354.5 ns 8750 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9041 ns 9459 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns 8959 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 372607.5 ns 542270 ns 0.69
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21133871 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 345814 ns 339298.5 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 125395.5 ns 126979.5 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129042 ns 100291 ns 1.29
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129959 ns 97208 ns 1.34
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180916 ns 180729.5 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 44539 ns 44342 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/oneAPI 75228887 ns
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 100291 ns 101011 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 310917 ns 340250 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 313833 ns 192146 ns 1.63
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 324083.5 ns 167166 ns 1.94
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 600354 ns 573958.5 ns 1.05
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 150808 ns 199334 ns 0.76
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI 91943409 ns
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 502450 ns 515465 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396750 ns 399208 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288145.5 ns 215250 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287583 ns 215625 ns 1.33
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756625 ns 756875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 40964 ns 40054 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1391370 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 81511 ns 80551 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1449583.5 ns 1406459 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1136667 ns 862312 ns 1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1134771 ns 864000 ns 1.31
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2361041.5 ns 2359542 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 207930 ns 234952 ns 0.88
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10356148 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 355144 ns 353324 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 647292 ns 659917 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 578500 ns 658270.5 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 639416 ns 624271 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 656333 ns 677791.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 154081 ns 196665.5 ns 0.78
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8771052.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 306073.5 ns 305543 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2453625 ns 2481875 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2424291 ns 2467479.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2442542 ns 2476313 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2470583 ns 2446833 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 767152.5 ns 984615.5 ns 0.78
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 52532777 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1399204.5 ns 1399689 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32604 ns 34062.5 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 36937.5 ns 34666.5 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34542 ns 32791.5 ns 1.05
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 917 ns 958 ns 0.96
batchedmm(2, Bsize=32)/forward/GPU/CUDA 14042 ns 14044 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/oneAPI 78232811.5 ns
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 79530 ns 84401 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3000 ns 3166.5 ns 0.95
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3084 ns 3166 ns 0.97
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3333 ns 3500 ns 0.95
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3000 ns 3250 ns 0.92
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 100283 ns 121484 ns 0.83
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI 96545751 ns
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 337334 ns 362074 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 405958 ns 406584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408209 ns 402458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 407958 ns 403000 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421459 ns 420645.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36148.5 ns 36583 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1554049.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 238757.5 ns 238852 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3868375 ns 3879583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3988562.5 ns 3983541.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3992667 ns 3998250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3776708.5 ns 3674250 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 193888 ns 237279.5 ns 0.82
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37305285.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1433244 ns 1428125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 32924.5 ns 32312 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1242082 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 37990 ns 38101 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15708 ns 15459 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15750 ns 15459 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15958 ns 15666 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15500 ns 15458 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 189381 ns 242437.5 ns 0.78
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9458441 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 169642 ns 167902 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404708 ns 404458 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295750 ns 221625 ns 1.33
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295958 ns 221375 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 761125 ns 760125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 117898 ns 117928 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1045095 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 87241 ns 87841 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1478500 ns 1429417 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1159645.5 ns 887583 ns 1.31
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1158042 ns 887396 ns 1.30
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2384583 ns 2378208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 189114 ns 192870.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 9516529.5 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 351188 ns 353053 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 458 ns 1.27
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 18799 ns 19335 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1188091 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 205912 ns 205012 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7500 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7583 ns 7250 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7917 ns 8166 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 8167 ns 0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 141427.5 ns 165885 ns 0.85
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26708173 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 683937 ns 683986 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 833042 ns 832729.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 621208 ns 467000 ns 1.33
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621791 ns 469250 ns 1.33
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1550917 ns 1575625 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 134056.5 ns 129567 ns 1.03
batchedmm(128, Bsize=32)/forward/GPU/oneAPI 77301649 ns
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 227902 ns 227872 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2692167 ns 2691958.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1995500 ns 1537333 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2004812.5 ns 1540083.5 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4935000 ns 4938125 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 249781 ns 274489 ns 0.91
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI 100408887.5 ns
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 840633.5 ns 806443 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 25355 ns 25740 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1311449 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 46990 ns 47540 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6209 ns 6625 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6125 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6791 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6667 ns 0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 157347 ns 180518.5 ns 0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21691879 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 365484 ns 359293 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2366042 ns 2375375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2395500 ns 2422500 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2374083 ns 2407959 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2382167 ns 2370375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 170643 ns 178233.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8487051 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 374764 ns 374734 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4646208 ns 4668709 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4643687 ns 4652084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4660250 ns 4665083.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4569374.5 ns 4600917 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 714837 ns 872920 ns 0.82
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50175411.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1351724 ns 1382244 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7208 ns 9437.5 ns 0.76
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7084 ns 7833 ns 0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7208 ns 7292 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6833 ns 6834 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 16063.5 ns 16361 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1173405 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 39030 ns 39440 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 63792 ns 74520.5 ns 0.86
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32833 ns 49250 ns 0.67
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 45917 ns 51729 ns 0.89
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 45229.5 ns 49083.5 ns 0.92
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 163785 ns 212837 ns 0.77
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10469728.5 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 262653 ns 266233 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20584 ns 22250 ns 0.93
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26208 ns 25000 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 23542 ns 21854.5 ns 1.08
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5125 ns 5375 ns 0.95
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16017 ns 15953 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/oneAPI 90340662 ns
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 84110 ns 83861 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11791 ns 11834 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10229.5 ns 9187.5 ns 1.11
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10625 ns 9520.5 ns 1.12
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 17895.5 ns 18354.5 ns 0.97
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 159148 ns 203711.5 ns 0.78
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI 149555538 ns
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 367264 ns 388864 ns 0.94
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406500 ns 406375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297583 ns 223500 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 297250 ns 223792 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762791 ns 762958 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 43249 ns 43379 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1362482 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 87411 ns 89781 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1484125.5 ns 1427542 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1167542 ns 892959 ns 1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1161667 ns 892958 ns 1.30
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2387604.5 ns 2385625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213476 ns 239711 ns 0.89
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 13925589 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 377604 ns 376923.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433583 ns 434375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436917 ns 430000 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436666 ns 430417 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448291 ns 448375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 45983 ns 46179 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1048211.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 234192 ns 235662 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3894625 ns 3912500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4022709 ns 4004000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4024624.5 ns 4025375.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3801916.5 ns 3768792 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 210260 ns 251012 ns 0.84
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32692776 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1361238 ns 1368994 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8709 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7667 ns 6875 ns 1.12
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7667 ns 6917 ns 1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12375 ns 12458 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 20402 ns 20602 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2188548.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210772 ns 209952 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45041 ns 44958 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45208 ns 45083 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45208 ns 45250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44708 ns 44750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 253192 ns 314279 ns 0.81
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 14008146.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 653917 ns 653907 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82979 ns 115896 ns 0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 126104.5 ns 125812.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86229.5 ns 126604.5 ns 0.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84875 ns 89000 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 184626 ns 186375.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6066708 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 219662 ns 219802 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2017833 ns 2026583 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2016000 ns 2025000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2006375 ns 2024729.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2025083 ns 2026520.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 496955.5 ns 566645 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27423881 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1040810 ns 1084851 ns 0.96

This comment was automatically generated by workflow using github-action-benchmark.
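
The Ratio column in the table above appears to be the current timing divided by the previous one, so a value above 1 means the current commit ran slower. The sketch below reproduces a few rows under that assumption. The helper names `ratio` and `format_ratio` are hypothetical and are not taken from github-action-benchmark; the formatting rule (bare integer when the quotient is whole, two decimals otherwise, blank when there is no previous value) is inferred from the rows shown here, not from the tool's source.

```python
from typing import Optional


def ratio(current_ns: float, previous_ns: Optional[float]) -> Optional[float]:
    """Current / previous timing; None when no previous value was recorded."""
    if previous_ns is None:
        return None
    return current_ns / previous_ns


def format_ratio(r: Optional[float]) -> str:
    """Format as seen in the table: '' if missing, '1' if whole, else 2 decimals."""
    if r is None:
        return ""
    if r == int(r):
        return str(int(r))
    return f"{r:.2f}"


# (current ns, previous ns) pairs copied from rows in the table above.
rows = {
    "dense(512, bias=true, act=identity)/zygote/GPU/CUDA": (207930, 234952),
    "dense(512, bias=true, act=identity)/zygote/GPU/oneAPI": (10356148, None),
    "dense(32, bias=false, act=identity)/forward/CPU/2 thread(s)": (3917, 3917),
    "dense(512, bias=false, act=relu)/forward/CPU/4 thread(s)": (295750, 221625),
}

for name, (current, previous) in rows.items():
    print(f"{name}: {format_ratio(ratio(current, previous))!r}")
```

Read this way, the 4- and 8-thread CPU rows for `dense(512, ...)` and `batchedmm(128, Bsize=32)` (ratios around 1.30 to 1.34) are the largest slowdowns in this part of the table, while the oneAPI rows have no previous value and therefore no ratio.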
