
Add numpy benchmarks to fft and blas. #80

Merged

pavanky merged 3 commits into arrayfire:devel from drufat:numpy on May 3, 2016

Conversation

@drufat (Contributor) commented May 3, 2016

Compares the following functions:

| arrayfire | numpy       |
|-----------|-------------|
| af.matmul | np.dot      |
| af.fft2   | np.fft.fft2 |

Also switches to timeit for benchmark measurements. timeit disables Python's garbage collector during measurements, so the results may be more accurate.
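For reference, a timeit-based measurement along these lines might look like the sketch below. The workload here is a pure-Python stand-in, not the actual af.matmul / np.dot calls from the benchmark scripts:

```python
import timeit

def workload():
    # Stand-in for the operation being benchmarked (e.g. np.dot or af.matmul).
    return sum(i * i for i in range(10_000))

# timeit disables the garbage collector while timing, which reduces noise.
# Run the workload 100 times per repeat and take the best of 5 repeats,
# following the usual timeit convention of reporting the minimum.
times = timeit.repeat(workload, number=100, repeat=5)
best = min(times) / 100  # seconds per single call
print(f"best time per call: {best:.6f} s")
```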

@pavanky (Member) commented May 3, 2016

@drufat calling af.sync() for each run on the GPU is going to slow it down significantly. In real-world code you only ever call af.sync() occasionally.

The proper way to benchmark would be to call the functions inside `run` for a certain number of iterations and then call af.sync() once at the end.
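The pattern described here, queue many iterations and synchronize once at the end, can be sketched as follows. The `run` and `sync` functions are stand-ins: on a real GPU, kernel launches like af.matmul are asynchronous and af.sync() blocks until all queued work has finished.

```python
import time

def run(n):
    # Stand-in for an asynchronous GPU kernel launch (e.g. af.matmul).
    return sum(i for i in range(n))

def sync():
    # Stand-in for af.sync(), which blocks until all queued GPU work completes.
    pass

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    run(1_000)
sync()  # synchronize once, outside the loop, so its overhead is not paid per call
elapsed = (time.perf_counter() - start) / iterations
print(f"average time per iteration: {elapsed:.6f} s")
```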

@pavanky (Member) commented May 3, 2016

Can you also change bench_device to bench_arrayfire because it makes more sense this way.

@drufat (Contributor, Author) commented May 3, 2016

@pavanky My understanding is that the purpose of running multiple iterations in a benchmark is to reduce the variability in the result; the actual computation should not be modified, and the comparison should still be based on the time it takes for a single run to complete. If af.sync() is required to complete the computation for a single run, then it should be part of the benchmark. Otherwise you would be unfairly favoring arrayfire by amortizing some of its time over all the iterations.

Here is a comparison of the run times on my computer for the old version and the new version.

old bench_blas.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N matrix multiply
Time taken for  128 x  128: 6.0806 Gflops
Time taken for  256 x  256: 1149.6282 Gflops
Time taken for  384 x  384: 1423.8280 Gflops
Time taken for  512 x  512: 2057.0017 Gflops
Time taken for  640 x  640: 1838.6790 Gflops
Time taken for  768 x  768: 1910.5981 Gflops
Time taken for  896 x  896: 2059.2579 Gflops
Time taken for 1024 x 1024: 2405.0432 Gflops
Time taken for 1152 x 1152: 2275.2714 Gflops
Time taken for 1280 x 1280: 2331.1159 Gflops
Time taken for 1408 x 1408: 2382.6808 Gflops
Time taken for 1536 x 1536: 2511.4524 Gflops
Time taken for 1664 x 1664: 2439.3697 Gflops
Time taken for 1792 x 1792: 2468.1887 Gflops
Time taken for 1920 x 1920: 2469.8352 Gflops
Time taken for 2048 x 2048: 2563.9184 Gflops
```

new bench_blas.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N matrix multiply on arrayfire
Time taken for  128 x  128: 6.0006 Gflops
Time taken for  256 x  256: 609.4753 Gflops
Time taken for  384 x  384: 947.4411 Gflops
Time taken for  512 x  512: 1546.0062 Gflops
Time taken for  640 x  640: 1759.6338 Gflops
Time taken for  768 x  768: 1812.2530 Gflops
Time taken for  896 x  896: 1928.0725 Gflops
Time taken for 1024 x 1024: 2335.1287 Gflops
Time taken for 1152 x 1152: 2193.2056 Gflops
Time taken for 1280 x 1280: 2295.2198 Gflops
Time taken for 1408 x 1408: 2338.1797 Gflops
Time taken for 1536 x 1536: 2501.7196 Gflops
Time taken for 1664 x 1664: 2403.7006 Gflops
Time taken for 1792 x 1792: 2450.4298 Gflops
Time taken for 1920 x 1920: 2455.5042 Gflops
Time taken for 2048 x 2048: 2545.7770 Gflops
Benchmark N x N matrix multiply on numpy
Time taken for  128 x  128: 3.9541 Gflops
Time taken for  256 x  256: 4.1893 Gflops
Time taken for  384 x  384: 4.2883 Gflops
Time taken for  512 x  512: 4.3267 Gflops
```

@pavanky (Member) commented May 3, 2016

@drufat You don't need to call af.sync() in every iteration. It is only required once, at the end of a benchmark.

If you call it in every iteration, it increases the overhead. For blas the overhead is smaller (about 5-10% at smaller sizes), but it is going to be higher for FFT.

@drufat (Contributor, Author) commented May 3, 2016

Here are the benchmarks for the fft

old bench_fft.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N 2D fft
Time taken for  128 x  128: 0.4803 Gflops
Time taken for  256 x  256: 79.5998 Gflops
Time taken for  512 x  512: 95.8458 Gflops
Time taken for 1024 x 1024: 119.8499 Gflops
Time taken for 2048 x 2048: 134.8493 Gflops
Time taken for 4096 x 4096: 147.1597 Gflops
```

new bench_fft.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N 2D fft on arrayfire
Time taken for  128 x  128: 0.1194 Gflops
Time taken for  256 x  256: 68.5797 Gflops
Time taken for  512 x  512: 82.1921 Gflops
Time taken for 1024 x 1024: 117.4545 Gflops
Time taken for 2048 x 2048: 134.9769 Gflops
Time taken for 4096 x 4096: 147.8535 Gflops
Benchmark N x N 2D fft on numpy
Time taken for  128 x  128: 3.4478 Gflops
Time taken for  256 x  256: 3.9301 Gflops
Time taken for  512 x  512: 2.9090 Gflops
```

@pavanky (Member) commented May 3, 2016

@drufat It's still significantly slower at smaller sizes. Besides, always calling sync gives the wrong impression about how things should be timed on the GPU.

@drufat (Contributor, Author) commented May 3, 2016

Alright. I switched the code back to the time module. af.sync() is now called only once, as requested.

@pavanky pavanky merged commit ed52b90 into arrayfire:devel May 3, 2016
