
Add numpy benchmarks to fft and blas. #80

Merged

pavanky merged 3 commits into arrayfire:devel from drufat:numpy on May 3, 2016

Conversation

@drufat (Contributor) commented May 3, 2016

Compares the following functions:

| arrayfire | numpy       |
|-----------|-------------|
| af.matmul | np.dot      |
| af.fft2   | np.fft.fft2 |

Also switches to timeit for benchmark measurements. timeit disables Python's garbage collector during measurements, so the results may be more accurate.
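For reference, a timeit-based measurement along these lines might look like the sketch below. The workload here is a pure-Python stand-in, not the actual af.matmul / np.dot calls from the benchmark scripts:

```python
import timeit

def workload():
    # Stand-in for the operation being benchmarked (e.g. np.dot or af.matmul).
    return sum(i * i for i in range(10_000))

# timeit disables the garbage collector while timing, which reduces noise.
# Run the workload 100 times per repeat and take the best of 5 repeats,
# following the usual timeit convention of reporting the minimum.
times = timeit.repeat(workload, number=100, repeat=5)
best = min(times) / 100  # seconds per single call
print(f"best time per call: {best:.6f} s")
```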

@pavanky (Member) commented May 3, 2016

@drufat calling af.sync() for each run on the GPU is going to slow it down significantly. In real-world code you only ever call af.sync() occasionally.

The proper way to benchmark would be to call the functions inside `run` for a certain number of iterations and then call af.sync() once at the end.
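The pattern described here, queue many iterations and synchronize once at the end, can be sketched as follows. The `run` and `sync` functions are stand-ins: on a real GPU, kernel launches like af.matmul are asynchronous and af.sync() blocks until all queued work has finished.

```python
import time

def run(n):
    # Stand-in for an asynchronous GPU kernel launch (e.g. af.matmul).
    return sum(i for i in range(n))

def sync():
    # Stand-in for af.sync(), which blocks until all queued GPU work completes.
    pass

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    run(1_000)
sync()  # synchronize once, outside the loop, so its overhead is not paid per call
elapsed = (time.perf_counter() - start) / iterations
print(f"average time per iteration: {elapsed:.6f} s")
```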

@pavanky (Member) commented May 3, 2016

Can you also change bench_device to bench_arrayfire because it makes more sense this way.

@drufat (Contributor, Author) commented May 3, 2016

@pavanky My understanding is that the purpose of running multiple iterations in a benchmark is to reduce the variability in the result; the actual computation should not be modified, and the comparison should still be based on the time it takes for a single run to complete. If af.sync() is required to complete the computation for a single run, then it should be part of the benchmark. Otherwise you would be unfairly favoring arrayfire by amortizing some of its time over all the iterations.

Here is a comparison of the run times on my computer for the old version and the new version.

old bench_blas.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N matrix multiply
Time taken for  128 x  128: 6.0806 Gflops
Time taken for  256 x  256: 1149.6282 Gflops
Time taken for  384 x  384: 1423.8280 Gflops
Time taken for  512 x  512: 2057.0017 Gflops
Time taken for  640 x  640: 1838.6790 Gflops
Time taken for  768 x  768: 1910.5981 Gflops
Time taken for  896 x  896: 2059.2579 Gflops
Time taken for 1024 x 1024: 2405.0432 Gflops
Time taken for 1152 x 1152: 2275.2714 Gflops
Time taken for 1280 x 1280: 2331.1159 Gflops
Time taken for 1408 x 1408: 2382.6808 Gflops
Time taken for 1536 x 1536: 2511.4524 Gflops
Time taken for 1664 x 1664: 2439.3697 Gflops
Time taken for 1792 x 1792: 2468.1887 Gflops
Time taken for 1920 x 1920: 2469.8352 Gflops
Time taken for 2048 x 2048: 2563.9184 Gflops
```

new bench_blas.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N matrix multiply on arrayfire
Time taken for  128 x  128: 6.0006 Gflops
Time taken for  256 x  256: 609.4753 Gflops
Time taken for  384 x  384: 947.4411 Gflops
Time taken for  512 x  512: 1546.0062 Gflops
Time taken for  640 x  640: 1759.6338 Gflops
Time taken for  768 x  768: 1812.2530 Gflops
Time taken for  896 x  896: 1928.0725 Gflops
Time taken for 1024 x 1024: 2335.1287 Gflops
Time taken for 1152 x 1152: 2193.2056 Gflops
Time taken for 1280 x 1280: 2295.2198 Gflops
Time taken for 1408 x 1408: 2338.1797 Gflops
Time taken for 1536 x 1536: 2501.7196 Gflops
Time taken for 1664 x 1664: 2403.7006 Gflops
Time taken for 1792 x 1792: 2450.4298 Gflops
Time taken for 1920 x 1920: 2455.5042 Gflops
Time taken for 2048 x 2048: 2545.7770 Gflops
Benchmark N x N matrix multiply on numpy
Time taken for  128 x  128: 3.9541 Gflops
Time taken for  256 x  256: 4.1893 Gflops
Time taken for  384 x  384: 4.2883 Gflops
Time taken for  512 x  512: 4.3267 Gflops
```

@pavanky (Member) commented May 3, 2016

@drufat You don't need to call af.sync() in every iteration. It is only required once, at the end of a benchmark.

If you call it in every iteration, it increases the overhead. For blas the overhead is smaller (about 5-10% at smaller sizes), but it is going to be higher for FFT.

@drufat (Contributor, Author) commented May 3, 2016

Here are the benchmarks for the fft

old bench_fft.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N 2D fft
Time taken for  128 x  128: 0.4803 Gflops
Time taken for  256 x  256: 79.5998 Gflops
Time taken for  512 x  512: 95.8458 Gflops
Time taken for 1024 x 1024: 119.8499 Gflops
Time taken for 2048 x 2048: 134.8493 Gflops
Time taken for 4096 x 4096: 147.1597 Gflops
```

new bench_fft.py

```
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 364.19
[0] GeForce GTX 960, 2040 MB, CUDA Compute 5.2
Benchmark N x N 2D fft on arrayfire
Time taken for  128 x  128: 0.1194 Gflops
Time taken for  256 x  256: 68.5797 Gflops
Time taken for  512 x  512: 82.1921 Gflops
Time taken for 1024 x 1024: 117.4545 Gflops
Time taken for 2048 x 2048: 134.9769 Gflops
Time taken for 4096 x 4096: 147.8535 Gflops
Benchmark N x N 2D fft on numpy
Time taken for  128 x  128: 3.4478 Gflops
Time taken for  256 x  256: 3.9301 Gflops
Time taken for  512 x  512: 2.9090 Gflops
```

@pavanky (Member) commented May 3, 2016

@drufat It's still significantly slower at smaller sizes. Besides, always calling sync gives the wrong impression about how things should be timed on the GPU.

@drufat (Contributor, Author) commented May 3, 2016

Alright. I switched the code back to the time module. af.sync() is now called only once, as requested.

@pavanky pavanky merged commit ed52b90 into arrayfire:devel May 3, 2016
