This repository has been archived by the owner on Mar 12, 2021. It is now read-only.
Performance regression with mapreduce #611
Closed
Description
Here's an example on my machine using the master branch:
julia> using BenchmarkTools, CuArrays
julia> function pi_mc_cu(nsamples)
    xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
    mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
pi_mc_cu (generic function with 1 method)
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial:
memory estimate: 16.63 KiB
allocs estimate: 473
--------------
minimum time: 1.620 ms (0.00% GC)
median time: 1.666 ms (0.00% GC)
mean time: 1.709 ms (1.60% GC)
maximum time: 9.460 ms (7.77% GC)
--------------
samples: 2921
evals/sample: 1
(@v1.4) pkg> st CuArrays
Status `~/.julia/environments/v1.4/Project.toml`
[3a865a2d] CuArrays v1.7.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
[be33ccc6] CUDAnative v2.10.2 #master (https://github.com/JuliaGPU/CUDAnative.jl.git)
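For context, `pi_mc_cu` is a Monte Carlo estimate of π: for N points drawn uniformly from the unit square, the fraction with x² + y² < 1 approaches π/4, so π ≈ 4 · (number of hits) / N. A minimal CPU sketch of the same computation, using only Base and the Random standard library, can serve as a sanity check on the value returned by either CuArrays version. The name `pi_mc_cpu` and its `rng` keyword are hypothetical and not part of CuArrays.

```julia
using Random

# Hypothetical CPU counterpart of pi_mc_cu (not part of CuArrays).
function pi_mc_cpu(nsamples; rng = Random.default_rng())
    # Float32 samples, mirroring the element type CuArrays.rand produces by default
    xs = rand(rng, Float32, nsamples)
    ys = rand(rng, Float32, nsamples)
    # Count points inside the quarter circle (Bools summed into an Int via init=0),
    # then scale the hit fraction by 4
    mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
```

For example, `pi_mc_cpu(10_000_000; rng = MersenneTwister(0))` returns a `Float64` close to π. Multi-array `mapreduce` in Base requires Julia 1.2 or later, which the Julia 1.4 setup here satisfies.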
And here's the same example on the latest tagged release:
julia> using BenchmarkTools, CuArrays
julia> function pi_mc_cu(nsamples)
    xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
    mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
pi_mc_cu (generic function with 1 method)
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial:
memory estimate: 4.61 KiB
allocs estimate: 126
--------------
minimum time: 594.302 μs (0.00% GC)
median time: 659.321 μs (0.00% GC)
mean time: 667.914 μs (1.58% GC)
maximum time: 2.338 ms (39.61% GC)
--------------
samples: 7463
evals/sample: 1
(@v1.4) pkg> st CuArrays
Status `~/.julia/environments/v1.4/Project.toml`
[3a865a2d] CuArrays v1.7.2
(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
[be33ccc6] CUDAnative v2.10.2
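The two setups above can be recreated side by side in throwaway environments. This is a sketch assuming Julia 1.4's Pkg API; `master_env` and `tagged_env` are illustrative names. Note that `rev = "master"` resolves to wherever the branch points at install time, which is not necessarily the commits benchmarked in this report.

```julia
using Pkg

# Throwaway environment tracking the development branches, as in the first run.
master_env = mktempdir()
Pkg.activate(master_env)
Pkg.add([PackageSpec(name = "CuArrays", rev = "master"),
         PackageSpec(name = "CUDAnative", rev = "master"),
         PackageSpec(name = "BenchmarkTools")])

# Throwaway environment with the tagged releases from the second run.
tagged_env = mktempdir()
Pkg.activate(tagged_env)
Pkg.add([PackageSpec(name = "CuArrays", version = "1.7.2"),
         PackageSpec(name = "CUDAnative", version = "2.10.2"),
         PackageSpec(name = "BenchmarkTools")])
```

Each benchmark then needs its own Julia process, for example `julia --project=<dir>` with `master_env` or `tagged_env` as the directory, since a running session cannot switch to a different version of a package it has already loaded.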
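As you can see, I lost roughly a factor of 2.5 to 2.7 in performance on master (median 1.666 ms vs. 659.321 μs; minimum 1.620 ms vs. 594.302 μs), and it also allocates more (473 vs. 126 allocations, 16.63 KiB vs. 4.61 KiB). I tested master both with and without JULIA_CUDA_USE_BINARYBUILDER=false, so BinaryBuilder is not the problem. This is likely due to #602.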
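To narrow this down, one option is to take `CuArrays.rand` out of the timed region and benchmark only the reduction on each version. This sketch assumes the same packages as above; `nsamples`, `xs`, `ys` and `result` are illustrative globals, and `xs`/`ys` are interpolated with `$` so BenchmarkTools does not also time global-variable access.

```julia
using BenchmarkTools, CuArrays

nsamples = 10_000_000
# Generate the inputs once, outside the timed expression
xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)

# Time only the reduction; mapreduce returns a host scalar, so each sample
# waits for the result to be copied back from the GPU
result = @benchmark mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, $xs, $ys, init=0)
display(result)  # needed when run as a script rather than at the REPL
```

If the gap persists here, the regression is in `mapreduce` itself rather than in random number generation. Running the script as `JULIA_CUDA_USE_BINARYBUILDER=false julia bench.jl` (`bench.jl` being a placeholder file name) repeats the non-BinaryBuilder check on master.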
julia> versioninfo()
Julia Version 1.4.0-rc1.0
Commit b0c33b0cf5* (2020-01-23 17:23 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
JULIA_NUM_THREADS = 6
[mason@mason-pc ~]$ sudo pacman -Q --info cuda
Name : cuda
Version : 10.2.89-3
Description : NVIDIA's GPU programming toolkit
Architecture : x86_64
URL : https://developer.nvidia.com/cuda-zone
Licenses : custom:NVIDIA
Groups : None
Provides : cuda-toolkit cuda-sdk
Depends On : gcc8-libs gcc8 opencl-nvidia nvidia-utils
Optional Deps : gdb: for cuda-gdb
java-runtime=8: for nsight and nvvp
Required By : cudnn
Optional For : None
Conflicts With : None
Replaces : cuda-toolkit cuda-sdk
Installed Size : 4.04 GiB
Packager : Sven-Hendrik Haase <svenstaro@gmail.com>
Build Date : Tue 31 Dec 2019 01:07:53 AM MST
Install Date : Wed 26 Feb 2020 03:04:42 PM MST
Install Reason : Explicitly installed
Install Script : Yes
Validated By : Signature
[mason@mason-pc ~]$ lspci -v -s $(lspci | grep ' VGA ' | cut -d" " -f 1)
1f:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. TU106 [GeForce RTX 2060 Rev. A]
Flags: bus master, fast devsel, latency 0, IRQ 71
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia