
Commit 2fa6572

Add some performance tips to the documentation (#1999)
[skip tests]
1 parent 98450e1 commit 2fa6572

2 files changed: +179 −0 lines

docs/make.jl

Lines changed: 3 additions & 0 deletions
```diff
@@ -13,6 +13,8 @@ function main()
                       repo_root_url="$src/blob/master/docs")
     Literate.markdown("src/tutorials/custom_structs.jl", "src/tutorials";
                       repo_root_url="$src/blob/master/docs")
+    Literate.markdown("src/tutorials/performance.jl", "src/tutorials";
+                      repo_root_url="$src/blob/master/docs")
 end
 
 @info "Generating Documenter.jl site"
@@ -36,6 +38,7 @@ function main()
         "Tutorials" => Any[
             "tutorials/introduction.md",
             "tutorials/custom_structs.md",
+            "tutorials/performance.md"
         ],
         "Installation" => Any[
             "installation/overview.md",
```
docs/src/tutorials/performance.jl

Lines changed: 176 additions & 0 deletions (new file)

# # Performance Tips

# ## General Tips

# Always start by profiling your code (see the [Profiling](../development/profiling.md) page for more details). First analyze your application as a whole, using `CUDA.@profile` or NSight Systems, to identify hotspots and bottlenecks. Focusing on these, you will want to (a minimal profiling sketch follows the list below):

# * Minimize data transfers between the CPU and GPU: get rid of unnecessary memory copies and batch many small transfers into larger ones;
# * Identify problematic kernel invocations: you may be launching thousands of kernels that could be fused into a single call;
# * Find stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.
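
# As a minimal sketch of that first step (the arrays here are hypothetical stand-ins for your own data):

# ```julia
# using CUDA
#
# x_d = CUDA.rand(2^20)
# y_d = CUDA.rand(2^20)
#
# # profile everything launched inside the expression, host and device
# CUDA.@profile y_d .+= x_d
# ```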

# If that isn't sufficient, and you have identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for, in order of importance:
# * Memory optimizations are the most important area for performance. Optimizing memory accesses, e.g. avoiding needless global accesses (buffering in shared memory instead) and coalescing accesses, can lead to big performance improvements;
# * Launching more threads on each streaming multiprocessor can be achieved by lowering register pressure and reducing shared memory usage; the tips below outline the various ways in which register pressure can be reduced;
# * Using `Float32`s instead of `Float64`s can provide significantly better performance;
# * Avoid control flow instructions such as `if`, which cause branches; e.g. replace an `if` with an `ifelse` where possible, as sketched after this list;
# * Increase the arithmetic intensity so the GPU can hide the latency of memory accesses.
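
# For instance, a hypothetical clamping step could avoid the divergent branch like so:

# ```julia
# # branchy version
# if x[i] > 0f0
#     y[i] = x[i]
# else
#     y[i] = 0f0
# end
#
# # branch-free version: `ifelse` evaluates both arguments but avoids divergence
# y[i] = ifelse(x[i] > 0f0, x[i], 0f0)
# ```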

# ### Inlining

# Inlining can reduce register usage and thus speed up kernels. To force inlining of all functions in a kernel, use `@cuda always_inline=true`.
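
# For example, a hypothetical launch with all device functions inlined (`my_kernel!` and its arguments are placeholders):

# ```julia
# @cuda threads=256 always_inline=true my_kernel!(args...)
# ```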

# ### Limiting the Maximum Number of Registers Per Thread

# The number of threads that can be launched is partly determined by the number of registers a kernel uses, because registers are shared between all threads on a multiprocessor.
# Setting a maximum number of registers per thread forces fewer registers to be used, which can increase the thread count at the expense of spilling registers into local memory; this may improve performance. To set the max registers to 32, use `@cuda max_registers=32`.
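
# A sketch of checking the effect, using the register-inspection call shown later in this tutorial (`my_kernel!` is again a placeholder):

# ```julia
# kernel = @cuda launch=false max_registers=32 my_kernel!(args...)
# CUDA.registers(kernel)
# ```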

# ### FastMath

# Use `@fastmath` to get faster versions of common mathematical functions, and use `@cuda fastmath=true` for even faster square roots.
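
# A sketch combining both, on a hypothetical elementwise kernel:

# ```julia
# function sqrt_kernel!(y, x)
#     i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
#     if i <= length(y)
#         @inbounds y[i] = @fastmath sqrt(x[i])
#     end
#     return
# end
#
# @cuda threads=256 blocks=cld(length(x_d), 256) fastmath=true sqrt_kernel!(y_d, x_d)
# ```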

# ## Resources

# For further information you can check out these resources.

# NVIDIA's technical blog has a lot of good tips: [Pro-Tips](https://developer.nvidia.com/blog/tag/pro-tip/), [Optimization](https://developer.nvidia.com/blog/tag/optimization/).

# The [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) is also relevant for Julia.

# The following notebooks also have some good tips: [JuliaCon 2021 GPU Workshop](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb), [Advanced Julia GPU Training](https://github.com/JuliaComputing/Training/tree/master/AdvancedGPU).

# Also see the [perf](https://github.com/JuliaGPU/CUDA.jl/tree/master/perf) folder for some optimised code examples.

# ## Julia-Specific Tips

# ### Minimise Runtime Exceptions

# Many common operations in Julia can throw errors at runtime. They typically do this by branching and calling a throwing function in that branch, both of which are slow on GPUs. Using `@inbounds` when indexing into arrays eliminates exceptions due to bounds checking. You can also use `assume` from the package LLVM.jl to eliminate exceptions, e.g.:

# ```julia
# using LLVM, LLVM.Interop
#
# function test(x, y)
#     assume(x > 0)
#     div(y, x)
# end
# ```

# The `assume(x > 0)` call tells the compiler that `x` cannot be zero, so it can omit the divide-by-zero error branch.

# For more information and examples, check out [Kernel analysis and optimization](https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb).

# ### 32-bit Integers

# Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's integer literals are `Int64`s, which causes expressions like `blockIdx().x - 1` to be promoted to 64-bit integers. To stay in 32 bits, replace the `1` with `Int32(1)`, or more succinctly `1i32` after running `using CUDA: i32`.

# To see how much of a difference this makes, let's use a kernel introduced in the [introduction](../introduction) for in-place addition.

using CUDA, BenchmarkTools

function gpu_add3!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(y)  # grid-stride loop
        @inbounds y[i] += x[i]
    end
    return
end

# Now let's see how many registers are used:

# ```julia
# x_d = CUDA.fill(1.0f0, 2^28)
# y_d = CUDA.fill(2.0f0, 2^28)
#
# CUDA.registers(@cuda gpu_add3!(y_d, x_d))
# ```

# ```
# 29
# ```

# Our kernel using 32-bit integers is below.

function gpu_add4!(y, x)
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return
end

# ```julia
# CUDA.registers(@cuda gpu_add4!(y_d, x_d))
# ```

# ```
# 28
# ```

# So we use one less register by switching to 32-bit integers; for kernels using even more 64-bit integers, we would expect larger falls in register count.
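
# The same subtraction can also be written with the `i32` literal suffix mentioned above; a sketch of just the index computation:

# ```julia
# using CUDA: i32
#
# index = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
# ```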

# ### Avoiding `StepRange`

# In the previous kernel, the `for` loop iterated over `index:stride:length(y)`, which is a `StepRange`. Unfortunately, constructing a `StepRange` is slow: the constructor can throw errors, and it contains computation that is unnecessary when we just want to loop. It is faster to use a `while` loop instead, like so:

function gpu_add5!(y, x)
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    i = index
    while i <= length(y)
        @inbounds y[i] += x[i]
        i += stride
    end
    return
end

# The benchmark[^1]:

function bench_gpu4!(y, x)
    kernel = @cuda launch=false gpu_add4!(y, x)
    config = launch_configuration(kernel.fun)  # occupancy-based launch parameters
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)

    CUDA.@sync kernel(y, x; threads, blocks)
end

function bench_gpu5!(y, x)
    kernel = @cuda launch=false gpu_add5!(y, x)
    config = launch_configuration(kernel.fun)
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)

    CUDA.@sync kernel(y, x; threads, blocks)
end

# ```julia
# @btime bench_gpu4!($y_d, $x_d)
# ```

# ```
# 76.149 ms (57 allocations: 3.70 KiB)
# ```

# ```julia
# @btime bench_gpu5!($y_d, $x_d)
# ```

# ```
# 75.732 ms (58 allocations: 3.73 KiB)
# ```

# This benchmark shows only a small performance benefit for this kernel; however, there is a big difference in the number of registers used, recalling that 28 were used with the `StepRange`:

# ```julia
# CUDA.registers(@cuda gpu_add5!(y_d, x_d))
# ```

# ```
# 12
# ```

# [^1]: Conducted on Julia version 1.9.2. The benefit of this technique should be reduced on version 1.10, or when using `always_inline=true` with the `@cuda` macro, e.g. `@cuda always_inline=true launch=false gpu_add4!(y, x)`.
