A Sound and Complete Verification Tool for Warp-Specialized GPU Kernels
Update! Our paper on Weft, "Verification of Producer-Consumer Synchronization in GPU Programs," will be appearing at PLDI 2015.
Weft is a sound and complete verification tool for warp-specialized kernels that use named barriers on NVIDIA GPUs. Warp-specialized kernels can encode arbitrary producer-consumer relationships between different subsets of warps within a kernel using named barriers. This requires a more general analysis than most current GPU verification tools provide.
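As a concrete illustration, the sketch below (our own minimal example, not one of the Weft example kernels; the kernel name, warp counts, and barrier id are assumptions) shows how a warp-specialized kernel can split its warps into producers that fill a shared memory buffer and consumers that read it, with the two groups synchronized through a named barrier accessed via inline PTX:

```
// Minimal sketch of warp specialization with a named barrier (illustrative only).
// Producer warps fill a shared buffer and then arrive at named barrier 1 without
// blocking; consumer warps block on the same barrier until all producers arrive.
#define WARP_SIZE       32
#define PRODUCER_WARPS  4
#define CONSUMER_WARPS  4
#define BUFFER_SIZE     (PRODUCER_WARPS * WARP_SIZE)

__global__ void __launch_bounds__(256)   // 8 warps of 32 threads per CTA
producer_consumer(const float *in, float *out)
{
  __shared__ float buffer[BUFFER_SIZE];
  const int tid = threadIdx.x;
  // Total number of threads participating in named barrier 1.
  const int count = (PRODUCER_WARPS + CONSUMER_WARPS) * WARP_SIZE;

  if (tid < PRODUCER_WARPS * WARP_SIZE)
  {
    // Producer warps: write to shared memory, then signal the named barrier.
    buffer[tid] = in[blockIdx.x * BUFFER_SIZE + tid];
    asm volatile("bar.arrive 1, %0;" : : "r"(count));
  }
  else
  {
    // Consumer warps: wait on the named barrier, then read the produced data.
    asm volatile("bar.sync 1, %0;" : : "r"(count));
    const int idx = tid - PRODUCER_WARPS * WARP_SIZE;
    out[blockIdx.x * BUFFER_SIZE + idx] = 2.0f * buffer[idx];
  }
}
```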
Weft operates on the PTX code emitted by the CUDA compiler and verifies three important properties of any warp-specialized kernel:
- Deadlock Freedom - the use of named barriers must not result in deadlock.
- Safe Barrier Recycling - named barriers are a limited physical resource, so it is important to check that they are safely recycled.
- Race Freedom - all shared memory accesses must be properly synchronized by named barriers.
Weft performs a fully static analysis which requires that the use of named barriers and shared memory accesses be statically analyzable. All operations which are not statically analyzable are ignored and can optionally be reported. In practice, we have found that for most GPU kernels this is not an issue because synchronization and shared memory accesses are not dependent on program input and therefore can be verified statically.
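For intuition, the sketch below (our own illustration, not taken from the Weft examples) contrasts a shared memory access whose address is a pure function of the thread index, which can be checked statically, with one whose address comes from program input, which cannot be analyzed and would therefore be ignored (and optionally reported):

```
// Illustrative sketch: statically analyzable vs. input-dependent accesses.
__global__ void addressing(const int *indices, float *out)
{
  __shared__ float buffer[256];

  // Address depends only on the thread index: statically analyzable.
  buffer[threadIdx.x] = 1.0f;

  // Address depends on program input: not statically analyzable, so Weft
  // ignores this access (and can report it when warnings are enabled).
  buffer[indices[threadIdx.x] & 255] = 2.0f;

  out[threadIdx.x] = buffer[threadIdx.x];
}
```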
Due to its generality, Weft can also check non-warp-specialized code for race freedom. The one caveat is that Weft currently does not attempt to check code that uses atomics.
Weft requires an installation of the CUDA compiler (version 5.5 or later) to generate input PTX files. The CUDA toolkit can be downloaded from NVIDIA's website.
Weft can be built with a standard C++ compiler. Weft has been tested with g++ and clang on both Linux and Mac systems.
Weft is available on GitHub under the Apache Software License version 2.0. To clone a copy of the Weft source, type:
$ git clone https://github.com/lightsighter/Weft.git
After cloning the repository, change into the src directory and type:
$ make
This will build the Weft binary weft. You may wish to add the directory containing the Weft binary to your path using the following command:
$ export PATH=$PATH:/<path_to_weft>/src
Using Weft to validate a CUDA source file is straightforward. The first step is to use the CUDA compiler to generate a PTX file for Weft to consume as input. Currently, Weft will only analyze the first kernel that it finds in a PTX file, so files containing multiple kernels should be divided into separate source files.
To generate input for Weft, the CUDA compiler should be invoked with the -ptx flag to create an output PTX file. We also recommend the CUDA compiler be called with the -lineinfo flag so Weft can provide output based on CUDA source code line numbers instead of PTX line numbers. In some cases, the flags for compute architecture (-arch) and machine size (-m) may need to be specified depending on the kernel being compiled. Below are the two ways that we invoke the CUDA compiler on all of our example kernels for the Fermi and Kepler architectures respectively.
$ nvcc -ptx -lineinfo -m64 -arch=compute_20 source.cu
$ nvcc -ptx -lineinfo -m64 -arch=compute_35 source.cu
The resulting PTX file is the input to Weft. The PTX file name can either be specified to Weft using the -f flag or given as the last argument.
$ weft -f source.ptx -s -t 4
$ weft -s -t 4 source.ptx
As part of its validation, Weft needs to know how many threads are in each CTA. For kernels with 1-D CTAs, Weft can infer this information if the __launch_bounds__ annotation was given on the original CUDA kernel. However, if this declaration does not exist on the original source kernel, then the CTA size must be explicitly specified using the -n flag. As an example, our saxpy_single.cu source file contains no __launch_bounds__ declaration on its kernel, so we must tell Weft that the kernel requires CTAs containing 320 threads.
$ weft -n 320 saxpy_single.ptx
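For reference, if the kernel were annotated as in the sketch below (the kernel body is our own illustration, not the actual saxpy_single.cu source), Weft could infer the 320-thread CTA size on its own and the -n flag could be omitted:

```
// Illustrative annotation only; the actual saxpy_single.cu kernel may differ.
// __launch_bounds__(320) tells both the CUDA compiler and Weft that each CTA
// contains 320 threads, making the -n flag unnecessary.
__global__ void __launch_bounds__(320)
saxpy_single(float alpha, const float *x, float *y, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
```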
Note that the -n flag should also be used to specify multi-dimensional CTA shapes, which cannot be captured by the __launch_bounds__ annotation. Both of the following are valid examples:
$ weft -n 320x1x1 saxpy_single.ptx
$ weft -n 16x16 dgemm.ptx
Weft supports a large set of command line flags which we cover in more detail later. We mention two flags briefly now as they are often useful for many users. First, by default, Weft does not assume warp-synchronous execution, in which all threads in a warp execute in lock-step. Many CUDA programs rely on this property for correctness. The warp-synchronous execution assumption can be enabled in Weft by passing the -s flag on the command line. As an example, the Fermi chemistry kernel in examples/DME/chem_fermi.cu will report races if run under normal assumptions, but will always be race free under warp-synchronous execution.
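To give a sense of the kind of code that needs the -s flag, consider the generic warp-synchronous reduction sketched below (our own example, not the chem_fermi.cu kernel), which omits barriers inside a warp:

```
// Generic warp-synchronous reduction sketch (not from the Weft examples);
// assumes a CTA of a single warp (32 threads). Threads read values written
// by other lanes with no intervening barrier, relying on lock-step execution
// within the warp. Under Weft's default model these shared memory accesses
// would be reported as races; with -s they are treated as ordered.
__global__ void warp_reduce(const float *in, float *out)
{
  __shared__ volatile float vals[32];
  const int lane = threadIdx.x & 31;

  vals[lane] = in[blockIdx.x * 32 + lane];
  if (lane < 16) vals[lane] += vals[lane + 16];
  if (lane <  8) vals[lane] += vals[lane +  8];
  if (lane <  4) vals[lane] += vals[lane +  4];
  if (lane <  2) vals[lane] += vals[lane +  2];
  if (lane <  1) out[blockIdx.x] = vals[0] + vals[1];
}
```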
Another useful flag for Weft is the -t flag, which controls the number of parallel threads that Weft will use when performing validation. For most multi-core architectures we find that 2-4 threads is a good option. Weft is primarily a memory-bound application, and having two threads per socket is usually sufficient to saturate memory bandwidth.
We have provided a set of test kernels for Weft in the examples directory. Each individual directory contains its own Makefile for generating the PTX code for its kernels. We also have a script called run_examples.sh in the main examples directory which will validate all of the example kernels. Note that some kernels will report races. The script may take between 30 minutes and 1 hour (depending on the machine) to validate all of the kernels.
Below is a summary of the command line flags that Weft supports.
- -b : specify the CTA id to simulate (default 0x0x0)
- -d : print detailed information when giving error output, including where threads are blocked for deadlock as well as per-thread and per-address information for races
- -f : specify the input PTX file (can be omitted if the file is the last argument on the command line)
- -g : specify the grid dimensions for the kernel being simulated (this argument can be omitted in most cases as many kernels will not depend on these values; regardless of the grid bounds, Weft will always validate the single CTA specified by the -b flag)
- -i : instrument the execution of Weft to report the time taken and memory usage for each stage
- -n : set the number of threads per CTA (required if the CUDA kernel did not have a __launch_bounds__ annotation)
- -p : print out individual files of the Weft-modeled instructions for each thread (this will generate one file per thread)
- -s : assume warp-synchronous execution when checking for races
- -t : set the size of the thread pool for Weft to use; in general, Weft is memory bound, so one or two threads per socket should be sufficient for peak performance
- -v : enable verbose output
- -w : enable warnings about PTX instructions that cannot be statically emulated (can result in large output)
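For example, combining several of these flags, the dgemm example from earlier could be validated with detailed error output, warp-synchronous execution, and two worker threads with a command along these lines:

$ weft -n 16x16 -s -t 2 -d dgemm.ptx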