Skip to content

Commit

Permalink
Updating documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
lightsighter committed Feb 7, 2015
1 parent 9135c65 commit 16a4687
Showing 1 changed file with 132 additions and 11 deletions.
143 changes: 132 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,22 @@ Weft

A Sound and Complete Verification Tool for Warp-Specialized GPU Kernels

Update! Our paper on Weft, **Verification of Producer-Consumer
Synchronization in GPU Programs** will be appearing at
[PLDI 2015](http://conf.researchr.org/home/pldi2015).

Navigation
====

1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [Downloading Weft](#downloading-and-building-weft)
4. [Using Weft](#using-weft)
5. [Command Line Arguments](#command-line-arguments)

Overview
====

Weft is a sound and complete verification tool for warp-specialized
kernels that use named barriers on NVIDIA GPUs. Warp-specialized
kernels can encode arbitrary producer-consumer relationships between
Expand All @@ -23,7 +39,7 @@ three important properties of any warp-specialized kernel.
Weft performs a fully static analysis which requires that the use of
named barriers and shared memory accesses be statically analyzable.
All operations which are not statically analyzable are ignored and
can optionally be reported. In practice, we have found that For most
can optionally be reported. In practice, we have found that for most
GPU kernels this is not an issue because synchronization and shared
memory accesses are not dependent on program input and therefore
can be verified statically.
Expand All @@ -32,26 +48,131 @@ Due to its generality, Weft is also capable of checking non-warp-specialized
code as well for race freedom. The one caveat is that Weft currently
does not attempt to check code that uses atomics.

To use Weft, request that the CUDA compiler generate PTX for a kernel
using the <em>--ptx</em> flag. The output PTX file can then be passed
directly to Weft. Weft currently assumes that there is only one kernel
per file.
Prerequisites
====

Weft requires an installation of the CUDA compiler for generating
input PTX files. The CUDA toolkit can be downloaded
[here](https://developer.nvidia.com/cuda-downloads). Weft requires
at least CUDA version 5.0 or later.

Weft can be built with a standard C++ compiler. We've built Weft
with g++ and clang on both Linux and Mac systems.

Downloading and Building Weft
====

Weft is available on github under the Apache Software License
version 2.0. To clone a copy of the Weft source type:

$ git clone https://github.com/lightsighter/Weft.git

After cloning the repository, change into `src` directory
and type:

$ make

This will build the Weft binary. You may wish to add the
directory containing the Weft binary to your path.

The examples directory contains several CUDA kernels and Makefiles for
generating the associated PTX code.
$ export PATH=$PATH:/<path_to_weft>/src

Using Weft
====

Using Weft to validate a CUDA source file is straightforward.
The first step is to use the CUDA compiler to generate a PTX
file for Weft to consume as input. Currently, Weft will only
analyze the first kernel that it finds in a PTX file, so files
containing multiple kernels should be divided into separate
source files.

To generate input for Weft, the CUDA compiler should be
invoked with the `-ptx` flag to create an output PTX file.
We also recommend the CUDA compiler be called with the
`-lineinfo` flag so Weft can provide output based on CUDA
source code line numbers instead of PTX line numbers. In
some cases, the flags for compute architecture (`-arch`) and
machine size (`-m`) may need to be specified depending on the
kernel being compiled. Below are the two ways that we invoke
the CUDA compiler on all of our example kernels for the
Fermi and Kepler architectures respectively.

Fermi: $ nvcc -ptx -lineinfo -m64 -arch=compute\_20 source.cu
Kepler: $ nvcc -ptx -lineinfo -m64 -arch=compute\_35 source.cu

The resulting PTX file is the input to Weft. The PTX file name
can either be specified to Weft using the `-f` flag or as the
last argument.

$ weft source.ptx

As part of its validation, Weft needs to know how many threads
are in each CTA. For kernels with 1-D CTAs, Weft can infer this
information if the `__launch_bounds__` annotation was given on
the original kernel. However, if this declaration did not exits on
the original source kernel, then it must be explicitly specified
using the `-n` flag. As an example, our `saxpy_single.cu` source
file contains has no `__launch_bounds__` declaration on its
kernel, therefore we must tell Weft that the kernel assumes CTAs
of 320 threads.

$ weft -n 320 saxpy\_single.ptx

Note that the `-n` flag should also be used to specify multi-dimensional
CTA shapes which cannot be captured by the `__launch_bounds__`
annotation. Both of the following are valid examples:

$ weft -n 320x1x1 saxpy\_single.ptx
$ weft -n 16x16 dgemm.ptx

Weft supports a large set of command line flags which we cover in
more detail [later](#command-line-arguments). We mention two flags
briefly now as they are often useful for many users. First, by default,
Weft does not assume <em>warp synchronous</em> execution where all
threads in a warp execute in lock-step. Many CUDA programs rely on
this property for correctness. The warp synchronous execution assumption
can be enabled in Weft by passing the `-s` flag on the command line.
As an example, the Fermi chemistry kernel in `examples/DME/chem\_fermi.cu`
will report races if run under normal assumptions, but will always be
race free under a warp synchronous execution.

Another useful flag for Weft is the `-t` flag which controls the
number of parallel threads that Weft will use when performing validation.
For most multi-core architectures we find that 2-4 threads is a good
option. Weft is primarily a memory bound application, and having two
threads per socket is usually sufficient to saturate memory bandwidth.

We have provided many example kernels for Weft in the `examples`
directory. Each individual directory contains its own Makefile for
generating the PTX code for each kernel. We also have a script called
in `run_examples.sh` in the main `examples` directory which will
validate all of the example kernels. Note that some kernels will
successfully report races. The script may take between 30 minutes
and 1 hour to validate all of the kernels.

Command Line Arguments
====

Below is a summary of the command line flags that Weft accepts.
Below is a summary of the command line flags that Weft supports.

* <em>-b</em>: specify the CTA id to simulate (default 0x0x0)
* <em>-d</em>: print detailed information when giving error output,
including where threads are blocked for deadlock as
well as per-thread and per-address information for races
* <em>-f</em>: specify the input PTX file (can be omitted if
the file is the last argument in the command line)
* <em>-g</em>: specify the grid dimensions for the kernel being simulated
(this argument can be omitted in most cases as many kernels
will not depend on these values)
* <em>-i</em>: instrument the execution of Weft to report the
time taken and memory usage for each stage
* <em>-n</em>: set the number of threads per CTA. This is required
if the CUDA kernel did not have a
<em>\_\_launch_bounds\_\_</em> annotation
* <em>-s</em>: assume warp-synchronous exeuction when checking for races
* <em>-t</em>: set the size of the thread pool for Weft to use. In
general, Weft is memory bound, so one thread per socket
* <em>-s</em>: assume warp-synchronous execution when checking for races
* <em>-t</em>: set the size of the thread pool for Weft to use; in
general, Weft is memory bound, so one or two threads per socket
should be sufficient for achieving peak performance.
* <em>-v</em>: enable verbose output
* <em>-w</em>: enable warnings about PTX instructions that cannot be
Expand Down

0 comments on commit 16a4687

Please sign in to comment.