From 16a468761199e321c8989fdac055b8f294c1d984 Mon Sep 17 00:00:00 2001
From: Mike <mebauer@cs.stanford.edu>
Date: Sat, 7 Feb 2015 14:15:46 -0800
Subject: [PATCH] Updating documentation

---
 README.md | 143 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 132 insertions(+), 11 deletions(-)
diff --git a/README.md b/README.md
index 4519955..7e7bbb8 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,22 @@ Weft
 
 A Sound and Complete Verification Tool for Warp-Specialized GPU Kernels
 
+Update! Our paper on Weft, **Verification of Producer-Consumer
+Synchronization in GPU Programs** will be appearing at
+[PLDI 2015](http://conf.researchr.org/home/pldi2015).
+
+Navigation
+====
+
+1. [Overview](#overview)
+2. [Prerequisites](#prerequisites)
+3. [Downloading Weft](#downloading-and-building-weft)
+4. [Using Weft](#using-weft)
+5. [Command Line Arguments](#command-line-arguments)
+
+Overview
+====
+
 Weft is a sound and complete verification tool for warp-specialized 
 kernels that use named barriers on NVIDIA GPUs. Warp-specialized 
 kernels can encode arbitrary producer-consumer relationships between 
@@ -23,7 +39,7 @@ three important properties of any warp-specialized kernel.
 Weft performs a fully static analysis which requires that the use of 
 named barriers and shared memory accesses be statically analyzable.
 All operations which are not statically analyzable are ignored and 
-can optionally be reported. In practice, we have found that For most 
+can optionally be reported. In practice, we have found that for most 
 GPU kernels this is not an issue because synchronization and shared 
 memory accesses are not dependent on program input and therefore
 can be verified statically.
@@ -32,26 +48,131 @@ Due to its generality, Weft is also capable of checking non-warp-specialized
 code as well for race freedom. The one caveat is that Weft currently
 does not attempt to check code that uses atomics.
 
-To use Weft, request that the CUDA compiler generate PTX for a kernel
-using the <em>--ptx</em> flag. The output PTX file can then be passed
-directly to Weft. Weft currently assumes that there is only one kernel 
-per file.
+Prerequisites
+====
+
+Weft requires an installation of the CUDA compiler for generating
+input PTX files. The CUDA toolkit can be downloaded 
+[here](https://developer.nvidia.com/cuda-downloads). Weft requires
+at least CUDA version 5.0 or later.
+
+Weft can be built with a standard C++ compiler. We've built Weft
+with g++ and clang on both Linux and Mac systems.
+
+Downloading and Building Weft
+====
+
+Weft is available on github under the Apache Software License
+version 2.0. To clone a copy of the Weft source type:
+
+  $ git clone https://github.com/lightsighter/Weft.git
+
+After cloning the repository, change into `src` directory
+and type:
+
+  $ make
+
+This will build the Weft binary. You may wish to add the 
+directory containing the Weft binary to your path.
 
-The examples directory contains several CUDA kernels and Makefiles for
-generating the associated PTX code.
+  $ export PATH=$PATH:/<path_to_weft>/src
+
+Using Weft
+====
+
+Using Weft to validate a CUDA source file is straightforward.
+The first step is to use the CUDA compiler to generate a PTX
+file for Weft to consume as input. Currently, Weft will only
+analyze the first kernel that it finds in a PTX file, so files
+containing multiple kernels should be divided into separate
+source files.
+
+To generate input for Weft, the CUDA compiler should be
+invoked with the `-ptx` flag to create an output PTX file.
+We also recommend the CUDA compiler be called with the 
+`-lineinfo` flag so Weft can provide output based on CUDA 
+source code line numbers instead of PTX line numbers. In 
+some cases, the flags for compute architecture (`-arch`) and 
+machine size (`-m`) may need to be specified depending on the
+kernel being compiled. Below are the two ways that we invoke 
+the CUDA compiler on all of our example kernels for the
+Fermi and Kepler architectures respectively.
+
+Fermi:  $ nvcc -ptx -lineinfo -m64 -arch=compute\_20 source.cu
+Kepler: $ nvcc -ptx -lineinfo -m64 -arch=compute\_35 source.cu
+
+The resulting PTX file is the input to Weft. The PTX file name
+can either be specified to Weft using the `-f` flag or as the
+last argument.
+
+  $ weft source.ptx
+
+As part of its validation, Weft needs to know how many threads
+are in each CTA. For kernels with 1-D CTAs, Weft can infer this
+information if the `__launch_bounds__` annotation was given on
+the original kernel. However, if this declaration did not exits on
+the original source kernel, then it must be explicitly specified
+using the `-n` flag. As an example, our `saxpy_single.cu` source
+file contains has no `__launch_bounds__` declaration on its
+kernel, therefore we must tell Weft that the kernel assumes CTAs
+of 320 threads.
+
+  $ weft -n 320 saxpy\_single.ptx
+
+Note that the `-n` flag should also be used to specify multi-dimensional
+CTA shapes which cannot be captured by the `__launch_bounds__` 
+annotation. Both of the following are valid examples:
+
+  $ weft -n 320x1x1 saxpy\_single.ptx
+  $ weft -n 16x16 dgemm.ptx
+
+Weft supports a large set of command line flags which we cover in
+more detail [later](#command-line-arguments). We mention two flags
+briefly now as they are often useful for many users. First, by default,
+Weft does not assume <em>warp synchronous</em> execution where all
+threads in a warp execute in lock-step. Many CUDA programs rely on 
+this property for correctness. The warp synchronous execution assumption
+can be enabled in Weft by passing the `-s` flag on the command line.
+As an example, the Fermi chemistry kernel in `examples/DME/chem\_fermi.cu`
+will report races if run under normal assumptions, but will always be 
+race free under a warp synchronous execution.
+
+Another useful flag for Weft is the `-t` flag which controls the 
+number of parallel threads that Weft will use when performing validation.
+For most multi-core architectures we find that 2-4 threads is a good
+option. Weft is primarily a memory bound application, and having two
+threads per socket is usually sufficient to saturate memory bandwidth.
+
+We have provided many example kernels for Weft in the `examples` 
+directory. Each individual directory contains its own Makefile for
+generating the PTX code for each kernel. We also have a script called
+in `run_examples.sh` in the main `examples` directory which will 
+validate all of the example kernels. Note that some kernels will 
+successfully report races. The script may take between 30 minutes
+and 1 hour to validate all of the kernels.
+
+Command Line Arguments
+====
 
-Below is a summary of the command line flags that Weft accepts.
+Below is a summary of the command line flags that Weft supports.
 
+ * <em>-b</em>: specify the CTA id to simulate (default 0x0x0)
+ * <em>-d</em>: print detailed information when giving error output,
+                including where threads are blocked for deadlock as
+                well as per-thread and per-address information for races
  * <em>-f</em>: specify the input PTX file (can be omitted if 
                 the file is the last argument in the command line)
+ * <em>-g</em>: specify the grid dimensions for the kernel being simulated
+                (this argument can be omitted in most cases as many kernels
+                will not depend on these values)
  * <em>-i</em>: instrument the execution of Weft to report the
                 time taken and memory usage for each stage
  * <em>-n</em>: set the number of threads per CTA. This is required
                 if the CUDA kernel did not have a 
                 <em>\_\_launch_bounds\_\_</em> annotation
- * <em>-s</em>: assume warp-synchronous exeuction when checking for races
- * <em>-t</em>: set the size of the thread pool for Weft to use. In
-                general, Weft is memory bound, so one thread per socket
+ * <em>-s</em>: assume warp-synchronous execution when checking for races
+ * <em>-t</em>: set the size of the thread pool for Weft to use; in
+                general, Weft is memory bound, so one or two threads per socket
                 should be sufficient for achieving peak performance.
  * <em>-v</em>: enable verbose output
  * <em>-w</em>: enable warnings about PTX instructions that cannot be