Add a paragraph on how to use.
liuliu committed Sep 15, 2022
1 parent 4f476f4 · commit b85b0bd
Showing 2 changed files with 41 additions and 10 deletions.
README.md: 33 additions & 3 deletions
@@ -6,11 +6,11 @@ This is a single-file re-implementation of [Stable Diffusion](https://github.com/CompVis/stable-diffusion).

This re-implementation serves as an education for me to understand diffusion models. It is also necessary for my follow-up work to enable Stable Diffusion on mobile devices such as iPad / iPhone. Without a Swift re-implementation, doing mobile-focused optimization in Python would be difficult, and shipping through the App Store would be impossible. It is possible to do this differently, such as exporting to the ONNX runtime and using that as the driver on mobile devices, but that limits what kinds of optimizations you can apply. As you can tell, running models that total about 8GiB in-memory and 4GiB at-rest with full floating-point precision is not trivial on mobile devices. It might require some non-conventional optimizations that may not be available through existing frameworks. Using something I am familiar with (a framework I built) is a good starting point.

-## Where We Are?
+## Where We Are

The CLIP text model, the UNet diffusion model and the decoder have been ported. The `examples:txt2img` target is usable with some path changes inside `examples/txt2img/main.swift`. The encoder still needs to be ported to enable `img2img`. Other targets, such as `examples:unet`, `examples:clip` and `examples:autoencoder`, are example programs that convert PyTorch weights to the format s4nnc uses.
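
For example, regenerating the s4nnc checkpoints should be (assuming the PyTorch checkpoint paths inside each example's `main.swift` have been adjusted, as with `txt2img`) a matter of running the corresponding target; a hypothetical invocation:

```
bazel run examples:clip --compilation_mode=opt
bazel run examples:unet --compilation_mode=opt
bazel run examples:autoencoder --compilation_mode=opt
```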

-## What's Next?
+## What's Next

Next on my list is to implement the tokenizer. Thanks to PythonKit, right now I am using the tokenizer from Hugging Face. Once the tokenizer is implemented, the whole thing should be able to run without Python dependencies.
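
A minimal sketch of what that PythonKit bridge looks like (the `openai/clip-vit-large-patch14` tokenizer name is an assumption based on what upstream Stable Diffusion uses):

```
import PythonKit

// CLIP pads / truncates every prompt to 77 tokens.
let transformers = Python.import("transformers")
let tokenizer = transformers.CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
let batchEncoding = tokenizer(
  ["a photograph of an astronaut riding a horse"], truncation: true, max_length: 77,
  return_length: true, return_overflowing_tokens: false, padding: "max_length",
  return_tensors: "pt")
print(batchEncoding["input_ids"])  // 77 token ids, padded with the end-of-text token
```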

@@ -20,6 +20,36 @@ Right now, at run time, the UNet model uses ~1.5GiB memory in addition to its 3.3GiB weights

Converting the model to FP16 would halve the memory footprint immediately, but this will be close to the last thing I do. Just by using FP16, UNet should use around 1.9GiB memory, which is very manageable on mobile devices now. Given that we can unload the UNet model and load the decoder from disk once denoising is done, the two combined can, hopefully, finally run Stable Diffusion on mobile. We can further quantize the weights to int8 with the `LLM.int8()` transformers trick: https://arxiv.org/pdf/2208.07339.pdf.
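
Back-of-envelope arithmetic behind those numbers (the ~860M UNet parameter count is an assumption, not something measured from this codebase):

```
// Rough memory estimates; ~860M parameters is an assumption.
let parameters = 860_000_000.0
let gib = 1024.0 * 1024.0 * 1024.0
print(parameters * 4 / gib)  // FP32 weights: ~3.2GiB at rest
print(parameters * 2 / gib)  // FP16 weights: ~1.6GiB, before activations
print(parameters * 1 / gib)  // int8 weights: ~0.8GiB with LLM.int8()
```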

-## Is It Comparable?
+## Is It Comparable

Right now, I haven't applied any specific optimizations. Further, model loading in s4nnc today requires executing the model once, and some optimization passes (finding the most efficient kernels, etc.) are not persisted. That being said, we can compare the execution time of txt2img from Swift vs. the one from CompVis (there are more optimized forks available, but going through them to find the best would take time) over the diffusion process plus the decoding process. The Swift txt2img on GPU took about 17s while the CompVis one took about 11s (both on a single 2080 Ti). A cursory look at the `nvprof` output suggests that transposes and not using cuBLASLt are the leading causes of the extra 6s.
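
To reproduce that kind of profile, you can run the compiled binary under `nvprof` directly; a hypothetical invocation (the `bazel-bin` path and arguments depend on your setup):

```
bazel build examples:txt2img --compilation_mode=opt
nvprof --print-gpu-summary bazel-bin/examples/txt2img \
  /path/to/workdir "a photograph of an astronaut riding a horse"
```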

## How to Run This

There is quite a bit of setup right now. As I move all the bits to Swift and start the CPU / Metal work, it should get easier. It would also help if I add SwiftPM support on the s4nnc side, but that is not as high a priority as the other two.

First, you need to install Bazel and various dependencies for s4nnc. To install Bazel, follow the instructions at https://bazel.build/install.

Other dependencies include the Swift compiler, CUDA (10.2 and above) and clang. The former two you have to install yourself. The rest, on a Debian-like system, can be installed with:

```
sudo apt install clang llvm libicu-dev libpng-dev libjpeg-dev libatlas-base-dev libblas-dev libgsl-dev libdispatch-dev libomp-dev libfftw3-dev
```
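
For the Swift compiler and CUDA, which you install yourself, a quick sanity check that the toolchain is in place might look like:

```
swift --version   # Swift toolchain from swift.org
nvcc --version    # should report CUDA 10.2 or above
clang --version
```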

For now, you need to install `transformers` for the tokenizer.

```
virtualenv -p python3 _env
source _env/bin/activate
pip install transformers
```

You also need to download the model. I put the Stable Diffusion v1.4 model at http://static.libccv.org/sd-v1.4.ckpt. Note that this is an s4nnc-compatible file, not the PyTorch one you would download elsewhere. You can check the related examples for how this file is generated.
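
For example, with `wget` (the target directory here is a placeholder):

```
# Put the checkpoint in the directory you will later pass to txt2img.
wget http://static.libccv.org/sd-v1.4.ckpt -O /path/to/workdir/sd-v1.4.ckpt
```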

With these, you can run:

```
bazel run examples:txt2img --compilation_mode=opt -- /home/the-absolute-work-directory-that-contains-sd-v1.4.ckpt-file "a photograph of an astronaut riding a horse"
```

The image will be generated in the given directory as `txt2img.png`.
examples/txt2img/main.swift: 8 additions & 7 deletions
@@ -563,8 +563,11 @@ let unconditional_batch_encoding = tokenizer(
return_length: true, return_overflowing_tokens: false, padding: "max_length", return_tensors: "pt"
)

+// The first argument is the work directory; everything after it forms the prompt.
+let workDir = CommandLine.arguments[1]
+let text = CommandLine.arguments.suffix(from: 2).joined(separator: " ")

let batch_encoding = tokenizer(
["a photograph of an astronaut riding a horse"], truncation: true, max_length: 77,
[text], truncation: true, max_length: 77,
return_length: true, return_overflowing_tokens: false, padding: "max_length", return_tensors: "pt"
)

@@ -622,7 +625,7 @@ graph.withNoGrad {
let positionTensorGPU = positionTensor.toGPU(0)
let casualAttentionMaskGPU = casualAttentionMask.toGPU(0)
let _ = textModel(inputs: tokensTensorGPU, positionTensorGPU, casualAttentionMaskGPU)
graph.openStore("/home/liu/workspace/swift-diffusion/text_model.ckpt") {
graph.openStore(workDir + "/sd-v1.4.ckpt") {
$0.read("text_model", model: textModel)
}
let c = textModel(inputs: tokensTensorGPU, positionTensorGPU, casualAttentionMaskGPU)[0].as(
@@ -633,11 +636,9 @@ graph.withNoGrad {
var x = x_T
var xIn = graph.variable(.GPU(0), .NCHW(2, 4, 64, 64), of: Float.self)
let _ = unet(inputs: xIn, graph.variable(ts[0]), c)
graph.openStore("/home/liu/workspace/swift-diffusion/unet.ckpt") {
$0.read("unet", model: unet)
}
let _ = decoder(inputs: x)
graph.openStore("/home/liu/workspace/swift-diffusion/autoencoder.ckpt") {
graph.openStore(workDir + "/sd-v1.4.ckpt") {
$0.read("unet", model: unet)
$0.read("decoder", model: decoder)
}
let alphasCumprod = model.alphasCumprod
@@ -707,7 +708,7 @@
min(max(Int(Float((b + 1) / 2) * 255), 0), 255))
}
}
let _ = "/home/liu/workspace/swift-diffusion/txt2img.png".withCString {
let _ = (workDir + "/txt2img.png").withCString {
ccv_write(image, UnsafeMutablePointer(mutating: $0), nil, Int32(CCV_IO_PNG_FILE), nil)
}
}
