Some cursory perf analysis.

timdesrochers · Sep 14, 2022 · 4f476f4
1 parent b39f29d
commit 4f476f4
Showing 3 changed files with 5 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -22,4 +22,4 @@ Converting the model to FP16 would save memory footprint instantly, but this wil
 
 ## Is It Comparable?
 
-Right now, I didn't run any specific optimizations. Further, the model loading as of today for s4nnc requires executing the model once, and we have some optimization runs (find the most efficient kernels etc.) that are not saved. That has been said, we can compare the execution time of txt2img from Swift v.s. the one from CompVis (there are more optimized forks available, but going through them to find the best would take time) of the diffusion process + decoding process. The Swift txt2img on GPU took about 20s while the CompVis took about 11s (both with one 2080 Ti). I haven't done full analysis on where the slowness is from, but likely on the GroupNorm operator.
+Right now, I didn't run any specific optimizations. Further, the model loading as of today for s4nnc requires executing the model once, and we have some optimization runs (find the most efficient kernels etc.) that are not saved. That has been said, we can compare the execution time of txt2img from Swift v.s. the one from CompVis (there are more optimized forks available, but going through them to find the best would take time) of the diffusion process + decoding process. The Swift txt2img on GPU took about 17s while the CompVis took about 11s (both with one 2080 Ti). A cursory look at the `nvprof` output shows that transposes and not using cublasLt are the leading causes of the extra 6s spent.
diff --git a/WORKSPACE b/WORKSPACE
@@ -3,9 +3,9 @@ load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
 
 git_repository(
     name = "s4nnc",
-    commit = "82c18635de0b46ecca7439099f19ba657cb0db77",
+    commit = "95b7b5f94f1b4385c67c065457710a674de77f06",
     remote = "https://github.com/liuliu/s4nnc.git",
-    shallow_since = "1663118440 -0400",
+    shallow_since = "1663197056 -0400",
 )
 
 load("@s4nnc//:deps.bzl", "s4nnc_deps")

diff --git a/examples/txt2img/main.swift b/examples/txt2img/main.swift
@@ -615,6 +615,8 @@ func xPrevAndPredX0(
   return (xPrev, predX0)
 }
 
+graph.workspaceSize = 1_024 * 1_024 * 1_024
+
 graph.withNoGrad {
   let tokensTensorGPU = tokensTensor.toGPU(0)
   let positionTensorGPU = positionTensor.toGPU(0)