ci: faster compile/ci (#305)

Qubitium · web-flow · commit 955dfc5f1188 · 2024-06-15T11:04:43.000-07:00
The nvcc compilation profile has changed drastically now that
`gqa_group_size` is an input arg and no longer a template parameter.

This PR improves compile time by ~20% on my dev machine. Results may vary
across environments, but I expect a net positive overall.

Env: 13900K, P-cores at 5.6GHz and E-cores at 4.0GHz (both overclocked),
32 hw threads in total.
Test: run scripts/run-ci-build-wheel.sh and time the compile until step 20
completes.

```
env FLASHINFER_CI_PYTHON_VERSION=3.11 FLASHINFER_CI_TORCH_VERSION=2.3.1 FLASHINFER_CI_CUDA_VERSION=12.4 FLASHINFER_BUILD_VERSION=0.0.4 TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"
```

```
nvcc_threads=8 41.01s to step20 MAX_JOBS=16 <-- current default
nvcc_threads=2 41.21s to step20 MAX_JOBS=16
nvcc_threads=1 50.97s to step20 MAX_JOBS=16
nvcc_threads=4 40.83s to step20 MAX_JOBS=16
nvcc_threads=4 1m15s  to step20 MAX_JOBS=8
nvcc_threads=1 32s    to step20 MAX_JOBS=32 <-- fastest (PR)
nvcc_threads=2 38s    to step20 MAX_JOBS=32
```
 
Based on these timings, main now compiles faster with more parallel
jobs (processes) than with more nvcc threads per job.
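
For reference, a minimal sketch of the defaults before and after this change,
assuming the 32-thread machine used for the timings. The helper names
(`default_max_jobs`, `peak_compile_threads`) are hypothetical and not part of
setup.py, and the `or 1` fallback for an undetectable CPU count is an addition
of this sketch. The product of jobs and nvcc threads is only a rough upper
bound on concurrent compile threads, not a measured value.

```python
import os


def default_max_jobs(env=None, cpu_count=None, halve=False):
    """Resolve MAX_JOBS the way setup.py does (hypothetical helper).

    halve=True models the old default (cpu_count // 2),
    halve=False models the default after this PR (cpu_count).
    """
    env = os.environ if env is None else env
    # An existing MAX_JOBS in the environment is never overridden.
    if env.get("MAX_JOBS"):
        return int(env["MAX_JOBS"])
    # os.cpu_count() may return None; falling back to 1 is this sketch's choice.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cpus // 2 if halve else cpus)


def peak_compile_threads(max_jobs, nvcc_threads):
    """Rough upper bound: each job may run up to nvcc_threads threads."""
    return max_jobs * nvcc_threads


# Old defaults on a 32-thread machine: 16 jobs x 8 nvcc threads.
old = peak_compile_threads(default_max_jobs({}, 32, halve=True), 8)
# New defaults: 32 jobs x 1 nvcc thread.
new = peak_compile_threads(default_max_jobs({}, 32), 1)
print(old, new)
```

Setting `MAX_JOBS` in the environment before building still takes precedence
over the computed default, so machines with limited memory can lower the job
count without touching setup.py.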
diff --git a/python/setup.py b/python/setup.py
@@ -326,7 +326,7 @@ class NinjaBuildExtension(torch_cpp_ext.BuildExtension):
     def __init__(self, *args, **kwargs) -> None:
         # do not override env MAX_JOBS if already exists
         if not os.environ.get("MAX_JOBS"):
-            max_num_jobs_cores = max(1, os.cpu_count() // 2)
+            max_num_jobs_cores = max(1, os.cpu_count())
             os.environ["MAX_JOBS"] = str(max_num_jobs_cores)
 
         super().__init__(*args, **kwargs)
@@ -367,7 +367,7 @@ def __init__(self, *args, **kwargs) -> None:
                     "-O3",
                     "-std=c++17",
                     "--threads",
-                    "8",
+                    "1",
                     "-Xfatbin",
                     "-compress-all",
                 ],