
Commit 955dfc5

ci: faster compile/ci (#305)
The nvcc compilation profile has changed drastically now that `gqa_group_size` is an input argument and no longer a template parameter. This PR improves compile time by ~20% on my dev machine. Results may vary across environments, but I expect a net positive overall.

Env: 13900K, P-cores at 5.6GHz and E-cores at 4.0GHz (both overclocked), 32 hardware threads in total.

TEST: run scripts/run-ci-build-wheel.sh and time the compile until step 20 completes.

```
env FLASHINFER_CI_PYTHON_VERSION=3.11 FLASHINFER_CI_TORCH_VERSION=2.3.1 FLASHINFER_CI_CUDA_VERSION=12.4 FLASHINFER_BUILD_VERSION=0.0.4 TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"
```

```
nvcc_threads=8  41.01s to step20  MAX_JOBS=16  <-- current default
nvcc_threads=2  41.21s to step20  MAX_JOBS=16
nvcc_threads=1  50.97s to step20  MAX_JOBS=16
nvcc_threads=4  40.83s to step20  MAX_JOBS=16
nvcc_threads=4  1m15s  to step20  MAX_JOBS=8
nvcc_threads=1  32s    to step20  MAX_JOBS=32  <-- fastest (PR)
nvcc_threads=2  38s    to step20  MAX_JOBS=32
```

Based on the tests, main now favors processes/jobs over threads for nvcc.
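As a quick check on the ~20% figure, the timings above can be compared directly. This is a minimal sketch that only restates the reported numbers; the dictionary layout and variable names are illustrative and are not part of the repository.

```python
# Wall-clock times to step 20, copied from the table above (seconds).
# Keys are (nvcc_threads, MAX_JOBS).
timings = {
    (8, 16): 41.01,  # default before this PR
    (2, 16): 41.21,
    (1, 16): 50.97,
    (4, 16): 40.83,
    (4, 8): 75.0,    # reported as 1m15s
    (1, 32): 32.0,   # this PR
    (2, 32): 38.0,
}

baseline = timings[(8, 16)]
candidate = timings[(1, 32)]

# Fraction of build time saved relative to the previous default.
saved = (baseline - candidate) / baseline
print(f"time saved: {saved:.1%}")

# Fastest configuration across all measured runs.
fastest = min(timings, key=timings.get)
print(f"fastest: nvcc_threads={fastest[0]}, MAX_JOBS={fastest[1]}")
```

On these numbers the saving comes out at roughly 22%, in line with the ~20% quoted, and the fastest measured configuration is the one this PR adopts.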
1 parent c507156 commit 955dfc5

1 file changed: +2 -2 lines changed

python/setup.py

Lines changed: 2 additions & 2 deletions
@@ -326,7 +326,7 @@ class NinjaBuildExtension(torch_cpp_ext.BuildExtension):
     def __init__(self, *args, **kwargs) -> None:
         # do not override env MAX_JOBS if already exists
         if not os.environ.get("MAX_JOBS"):
-            max_num_jobs_cores = max(1, os.cpu_count() // 2)
+            max_num_jobs_cores = max(1, os.cpu_count())
             os.environ["MAX_JOBS"] = str(max_num_jobs_cores)
 
         super().__init__(*args, **kwargs)
@@ -367,7 +367,7 @@ def __init__(self, *args, **kwargs) -> None:
                 "-O3",
                 "-std=c++17",
                 "--threads",
-                "8",
+                "1",
                 "-Xfatbin",
                 "-compress-all",
             ],
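The first hunk sets the default for `MAX_JOBS` while leaving any value the user has already exported untouched. The sketch below isolates that logic so it can be exercised without running a build. `resolve_max_jobs` is a hypothetical helper, not a function in `python/setup.py`, and the `or 1` fallback is an added assumption to cover `os.cpu_count()` returning `None`, which the Python documentation allows when the count cannot be determined.

```python
import os


def resolve_max_jobs(environ, cpu_count=os.cpu_count):
    """Return the MAX_JOBS value to use, as a string.

    Hypothetical helper mirroring the logic in the diff above:
    an existing non-empty MAX_JOBS wins, otherwise use one job
    per logical CPU.
    """
    existing = environ.get("MAX_JOBS")
    if existing:
        # Do not override a value the user has already set.
        return existing
    # Assumption: fall back to 1 if the CPU count is unavailable (None).
    return str(max(1, cpu_count() or 1))


if __name__ == "__main__":
    print(resolve_max_jobs(os.environ))
```

Passing the environment and the CPU-count function as arguments keeps the helper free of side effects, so the override and fallback paths can both be checked with plain dictionaries.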
