Commit 955dfc5
ci: faster compile/ci (#305)
The nvcc compilation profile has changed drastically now that
`gqa_group_size` is an input argument and no longer a template parameter.
This PR improves compile time by ~20% on my dev machine. Results may vary
across environments, but I expect a net positive overall.
Env: 13900K, P-cores at 5.6GHz and E-cores at 4.0GHz (both overclocked),
32 hw threads in total.
TEST: run scripts/run-ci-build-wheel.sh and time the compile until step 20
completes.
```
env FLASHINFER_CI_PYTHON_VERSION=3.11 FLASHINFER_CI_TORCH_VERSION=2.3.1 FLASHINFER_CI_CUDA_VERSION=12.4 FLASHINFER_BUILD_VERSION=0.0.4 TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"
```
```
nvcc_threads=8  41.01s to step20  MAX_JOBS=16  <-- current default
nvcc_threads=2  41.21s to step20  MAX_JOBS=16
nvcc_threads=1  50.97s to step20  MAX_JOBS=16
nvcc_threads=4  40.83s to step20  MAX_JOBS=16
nvcc_threads=4  1m15s  to step20  MAX_JOBS=8
nvcc_threads=1  32s    to step20  MAX_JOBS=32  <-- fastest (PR)
nvcc_threads=2  38s    to step20  MAX_JOBS=32
```
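To make the comparison easier to read, here is a minimal sketch that recomputes the relative improvement from the timings in the table above. The names `RUNS`, `HW_THREADS` and `reduction_vs` are made up for illustration and are not part of the repo. The `peak` column assumes every job runs nvcc with its full thread count, which is only an upper bound.

```python
# Timings copied from the table above: (nvcc_threads, MAX_JOBS, seconds to step 20).
RUNS = [
    (8, 16, 41.01),  # current default
    (2, 16, 41.21),
    (1, 16, 50.97),
    (4, 16, 40.83),
    (4, 8, 75.0),    # 1m15s
    (1, 32, 32.0),   # fastest (PR)
    (2, 32, 38.0),
]

HW_THREADS = 32  # from the Env line


def reduction_vs(baseline_s, candidate_s):
    """Fractional reduction in wall time relative to the baseline."""
    return (baseline_s - candidate_s) / baseline_s


# Baseline is the current default: nvcc_threads=8, MAX_JOBS=16.
baseline = next(s for t, j, s in RUNS if (t, j) == (8, 16))
fastest = min(RUNS, key=lambda r: r[2])

for threads, jobs, secs in sorted(RUNS, key=lambda r: r[2]):
    # jobs * threads bounds the number of concurrent nvcc worker threads,
    # assuming every job uses its full thread count (an assumption).
    peak = jobs * threads
    print(f"jobs={jobs:2d} threads={threads} peak={peak:3d}/{HW_THREADS} "
          f"time={secs:6.2f}s reduction={reduction_vs(baseline, secs):+.1%}")
```

On these numbers the fastest configuration (nvcc_threads=1, MAX_JOBS=32) takes roughly 22% less time than the current default, which is in line with the ~20% quoted above.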
Based on the tests, main now favors processes/jobs over threads for nvcc.

1 parent c507156 commit 955dfc5
1 file changed, +2 -2 lines changed (one line replaced at line 329 and one at line 370)