Refactor JIT compilation (+NVRTC support) #91

Closed
lucifer1004 wants to merge 14 commits into deepseek-ai:main from lucifer1004:nvrtc

@lucifer1004
Collaborator

@lucifer1004 lucifer1004 commented Apr 22, 2025

With this PR, JIT compilation time is reduced by ~60% when using NVCC, ~80% when using NVRTC without precompiled headers (PCH), and ~90% when using NVRTC with PCH.

Benchmark on Xeon(R) Platinum 8480C + H100 HBM:

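| m | n | k | Compile time (s): baseline | Compile time (s): NVCC | Compile time (s): NVRTC | Compile time (s): NVRTC+PCH | TFLOPS: baseline | TFLOPS: NVCC | TFLOPS: NVRTC | TFLOPS: NVRTC+PCH |
|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 2112 | 7168 | 5.95 | 2.12 | 0.96 | 0.35 | 154 | 154 | 153 | 153 |
| 64 | 24576 | 1536 | 6.43 | 2.55 | 1.37 | 0.78 | 230 | 230 | 231 | 230 |
| 64 | 32768 | 512 | 6.06 | 2.20 | 0.99 | 0.40 | 180 | 180 | 180 | 180 |
| 64 | 7168 | 16384 | 6.11 | 2.25 | 1.01 | 0.43 | 289 | 290 | 289 | 289 |
| 64 | 4096 | 7168 | 5.98 | 2.15 | 0.97 | 0.37 | 207 | 207 | 206 | 206 |
| 64 | 7168 | 2048 | 6.03 | 2.19 | 1.01 | 0.42 | 188 | 189 | 187 | 188 |
| 128 | 2112 | 7168 | 5.98 | 2.18 | 0.96 | 0.37 | 283 | 280 | 281 | 282 |
| 128 | 24576 | 1536 | 6.38 | 2.56 | 1.39 | 0.78 | 440 | 435 | 434 | 434 |
| 128 | 32768 | 512 | 6.00 | 2.18 | 0.99 | 0.40 | 334 | 336 | 330 | 332 |
| 128 | 7168 | 16384 | 6.24 | 2.40 | 1.18 | 0.59 | 534 | 534 | 534 | 533 |
| 128 | 4096 | 7168 | 6.04 | 2.21 | 1.03 | 0.43 | 370 | 371 | 370 | 371 |
| 128 | 7168 | 2048 | 6.19 | 2.38 | 1.19 | 0.62 | 332 | 334 | 331 | 332 |
| 4096 | 2112 | 7168 | 6.09 | 2.30 | 1.08 | 0.49 | 1053 | 1058 | 1113 | 1114 |
| 4096 | 24576 | 1536 | 6.14 | 2.33 | 1.10 | 0.50 | 1288 | 1286 | 1290 | 1289 |
| 4096 | 32768 | 512 | 6.23 | 2.41 | 1.21 | 0.61 | 913 | 912 | 912 | 911 |
| 4096 | 7168 | 16384 | 6.31 | 2.51 | 1.27 | 0.68 | 1524 | 1524 | 1459 | 1458 |
| 4096 | 4096 | 7168 | 6.28 | 2.47 | 1.27 | 0.68 | 1446 | 1445 | 1394 | 1396 |
| 4096 | 7168 | 2048 | 6.18 | 2.40 | 1.21 | 0.61 | 1240 | 1243 | 1234 | 1237 |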
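
As a rough cross-check of the headline percentages, the compile-time columns of the table can be aggregated with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, while the helper name `reduction_pct` and the choice to compare summed times across all 18 shapes are assumptions made for this example, not part of the PR.

```python
# Compile times in seconds, copied from the benchmark table
# (one entry per (m, n, k) row, in table order).
baseline = [5.95, 6.43, 6.06, 6.11, 5.98, 6.03, 5.98, 6.38, 6.00,
            6.24, 6.04, 6.19, 6.09, 6.14, 6.23, 6.31, 6.28, 6.18]
nvcc = [2.12, 2.55, 2.20, 2.25, 2.15, 2.19, 2.18, 2.56, 2.18,
        2.40, 2.21, 2.38, 2.30, 2.33, 2.41, 2.51, 2.47, 2.40]
nvrtc = [0.96, 1.37, 0.99, 1.01, 0.97, 1.01, 0.96, 1.39, 0.99,
         1.18, 1.03, 1.19, 1.08, 1.10, 1.21, 1.27, 1.27, 1.21]
nvrtc_pch = [0.35, 0.78, 0.40, 0.43, 0.37, 0.42, 0.37, 0.78, 0.40,
             0.59, 0.43, 0.62, 0.49, 0.50, 0.61, 0.68, 0.68, 0.61]


def reduction_pct(new, old):
    """Percentage reduction in total compile time relative to `old`.

    Hypothetical helper for this example only.
    """
    return 100.0 * (1.0 - sum(new) / sum(old))


reductions = {
    "NVCC": reduction_pct(nvcc, baseline),
    "NVRTC": reduction_pct(nvrtc, baseline),
    "NVRTC+PCH": reduction_pct(nvrtc_pch, baseline),
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% less compile time than baseline")
```

Aggregated this way, the three variants land slightly above the rounded ~60%, ~80%, and ~90% figures quoted in the description.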

Note that there is some performance drop when using NVRTC, due to a known NVRTC bug that leads to extra instructions (although in the m=4096, n=2112, k=7168 case the NVRTC version was faster, which is a bit strange). NVCC is therefore kept as the default compiler for now, while NVRTC can be enabled with the extra environment variable `DG_JIT_USE_NVRTC`.
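
For readers who want to try the opt-in path, one way to toggle the variable from Python is sketched below. Only the variable name `DG_JIT_USE_NVRTC` comes from this PR; the value `"1"` is assumed to be what the JIT treats as enabled, and the helper `nvrtc_env` is hypothetical.

```python
import os


def nvrtc_env(enable, base=None):
    """Return a copy of an environment mapping with the NVRTC switch set or cleared.

    Hypothetical helper. The value "1" is an assumption about what the JIT
    accepts as "enabled"; the PR only names the variable itself.
    """
    env = dict(os.environ if base is None else base)
    if enable:
        env["DG_JIT_USE_NVRTC"] = "1"
    else:
        # Removing the variable falls back to the default compiler (NVCC).
        env.pop("DG_JIT_USE_NVRTC", None)
    return env


# Build an environment for a child process that should compile with NVRTC.
env = nvrtc_env(True)
print("DG_JIT_USE_NVRTC =", env["DG_JIT_USE_NVRTC"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching a benchmark or test script. Setting the variable in `os.environ` directly should work as well, provided it happens before the JIT first reads it.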

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
@LyricZhao
Collaborator

Thanks for this contribution, this is huge! Will take some time to review.

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
feat: add compat for older drivers and Windows
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
@lucifer1004

This comment was marked as outdated.

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
@lucifer1004
Collaborator Author

It turns out that even older drivers work with a newer CUDA container (my previous attempt used a newer CUDA installation from conda with an old driver), so I reverted to the original implementation of getting kernels.

@FlamingoPg
Contributor

This is excellent work. May I ask if there are any plans regarding when it will be merged? @LyricZhao

@LyricZhao
Collaborator

> This is excellent work. May I ask if there are any plans regarding when it will be merged? @LyricZhao

I plan to merge after the May 1 (Labor Day) holiday.

@lucifer1004
Collaborator Author

Moved to #94

@lucifer1004 lucifer1004 closed this May 6, 2025
LyricZhao added a commit that referenced this pull request Apr 16, 2026
* Working example

* enhancement: apply advance tech

* Fix: fix the swizzling limiting problem

* Fix: fix the test problem

* fix: diff the names of tma_copy

* fix: fix test scripts

* refa: clean up code

* refa: use template tma copy

* feat: run all tests

* fix: cleanup code

* fix: cleanup code

* fix: add assert statement

* fix: space elimination

* Some code lints

* Fix: remove one useless inst

* Remove the parallel test script

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>