High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GeForce / RTX PRO (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). 200 tok/s decode on Qwen3.6-35B-A3B-NVFP4 MoE (RTX 5090).
Updated May 23, 2026 · CUDA
Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink): config, benchmark sweep results, and a custom chat template with thinking mode off by default.
Hub for ongoing Qwen inference benchmarks on NVIDIA Blackwell. Indexes all studies, hosts the rolling SOTA leaderboard, points to the toolchain.
Systematic 24-hour benchmark study of Qwen3.6-27B inference on dual NVIDIA RTX PRO 6000 Blackwell SM120 (TP=2). 8 experiments comparing repne/vllm fork vs upstream vLLM across FP8/BF16/NVFP4/Q8_0 quants and MTP/DFlash speculative decoding. Peak: 2,083 tok/s at c=32. Quality: KLD vs BF16 = 0.0018 (noise floor).
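QuantLoom·量梭's ambition has never been merely to pop up a few signals on your phone. The ultimate product this loom is meant to weave for you is the right to freely summon the RTX Pro 6000, the "Obsidian Machine". It is the black obelisk lying in your case, its tens of thousands of cores like a sea of stars at night. It is the material foundation for training and running large models locally, weaving a real-time panorama of market-wide volume, and tracing back ten years of capital-flow fingerprints. It used to land only in supercomputing centers, top quant funds, and mysterious mining farms. Every bolt of profitable brocade QuantLoom weaves adds one more gold thread to this black altar. When the gold threads gather into a cable, the Obsidian Machine will tear open a rift in the void shelf and descend into your formation. From then on, you own a personal temple of compute.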
Stress-validation of Qwen3.6-27B inference configurations on dual RTX PRO 6000 Blackwell. 5 configs × 4 phases (gates, throughput matrix, HumanEval, MBPP) = 2,105 hard coding problems, zero crashes. Headline: FP8+MTP=3 wins HumanEval (79.3%), BF16+DFlash wins MBPP (89.5%). MTP=5 was dominated on correctness despite faster raw tok/s.
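The 24-hour benchmark study listed above reports quality as "KLD vs BF16". The listing does not say how that figure is computed, so the following is only a minimal sketch of one common approach: average the per-token KL divergence between the next-token distribution of a reference (BF16) run and that of a quantized run. The function names, the KL(reference || test) direction, the natural-log base, and the plain per-position mean are all assumptions made for illustration, not the study's actual method.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is clamped to eps to avoid division by zero."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def mean_token_kld(ref_logits, test_logits):
    """Hypothetical helper: mean KL(ref || test) over token positions.

    ref_logits and test_logits are lists of per-position logit vectors,
    e.g. from a BF16 run and a quantized run on the same prompt (assumed).
    """
    if len(ref_logits) != len(test_logits) or not ref_logits:
        raise ValueError("need the same non-zero number of positions")
    klds = [
        kl_divergence(softmax(r), softmax(t))
        for r, t in zip(ref_logits, test_logits)
    ]
    return sum(klds) / len(klds)
```

Under these assumptions, identical logits give a divergence of zero, and any quantization-induced shift in the distribution gives a positive value, which is why a small number is read as "close to the reference".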