GLM-5.2-NVFP4-REAP-469B serving on SM120 (4× RTX PRO 6000 Blackwell) — one-command vLLM launch recipe, 250K context, DeepSeek Sparse Attention + MTP speculative decode
Updated Jun 19, 2026 - Shell
From-scratch C++/CUDA inference engine for the NVIDIA RTX 5090 (sm_120a) — the best single-GPU backend for agentic AI: tool calling, long-context loops, reasoning and concurrent sub-agents on top of the fastest single-stream decode on the 5090 (beats llama.cpp, at-or-ahead of vLLM on NVFP4). 100% written by Claude Code.
NVFP4 inference on Blackwell GeForce (RTX 5090/5080/5070 Ti/RTX PRO 6000) — SM120 patches for vLLM + FlashInfer + CUTLASS. 175 tok/s on Qwen3.6-35B MoE.
Reproducible recipe: serve abliterated Gemma-4-12B (gemma4_unified) at 50-118 tok/s on no-NVLink Blackwell (SM120) via vLLM nightly + ModelOpt FP8/NVFP4 + MTP spec-decode.
Lna-Lab production pipeline: GGUF -> modelopt-format NVFP4 + working MTP head for vLLM on RTX PRO 6000 Blackwell (SM120). Stages 2 (NVFP4) and 3 (MTP graft) are Lna-Lab originals; stage 1 (GGUF->bf16) reuses li-yifei/gguf-to-nvfp4.
Rust-native MoE inference runtime with custom CUDA kernels for Blackwell GPUs. Includes DFlash speculative decoding, multi-tier Engram memory, and entropy-adaptive routing. Targets Qwen3.5-35B-A3B on a single RTX 5060 Ti 16GB.
Optimized vLLM deployment for NVIDIA Blackwell (RTX 5090) on Linux Kernel 6.14. Resolves SM_120 kernel incompatibilities, P2P deadlocks, and memory fragmentation for high-performance LLM inference.
Production-grade FlashAttention FP8 e4m3 forward kernel for NVIDIA Blackwell consumer GPUs (sm_120a, e.g. RTX PRO 6000). 647–652 TFLOPS at hd=128, sl=8192. Multi-kernel dispatcher, C library with Go and Python bindings.
Downstream llama.cpp TurboQuant CUDA fork with adaptive KV layout selection for long-context inference on consumer Blackwell GPUs.
Pre-built onnxruntime-gpu 1.24.1 with Blackwell sm_120 CUDA kernels (RTX 5090/5080/5070)
llama.cpp fork with additional SOTA quants and improved performance
Complete installation guide for ComfyUI-Hunyuan3DWrapper on NVIDIA Blackwell GPUs (RTX 5070 Ti, 5080, 5090). Covers manual compilation of custom_rasterizer for the sm_120 / compute_120 architecture.
Experimental xFormers + MSLK builds validated on NVIDIA Blackwell (SM120) GPUs, CUDA 12.8, PyTorch 2.11, and RTX 5070 Laptop hardware.
Serve an abliterated Gemma-4-12B at high speeds on Blackwell GPUs without NVLink using vLLM, FP8 quantization, and MTP speculative decoding.