I tried to get this running by following the doc below, with the assistance of deepseek4 pro, but had no luck. This is the doc I initially gave it to follow: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepSeek-V4-Flash.md
I also found this issue, which suggests it might be possible:
#1978
Here is the AI summary:
Environment: 4× RTX 3090 (SM_86), Threadripper 3970X (Zen 2, AVX2 only, no AVX-512/AMX), 256GB RAM, CUDA 13.2, ktransformers + sglang-kt from source.
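A quick way to confirm the ISA limitation before launching is to compare the CPU's advertised flags against the three the MXFP4 backend reports as required. This is a minimal sketch, assuming a Linux host and that the flag names match how `/proc/cpuinfo` spells them (`avx512f`, `avx512bw`, `avx512_bf16`); the helper names are hypothetical and are not part of ktransformers.

```python
# Hypothetical helper: not part of ktransformers or sglang-kt.
# Flag names follow the Linux /proc/cpuinfo spelling (an assumption about the host OS).
REQUIRED_MXFP4_FLAGS = {"avx512f", "avx512bw", "avx512_bf16"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def missing_mxfp4_flags(cpuinfo_text):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(REQUIRED_MXFP4_FLAGS - parse_cpu_flags(cpuinfo_text))


# Example with an AVX2-only flags line (abbreviated, for illustration only):
sample = "flags\t\t: fpu sse4_2 avx avx2 fma"
print(missing_mxfp4_flags(sample))
```

On a real machine, pass `open("/proc/cpuinfo").read()` instead of the sample string. An empty list means all three flags are present.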
What was tried and what failed:
- --kt-method MXFP4 (as per the tutorial): NativeMoEWrapper.init raises "MXFP4 backend not available. Required ISA: AVX512F + AVX512BW + AVX512_BF16". Zen 2 has no AVX-512.
- Converted MXFP4 experts → RAWINT4 (custom GPU-based dequant → requant script). 43 layers, 134GB output. Wrote weights in CompressedSafeTensorLoader format (model.layers.{L}.mlp.experts.{E}.{proj}_proj.weight_packed keys). Weights load correctly via AVX2_RAW_INT4_MOE_TP kernel — all 43 layers initialized successfully.
- --kt-method RAWINT4 + --kt-weight-path pointing to converted weights: CPU expert path works. But GPU MXFP4 expert wrap is gated by --kt-method == "MXFP4", so GPU experts fall through to FP8 path, which fails on MXFP4-packed shapes (fused_experts_impl: Hidden size mismatch). Fixed by setting SGLANG_V4_USE_TRITON_KERNELS=1.
- swiglu_limit conflict: Setting SGLANG_DSV4_2604_SUBMODE=2604B (needed by GPU MXFP4 wrap → sets swiglu_limit=10.0) crashes RAWINT4 CPU backend (rejects non-zero swiglu_limit when method != MXFP4). Fixed by patching kt_ep_wrapper to pass swiglu_limit=0.0 when method is not MXFP4.
- Server starts successfully — model loads, CUDA graphs capture, "The server is fired up and ready to roll!".
- First inference request crashes in v4_triton_kernel.py:_decode_sparse_attention_fp8_kernel: Triton Error [CUDA]: device-side assert triggered. The kernel loads from fp8 KV cache pointers and fails because the actual data is fp8e4m3fn format but on SM_86 Triton can't process fp8e4nv natively.
- Tried an fp8e4nv → fp8e5 dtype hack in triton/language/core.py (patched float8e4nv = dtype('fp8e5')). This lets Triton compile the fp8 kernels, but at runtime the decode kernel misinterprets the fp8e4m3fn data as fp8e5 (different bit layout: E4M3 vs E5M2), producing corrupted values that trigger CUDA asserts.
- Tried --kv-cache-dtype bfloat16 to bypass fp8 entirely. The V4 model hardcodes assert kv_cache_dtype in ["fp8_e4m3"] in server_args. Removing that assertion led to assert kv_size == 1: the compressed attention backend requires one-byte-per-element fp8 storage, so bf16 (2 bytes per element) isn't supported.
- Tried --nsa-prefill-backend tilelang --nsa-decode-backend tilelang. The tilelang backend only handles the NSA index computation (prefill). The decode dispatch (_v4_triton_decode_dispatch) is hardcoded to call decode_sparse_attention_triton regardless of backend setting.
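To illustrate why the fp8e4nv → fp8e5 dtype hack in the list above corrupts data, the sketch below decodes the same byte under both layouts in plain Python. It assumes the standard definitions: E4M3FN has 4 exponent bits, 3 mantissa bits, bias 7, no infinities and a single NaN pattern per sign; E5M2 has 5 exponent bits, 2 mantissa bits, bias 15 and IEEE-style inf/NaN. The function names are illustrative and are not taken from Triton or sglang.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one fp8 byte given its exponent/mantissa split and exponent bias."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    if ieee_specials and exp == max_exp:
        # E5M2: all-ones exponent encodes inf (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == max_exp and man == (1 << man_bits) - 1:
        # E4M3FN: only the all-ones exponent and mantissa pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3fn(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


for b in (0x38, 0x7E):
    print(hex(b), decode_e4m3fn(b), decode_e5m2(b))
```

The byte `0x38` is 1.0 in E4M3FN but 0.5 in E5M2, and `0x7E` (448.0, the largest finite E4M3FN value) reads as NaN in E5M2, so relabeling the dtype without converting the stored bytes changes the values the kernel sees.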
Summary: The CPU expert path works on AVX2 via RAWINT4. The hard blocker is V4-Flash's NSA attention decode kernel, which requires fp8e4nv — unavailable on SM_86.
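For anyone checking their own hardware against this blocker, here is a small sketch of the capability test. It assumes that Triton's fp8e4nv type needs compute capability 8.9 or newer (Ada/Hopper), which matches the failure described on SM_86; the threshold and the function names are assumptions for illustration, not values read from the sglang or ktransformers source. PyTorch is only imported if it is installed.

```python
def supports_fp8e4nv(major, minor):
    """Assumed rule: fp8e4nv needs compute capability >= 8.9."""
    return (major, minor) >= (8, 9)


def local_gpu_report():
    """Return [(device_index, (major, minor), supported)] for visible CUDA GPUs.

    Returns an empty list when PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    report = []
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        report.append((i, cap, supports_fp8e4nv(*cap)))
    return report


print(supports_fp8e4nv(8, 6))  # RTX 3090 class (SM_86)
print(local_gpu_report())
```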
Others
No response