[Help] Possible to run deepseek4-flash on ampere (3090) + threadripper 3970x (no AMX/AVX512)? #1999

I tried to get it going by following this doc, with the assistance of deepseek4 pro, but had no luck: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepSeek-V4-Flash.md

I also found this issue, which suggests it might be possible:
https://github.com/kvcache-ai/ktransformers/issues/1978

Here is the AI summary of what was tried:


Environment: 4× RTX 3090 (SM_86), Threadripper 3970X (Zen 2, AVX2 only, no AVX-512/AMX), 256GB RAM, CUDA 13.2, ktransformers + sglang-kt from source.
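
For anyone wanting to confirm the same CPU limitation on their own box, here is a minimal sketch that checks for the three ISA extensions named in the MXFP4 error message from step 1. The helper names `cpu_flags` and `missing_flags` are made up for illustration and are not part of ktransformers; the flag spellings are the ones Linux uses in `/proc/cpuinfo` on x86.

```python
from pathlib import Path

# ISA extensions from the MXFP4 error message, spelled as in /proc/cpuinfo (x86 Linux).
MXFP4_REQUIRED = ("avx512f", "avx512bw", "avx512_bf16")


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flag set of the first CPU found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line (e.g. non-x86 or non-Linux)


def missing_flags(flags: set[str], required=MXFP4_REQUIRED) -> list[str]:
    """List the required flags that the CPU does not report, in order."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        missing = missing_flags(cpu_flags(path.read_text()))
        print("missing for MXFP4:", missing or "none")
```

On a Zen 2 part such as the 3970X, all three flags should show up as missing, while `avx2` is present.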
What was tried and what failed:
1. --kt-method MXFP4 (as per the tutorial): NativeMoEWrapper.__init__ raises "MXFP4 backend not available. Required ISA: AVX512F + AVX512BW + AVX512_BF16". Zen 2 has no AVX-512.
2. Converted MXFP4 experts → RAWINT4 (custom GPU-based dequant → requant script). 43 layers, 134GB output. Wrote weights in CompressedSafeTensorLoader format (model.layers.{L}.mlp.experts.{E}.{proj}_proj.weight_packed keys). Weights load correctly via AVX2_RAW_INT4_MOE_TP kernel — all 43 layers initialized successfully.
3. --kt-method RAWINT4 + --kt-weight-path pointing to converted weights: CPU expert path works. But GPU MXFP4 expert wrap is gated by --kt-method == "MXFP4", so GPU experts fall through to FP8 path, which fails on MXFP4-packed shapes (fused_experts_impl: Hidden size mismatch). Fixed by setting SGLANG_V4_USE_TRITON_KERNELS=1.
4. swiglu_limit conflict: Setting SGLANG_DSV4_2604_SUBMODE=2604B (needed by GPU MXFP4 wrap → sets swiglu_limit=10.0) crashes RAWINT4 CPU backend (rejects non-zero swiglu_limit when method != MXFP4). Fixed by patching kt_ep_wrapper to pass swiglu_limit=0.0 when method is not MXFP4.
5. Server starts successfully — model loads, CUDA graphs capture, "The server is fired up and ready to roll!".
6. First inference request crashes in v4_triton_kernel.py:_decode_sparse_attention_fp8_kernel: Triton Error [CUDA]: device-side assert triggered. The kernel loads from fp8 KV cache pointers and fails because the actual data is fp8e4m3fn format but on SM_86 Triton can't process fp8e4nv natively.
7. Tried fp8e4nv → fp8e5 dtype hack in triton/language/core.py (patched float8e4nv = dtype('fp8e5')). This allows Triton to compile the fp8 kernels, but at runtime the decode kernel misinterprets the fp8e4m3fn data as fp8e5 (different bit layout: E4M3 has 4 exponent and 3 mantissa bits, E5M2 has 5 and 2), producing corrupted values that trigger CUDA asserts.
8. Tried --kv-cache-dtype bfloat16 to bypass fp8 entirely. V4 model hardcodes assert kv_cache_dtype in ["fp8_e4m3"] in server_args. Removed that assertion, then hit assert kv_size == 1 — the compressed attention backend requires fp8 byte-per-element storage; bf16 (2 bytes) isn't supported.
9. Tried --nsa-prefill-backend tilelang --nsa-decode-backend tilelang. The tilelang backend only handles the NSA index computation (prefill). The decode dispatch (_v4_triton_decode_dispatch) is hardcoded to call decode_sparse_attention_triton regardless of backend setting.
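
To illustrate why the dtype hack in step 7 cannot work, here is a small pure-Python sketch that decodes the same byte under both fp8 layouts. The decoders follow the usual E4M3FN (bias 7, no infinities) and E5M2 (bias 15, IEEE-style inf/NaN) definitions; the function names are hypothetical and this is not the code Triton or sglang actually runs.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one byte as fp8 E4M3FN: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:      # the only NaN pattern; E4M3FN has no inf
        return float("nan")
    if exp == 0:                       # subnormal
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)


def decode_e5m2(byte: int) -> float:
    """Decode one byte as fp8 E5M2: 1 sign, 5 exponent, 2 mantissa bits, bias 15."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x3
    if exp == 0x1F:                    # IEEE-style inf / NaN
        return float("nan") if man else sign * float("inf")
    if exp == 0:                       # subnormal
        return sign * (man / 4) * 2.0 ** -14
    return sign * (1 + man / 4) * 2.0 ** (exp - 15)


# Same stored bytes, two interpretations.
for b in (0x38, 0x48, 0x7E):
    print(f"0x{b:02X}: e4m3fn={decode_e4m3fn(b)}  e5m2={decode_e5m2(b)}")
```

The byte `0x38` is 1.0 as E4M3FN but 0.5 as E5M2, and `0x7E` (448.0, the largest finite E4M3FN value) decodes as NaN under E5M2, so reinterpreting the cache without converting it produces wrong values.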
Summary: The CPU expert path works on AVX2 via RAWINT4. The hard blocker is V4-Flash's NSA attention decode kernel, which requires fp8e4nv — unavailable on SM_86.
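
The blocker in the summary can be expressed as a compute-capability check. This is only a sketch: the 8.9 threshold is an assumption based on Triton rejecting fp8e4nv on CUDA architectures below SM_89, and `supports_fp8e4nv` is a hypothetical helper, not an sglang or ktransformers function.

```python
# Assumption: Triton's fp8e4nv dtype needs compute capability 8.9 or newer.
FP8E4NV_MIN_CAPABILITY = (8, 9)


def supports_fp8e4nv(capability: tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability meets the assumed minimum."""
    return tuple(capability) >= FP8E4NV_MIN_CAPABILITY


# Compute capabilities of a few well-known GPUs.
KNOWN_GPUS = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

for name, cap in KNOWN_GPUS.items():
    status = "ok" if supports_fp8e4nv(cap) else "no native fp8e4nv"
    print(f"{name} (SM_{cap[0]}{cap[1]}): {status}")
```

On a live system the tuple can come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.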

