Skip to content

Crash when enable MTP on KT 0.6.2 DeepSeekV4Flash #2009

@mrgaolei

Description

@mrgaolei

Reminder

  • I have read the above rules and searched the existing issues.

System Info

When I add MTP to startup script:

  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-moe-runner-backend auto \

It crash, but without that, everything OK.

This is full script:

export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B

numactl --interleave=all python -m sglang.launch_server \
  --served-model-name DeepSeek-V4-Flash \
  --host 0.0.0.0 --port 30000 \
  --model /mnt/nvme0/ai-models/DeepSeek-V4-Flash/ \
  --kt-weight-path /mnt/nvme0/ai-models/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 10 \
  --kt-cpuinfer 35 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 4096 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 1 \
  --context-length 32768 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 2048 \
  --max-prefill-tokens 2048 \
  --max-running-requests 2 \
  --watchdog-timeout 1200 \
  --disable-shared-experts-fusion \
  --trust-remote-code \
  --cuda-graph-bs 1 \
  --cuda-graph-max-bs 1 \
  --disable-radix-cache \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-moe-runner-backend auto \
  --skip-server-warmup

Reproduction

[2026-05-16 14:59:21] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)                                                                    /home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/fastapi/routing.py:120: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/     response = await f(request)                                                                                                                                     [2026-05-16 14:59:49] INFO:     172.17.0.2:44190 - "GET /v1/models HTTP/1.1" 200 OK                                                                               [2026-05-16 15:00:01] INFO:     172.17.0.2:44256 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                    Fatal Python error: Segmentation fault                                                                                                                                                                                                                                                                                              Thread 0x0000701261fff6c0 (most recent call first):                                                                                                                 File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once                                File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1012 in run
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000701dfbfff6c0 (most recent call first):
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 73 in _recv_msg
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 228 in _read_thread
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1012 in run
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000701df7ffe6c0 (most recent call first):
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 359 in wait
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 655 in wait
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000701e3bfff6c0 (most recent call first):
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 359 in wait
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 655 in wait
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000704ef8876740 (most recent call first):
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/torch/cuda/graphs.py", line 141 in replay
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 1074 in replay
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2541 in _forward_raw
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2456 in forward
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 456 in forward_batch_generation
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/speculative/eagle_worker_v2.py", line 757 in verify
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/speculative/eagle_worker_v2.py", line 697 in forward_batch_generation
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2405 in run_batch
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1176 in event_loop_overlap
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120 in decorate_context
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3245 in run_scheduler_process
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/home/aigao/miniconda3/envs/kt-kernel/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, charset_normalizer.cd, requests.packages.
charset_normalizer.md, requests.packages.chardet.md, requests.packages.charset_normalizer.cd, requests.packages.chardet.cd, psutil._psutil_linux, multidict._multi
dict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenl
ist, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg,
torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zmq.backend.cython._zmq, PIL._imaging, sentencepiece._sentencepiece, yaml._yaml, regex._regex
, markupsafe._speedups, cuda_utils, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, a
v.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.encparams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque,
av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video
.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.c
ontext, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, nump
y.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._generator, numpy.random._mt19937, numpy.random._p
hilox, numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, _cyutility, scipy._cyutility, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack,
 scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._batched_linalg, scipy.linalg._decomp_lu_cython, sci
py.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._s
parsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpacklib, scipy.sparse.linalg._pr
opack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy
.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx
, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.li
nalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial.
_hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation_cy, scipy.spatial.transform._rigid_transform_cy, scipy.optimize._direct, setproctitle.
_setproctitle, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, msgspec._core, grpc._cython.cygrpc, google._upb._mess
age, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.
bindings.cyruntime, cuda.bindings.runtime, cuda.tile._cext, _xxsubinterpreters, tilelang_cython_wrapper, __triton_launcher, uvloop.loop (total: 163)
!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x0000704ef864532f
  File "../../../../../libstdc++-v3/src/c++11/thread.cc", line 104, in execute_native_thread_routine
  File "./nptl/pthread_create.c", line 447, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 78, in clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

Others

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions