Speculative/EAGLE - Supported? #2007

@usrlocalben

Reminder

  • I have read the above rules and searched the existing issues.

System Info

ktransformers @ bb15fdf
kvcache-ai/sglang @ 2763727

Reproduction

[2026-05-15 19:00:54] Load weight end. elapsed=152.31 s, type=KimiK25ForConditionalGeneration, dtype=torch.bfloat16, avail mem=52.58 GB, mem usage=41.59 GB.
[2026-05-15 19:00:54] Using KV cache dtype: torch.bfloat16
[2026-05-15 19:00:54] KV Cache is allocated. #tokens: 260000, KV size: 17.02 GB
[2026-05-15 19:00:54] Memory pool end. avail mem=35.54 GB
[2026-05-15 19:00:54] Capture cuda graph begin. This can take up to several minutes. avail mem=35.10 GB
[2026-05-15 19:00:54] Capture cuda graph bs [1]
[2026-05-15 19:00:54] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3207, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 367, in __init__
    self.init_model_worker()
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 563, in init_model_worker
    self.init_tp_model_worker()
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 521, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 418, in __init__
    self.initialize()
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 633, in initialize
    self.init_device_graphs()
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2211, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kt-test/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 562, in __init__
    self.model_runner.model.set_eagle3_layers_to_capture()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kt-test/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1964, in __getattr__
    raise AttributeError(
AttributeError: 'KimiK25ForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'

[2026-05-15 19:00:54] Received sigquit from a child process. It usually means the child failed.
test.sh: line 38: 1787672 Killed                  python -m sglang.launch_server --host 0.0.0.0 --port 31245 --model /mnt/aux/aux/model/Kimi-K2.6/moonshotai --kt-weight-path /mnt/aux/aux/model/Kimi-K2.6/moonshotai --kt-cpuinfer 96 --kt-threadpool-count 8 --kt-num-gpu-experts 12 --kt-method RAWINT4 --kt-max-deferred-experts-per-token 2 --kt-gpu-prefill-token-threshold 1200 --kt-enable-dynamic-expert-update --attention-backend flashinfer --trust-remote-code --mem-fraction-static 0.94 --context-length 200000 --max-running-requests 1 --prefill-max-requests 1 --max-total-tokens 200000 --enable-mixed-chunk --served-model-name Kimi-K2.6 --enable-p2p-check --disable-shared-experts-fusion --chunked-prefill-size 32768 --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --enable-hierarchical-cache --hicache-ratio 2 --hicache-size 0 --skip-server-warmup --speculative-algorithm EAGLE3 --speculative-draft-model-path /mnt/aux/aux/model/Kimi-K26-eagle3/AQ-MedAI --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --sleep-on-idle

Others

I'd like to try EAGLE3 with Kimi, e.g. the EAGLE3 draft model provided by AQ-MedAI.

Their model card indicates it should work with sglang 0.5.10 (possibly other versions too; it doesn't say whether that is a minimum or a maximum).

Expected: Works.
Observed: Does not work. (See trace above)

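I tried transformers 5.8.0, 5.7.0, and 5.6.2, although the problem appears to be in the model implementation in sglang/srt/models/kimi_k25: per the trace, cuda_graph_runner.py calls set_eagle3_layers_to_capture() on the model, and KimiK25ForConditionalGeneration has no such attribute.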
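
To illustrate what I think is happening, here is a minimal sketch in plain Python. The two classes and the supports_eagle3 helper are hypothetical stand-ins, not the real sglang classes. Only the method name set_eagle3_layers_to_capture comes from the trace, and its signature here is an assumption.

```python
class Eagle3CapableModel:
    """Hypothetical stand-in for a model class that implements the EAGLE3 hook."""

    def set_eagle3_layers_to_capture(self):
        # Assumed behavior: mark that hidden states should be captured
        # for the draft model. The real implementation may differ.
        self.capture_aux_hidden_states = True


class KimiLikeModel:
    """Hypothetical stand-in for a model class that lacks the hook."""


def supports_eagle3(model) -> bool:
    # True only if the model exposes a callable hook with the name from the trace.
    return callable(getattr(model, "set_eagle3_layers_to_capture", None))


for model in (Eagle3CapableModel(), KimiLikeModel()):
    name = type(model).__name__
    if supports_eagle3(model):
        model.set_eagle3_layers_to_capture()
        print(f"{name}: hook present")
    else:
        # Calling the hook unconditionally here would raise AttributeError,
        # which matches the failure mode shown in the trace above.
        print(f"{name}: hook missing")
```

If this reading is right, the fix would be on the model side (adding the hook to the Kimi model class) rather than in transformers.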

Additionally, since kt vendors the sglang module, I can't file any report there.

Second, it is unclear how kt's sglang differs from mainline sglang. There are pulls/commits that "merge" from mainline, but it is still vague what to expect.
