Deploying GLM 5.1 on H200 #2019

### Reminder

- [x] I have read the above rules and searched the existing issues.

### System Info

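I am deploying ktransformers on an offline server using Docker; the image is approachingai/ktransformers:v0.5.3. The server has two H200 GPUs and 386 GB of RAM. I am trying to deploy GLM 5.1 and will add more memory if this is not enough, but I could not get it to run. This is the command I ran: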
export PYTORCH_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_JIT_DEEPGEMM=0

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /code/GLM-5.1-FP8 \
  --kt-weight-path /code/GLM-5.1-FP8 \
  --kt-cpuinfer 96 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 72 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 1024 \
  --kt-enable-dynamic-expert-update \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --served-model-name GLM5.1 \
  --enable-mixed-chunk \
  --tensor-parallel-size 2 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 16384 \
  --max-running-requests 4 \
  --max-total-tokens 128000 \
  --attention-backend flashinfer \
  --kv-cache-dtype bf16 \
  --fp8-gemm-backend cutlass \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --watchdog-timeout 30000
After running the command, the following messages are printed:
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
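The command passes `--kt-threadpool-count 2`, while the log reports both worker pools on NUMA 0 and `NUMA node 1 not found`. A minimal sketch for checking which NUMA nodes are visible inside the container is shown here. It assumes a Linux system that exposes nodes under `/sys/devices/system/node`, and it assumes (not confirmed by the ktransformers documentation quoted in this issue) that each thread pool is meant to map to one NUMA node. The helper names `list_numa_nodes` and `threadpool_count_fits` are hypothetical.

```python
import os
import re


def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root (empty list if absent)."""
    try:
        entries = os.listdir(sysfs_root)
    except (FileNotFoundError, NotADirectoryError):
        return []
    ids = []
    for name in entries:
        # Node directories are named node0, node1, ...; other entries are skipped.
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def threadpool_count_fits(threadpool_count, nodes):
    """True when every pool index 0..threadpool_count-1 has a visible NUMA node.

    Assumption: one thread pool per NUMA node, indexed from 0.
    """
    return all(i in nodes for i in range(threadpool_count))


if __name__ == "__main__":
    nodes = list_numa_nodes()
    print("visible NUMA nodes:", nodes)
    print("--kt-threadpool-count 2 fits:", threadpool_count_fits(2, nodes))
```

If only node 0 is listed, the machine (or the container) exposes a single NUMA node, and a thread pool count of 2 would not match that layout under the assumption above.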
The NUMA messages above do not raise an error. The actual error output follows:
[NativeMoEWrapper Layer 42] load_experts: 298.6ms, prepare_tensors: 0.0ms, build_ptrs: 0.4ms, create_moe: 9.6ms, cpp_load_weights: 5712.6ms, cleanup: 0.4ms, total: 6021.7ms
TP MOE layer 43, pool: 0x64386eb51e40, expert num: 256, num_experts_per_tok: 8
Created AMX_FP8_MOE_TP 0 at numa 0
alloc 1 from other numa for 7a66e7c047d0
Created AMX_FP8_MOE_TP 1 at numa 0
[rank1]:[E521 08:23:52.969958665 ProcessGroupGloo.cpp:71] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2026-05-21 08:23:52] Rank 0 scheduler is dead. Please check if there are relevant logs.
[rank1]:[W521 08:23:52.987245866 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=57, addr=[localhost]:46838, remote=[localhost]:56731): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7c84e8570b80 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffc5b1 (0x7c852a8cd5b1 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffda13 (0x7c852a8cea13 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5ffe55a (0x7c852a8cf55a in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7c852a8ca27e in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7c84e9449868 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdf0e6 (0x7c86c74b40e6 in /opt/miniconda3/envs/serve/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x7c86ca077aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: __clone + 0x44 (0x7c86ca104a64 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W521 08:23:52.001241019 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[2026-05-21 08:23:52 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 3118, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 363, in __init__
    self.init_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 559, in init_model_worker
    self.init_tp_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 517, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 415, in __init__
    self.initialize(min_per_gpu_memory)
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 495, in initialize
    self.load_model()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 1073, in load_model
    raise ValueError(
ValueError: TP rank 1 could finish the model loading, but there are other ranks that didn't finish loading. It is likely due to unexpected failures (e.g., OOM) or a slow node.

[2026-05-21 08:23:52] Received sigquit from a child process. It usually means the child failed.
glm_51.sh: line 32:  4371 Killed                  python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model /code/GLM-5.1-FP8 --kt-weight-path /code/GLM-5.1-FP8 --kt-cpuinfer 96 --kt-threadpool-count 2 --kt-num-gpu-experts 72 --kt-method FP8 --kt-gpu-prefill-token-threshold 1024 --kt-enable-dynamic-expert-update --kt-expert-placement-strategy uniform --trust-remote-code --mem-fraction-static 0.85 --served-model-name GLM5.1 --enable-mixed-chunk --tensor-parallel-size 2 --enable-p2p-check --disable-shared-experts-fusion --chunked-prefill-size 16384 --max-running-requests 4 --max-total-tokens 128000 --attention-backend flashinfer --kv-cache-dtype bf16 --fp8-gemm-backend cutlass --tool-call-parser glm47 --reasoning-parser glm45 --watchdog-timeout 30000
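The `Killed` line, together with the OOM hint in the `ValueError`, suggests the process may have been terminated for running out of host memory, although the log does not confirm this. A rough sketch for comparing the on-disk weight size with available RAM is shown here. It assumes a Linux `/proc/meminfo`; the helper names are hypothetical. The comparison is only an upper-bound estimate, since some weights are placed on the GPUs, and a Docker memory limit may be lower than what `/proc/meminfo` reports.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all files under path (symlinks to files are resolved)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):  # skips broken symlinks
                total += os.path.getsize(full)
    return total


def mem_available_bytes(meminfo_path="/proc/meminfo"):
    """Parse MemAvailable (reported in kB) from a /proc/meminfo-style file."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found in " + meminfo_path)


def main(model_dir="/code/GLM-5.1-FP8"):
    gib = 1024 ** 3
    if os.path.isdir(model_dir):
        print(f"weights on disk: {dir_size_bytes(model_dir) / gib:.1f} GiB")
    else:
        print("model directory not found:", model_dir)
    if os.path.exists("/proc/meminfo"):
        print(f"MemAvailable:    {mem_available_bytes() / gib:.1f} GiB")


if __name__ == "__main__":
    main()
```

Running `dmesg` on the host after the crash should show whether the kernel OOM killer was involved.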
I don't know what to do next. Any help would be appreciated.

### Reproduction

```text
Put your message here.
```


### Others

_No response_
