What problem are you facing?
We'd like to see GLM-5.2 run on Modelplane.
zai-org/GLM-5.2. 753B total / 32B active MoE, native BF16 weights (~1.5 TB), 256 routed experts (8 active) plus 1 shared, DeepSeek-style sparse attention, 1M context. The repo is public and MIT-licensed, so no Hugging Face token is needed. Its architecture, glm_moe_dsa (GlmMoeDsaForCausalLM), reuses the DeepSeek-V2/V3 lineage. An FP8 build, zai-org/GLM-5.2-FP8 (~0.75 TB), fits a single 8x H200 node, but the BF16 build across two nodes is the target here.
Note for BF16 on two 8x H200 nodes we'll barely be able to fit any context - 32k or so.
How could Modelplane help solve your problem?
GPU footprint. 16x NVIDIA H200 across two nodes, 2x EKS p5en.48xlarge (8x H200 141 GiB each, 1128 GiB/node, 3200 Gbps EFA, placed in an EC2 UltraCluster; spec). The BF16 weights are ~1.5 TB; at ~127 GiB usable per H200 (90% of 141 GiB) one 8-GPU node (~1.0 TB) can't hold them, so two nodes (~2.0 TB usable) are the floor. That leaves ~0.5 TB across the gang for KV cache and activations, ample for a single smoke-test request at a capped context. The FP8 build (~0.75 TB) would fit one node.
Topology. One engine as a Leader plus Worker gang (worker.nodes: 1), eight H200 per node, spanning the two nodes via LeaderWorkerSet. Parallelism is the engine's concern. vLLM recommends TP = GPUs-per-node, PP = nodes for multi-node, so --tensor-parallel-size=8 --pipeline-parallel-size=2 across the gang, written as vLLM flags on each member, with the worker addressing the leader through the injected $(MODELPLANE_LEADER_ADDRESS). Tool calling needs --enable-auto-tool-choice plus the GLM tool-call parser. --reasoning-parser glm45 is the established reasoning parser for the GLM family; the tool-call parser is either glm45 (proven for GLM-4.5/4.6) or glm47 (the newer GLM tool-call format), so try glm45 first and fall back to glm47 if tool calls leak into content. Include --trust-remote-code, and cap context with --max-model-len since the native 1M won't fit (start ~32K). Multi-node engines require a ModelCache: ~1.5 TB from the public Hugging Face repo.
Success criteria.
What problem are you facing?
We'd like to see GLM-5.2 run on Modelplane.
zai-org/GLM-5.2. 753B total / 32B active MoE, native BF16 weights (~1.5 TB), 256 routed experts (8 active) plus 1 shared, DeepSeek-style sparse attention, 1M context. The repo is public and MIT-licensed, so no Hugging Face token is needed. Its architecture,glm_moe_dsa(GlmMoeDsaForCausalLM), reuses the DeepSeek-V2/V3 lineage. An FP8 build,zai-org/GLM-5.2-FP8(~0.75 TB), fits a single 8x H200 node, but the BF16 build across two nodes is the target here.Note for BF16 on two 8x H200 nodes we'll barely be able to fit any context - 32k or so.
How could Modelplane help solve your problem?
GPU footprint. 16x NVIDIA H200 across two nodes, 2x EKS
p5en.48xlarge(8x H200 141 GiB each, 1128 GiB/node, 3200 Gbps EFA, placed in an EC2 UltraCluster; spec). The BF16 weights are ~1.5 TB; at ~127 GiB usable per H200 (90% of 141 GiB) one 8-GPU node (~1.0 TB) can't hold them, so two nodes (~2.0 TB usable) are the floor. That leaves ~0.5 TB across the gang for KV cache and activations, ample for a single smoke-test request at a capped context. The FP8 build (~0.75 TB) would fit one node.Topology. One engine as a
LeaderplusWorkergang (worker.nodes: 1), eight H200 per node, spanning the two nodes via LeaderWorkerSet. Parallelism is the engine's concern. vLLM recommends TP = GPUs-per-node, PP = nodes for multi-node, so--tensor-parallel-size=8 --pipeline-parallel-size=2across the gang, written as vLLM flags on each member, with the worker addressing the leader through the injected$(MODELPLANE_LEADER_ADDRESS). Tool calling needs--enable-auto-tool-choiceplus the GLM tool-call parser.--reasoning-parser glm45is the established reasoning parser for the GLM family; the tool-call parser is eitherglm45(proven for GLM-4.5/4.6) orglm47(the newer GLM tool-call format), so tryglm45first and fall back toglm47if tool calls leak into content. Include--trust-remote-code, and cap context with--max-model-lensince the native 1M won't fit (start ~32K). Multi-node engines require aModelCache: ~1.5 TB from the public Hugging Face repo.Success criteria.
ModelCachehydrates from the public repo and reports Ready on the target cluster.ModelService's OpenAI-compatible endpoint returns a response.glm45orglm47).