Skip to content

Verify Modelplane can serve GLM-5.2 #208

Description

@negz

What problem are you facing?

We'd like to see GLM-5.2 run on Modelplane.

zai-org/GLM-5.2. 753B total / 32B active MoE, native BF16 weights (~1.5 TB), 256 routed experts (8 active) plus 1 shared, DeepSeek-style sparse attention, 1M context. The repo is public and MIT-licensed, so no Hugging Face token is needed. Its architecture, glm_moe_dsa (GlmMoeDsaForCausalLM), reuses the DeepSeek-V2/V3 lineage. An FP8 build, zai-org/GLM-5.2-FP8 (~0.75 TB), fits a single 8x H200 node, but the BF16 build across two nodes is the target here.

Note for BF16 on two 8x H200 nodes we'll barely be able to fit any context - 32k or so.

How could Modelplane help solve your problem?

GPU footprint. 16x NVIDIA H200 across two nodes, 2x EKS p5en.48xlarge (8x H200 141 GiB each, 1128 GiB/node, 3200 Gbps EFA, placed in an EC2 UltraCluster; spec). The BF16 weights are ~1.5 TB; at ~127 GiB usable per H200 (90% of 141 GiB) one 8-GPU node (~1.0 TB) can't hold them, so two nodes (~2.0 TB usable) are the floor. That leaves ~0.5 TB across the gang for KV cache and activations, ample for a single smoke-test request at a capped context. The FP8 build (~0.75 TB) would fit one node.

Topology. One engine as a Leader plus Worker gang (worker.nodes: 1), eight H200 per node, spanning the two nodes via LeaderWorkerSet. Parallelism is the engine's concern. vLLM recommends TP = GPUs-per-node, PP = nodes for multi-node, so --tensor-parallel-size=8 --pipeline-parallel-size=2 across the gang, written as vLLM flags on each member, with the worker addressing the leader through the injected $(MODELPLANE_LEADER_ADDRESS). Tool calling needs --enable-auto-tool-choice plus the GLM tool-call parser. --reasoning-parser glm45 is the established reasoning parser for the GLM family; the tool-call parser is either glm45 (proven for GLM-4.5/4.6) or glm47 (the newer GLM tool-call format), so try glm45 first and fall back to glm47 if tool calls leak into content. Include --trust-remote-code, and cap context with --max-model-len since the native 1M won't fit (start ~32K). Multi-node engines require a ModelCache: ~1.5 TB from the public Hugging Face repo.

Success criteria.

  • The ~1.5 TB ModelCache hydrates from the public repo and reports Ready on the target cluster.
  • The gang schedules across the two H200 nodes and the replica becomes ready.
  • A chat completion through the ModelService's OpenAI-compatible endpoint returns a response.
  • An agentic / tool-calling request round-trips, validating the GLM tool-call parser (glm45 or glm47).

Metadata

Metadata

Assignees

No one assigned

    Labels

    EcosystemEcosystem: Recipes, ProvidersenhancementNew feature or request

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions