Verify Modelplane can serve GLM-5.2

### What problem are you facing?

We'd like to see GLM-5.2 run on Modelplane.

[`zai-org/GLM-5.2`](https://huggingface.co/zai-org/GLM-5.2). 753B total / 32B active MoE, native BF16 weights (~1.5 TB), 256 routed experts (8 active) plus 1 shared, DeepSeek-style sparse attention, 1M context. The repo is public and MIT-licensed, so no Hugging Face token is needed. Its architecture, `glm_moe_dsa` (`GlmMoeDsaForCausalLM`), reuses the DeepSeek-V2/V3 lineage. An FP8 build, [`zai-org/GLM-5.2-FP8`](https://huggingface.co/zai-org/GLM-5.2-FP8) (~0.75 TB), fits a single 8x H200 node, but the BF16 build across two nodes is the target here.

Note for BF16 on two 8x H200 nodes we'll barely be able to fit any context - 32k or so.

### How could Modelplane help solve your problem?

**GPU footprint.** 16x NVIDIA H200 across two nodes, 2x EKS `p5en.48xlarge` (8x H200 141 GiB each, 1128 GiB/node, 3200 Gbps EFA, placed in an EC2 UltraCluster; [spec](https://docs.aws.amazon.com/ec2/latest/instancetypes/ac.html)). The BF16 weights are ~1.5 TB; at ~127 GiB usable per H200 (90% of 141 GiB) one 8-GPU node (~1.0 TB) can't hold them, so two nodes (~2.0 TB usable) are the floor. That leaves ~0.5 TB across the gang for KV cache and activations, ample for a single smoke-test request at a capped context. The FP8 build (~0.75 TB) would fit one node.

**Topology.** One engine as a `Leader` plus `Worker` gang (`worker.nodes: 1`), eight H200 per node, spanning the two nodes via LeaderWorkerSet. Parallelism is the engine's concern. vLLM recommends TP = GPUs-per-node, PP = nodes for multi-node, so `--tensor-parallel-size=8 --pipeline-parallel-size=2` across the gang, written as vLLM flags on each member, with the worker addressing the leader through the injected `$(MODELPLANE_LEADER_ADDRESS)`. Tool calling needs `--enable-auto-tool-choice` plus the GLM tool-call parser. `--reasoning-parser glm45` is the established reasoning parser for the GLM family; the tool-call parser is either `glm45` (proven for GLM-4.5/4.6) or `glm47` (the newer GLM tool-call format), so try `glm45` first and fall back to `glm47` if tool calls leak into content. Include `--trust-remote-code`, and cap context with `--max-model-len` since the native 1M won't fit (start ~32K). Multi-node engines require a `ModelCache`: ~1.5 TB from the public Hugging Face repo.

**Success criteria.**
- [ ] The ~1.5 TB `ModelCache` hydrates from the public repo and reports Ready on the target cluster.
- [ ] The gang schedules across the two H200 nodes and the replica becomes ready.
- [ ] A chat completion through the `ModelService`'s OpenAI-compatible endpoint returns a response.
- [ ] An agentic / tool-calling request round-trips, validating the GLM tool-call parser (`glm45` or `glm47`).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Verify Modelplane can serve GLM-5.2 #208

What problem are you facing?

How could Modelplane help solve your problem?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Verify Modelplane can serve GLM-5.2 #208

Description

What problem are you facing?

How could Modelplane help solve your problem?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions