sglang disagg: opt-in RDMA device restriction + benchmark harness fixes & metrics display#200

Open
atnair-amd wants to merge 3 commits into main from atnair/sglang-disagg-rdma-restrict-metrics

Conversation

@atnair-amd atnair-amd commented Jun 3, 2026

Collaborator

Summary

Three related changes to the SGLang disaggregated-PD inference suite, developed and validated together on a 2-node MI300 + Thor2 RoCE setup.

1. Opt-in RDMA device restriction (restrict_rdma_devices)

A --privileged container exposes every host RDMA device under /dev/infiniband regardless of --device, so NCCL_IB_HCA only limits usage, not discovery. A new opt-in path launches the inference container unprivileged, exposing only the /dev/infiniband/uverbsN nodes of the configured HCAs, so ibv_devinfo inside the container lists only that set.

  • docker_lib.launch_docker_container: backward-compatible privileged=True / extra_run_args='' kwargs (defaults unchanged for all other suites).
  • linux_utils.get_uverbs_devices_for_hcas: resolve ibdev names to per-node /dev/infiniband/uverbsN (+ rdma_cm) via /sys/class/infiniband_verbs.
  • sglang_llama_70b_distributed: when restrict_rdma_devices is set, resolve the configured HCAs per node and launch unprivileged.
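The ibdev-to-uverbs resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, parameters, and error handling are assumptions. It relies only on the sysfs layout in which each /sys/class/infiniband_verbs/uverbsN directory contains an ibdev file naming its HCA, and on docker run's --device flag.

```python
from pathlib import Path


def uverbs_devices_for_hcas(hcas, sysfs_root="/sys/class/infiniband_verbs",
                            dev_root="/dev/infiniband"):
    """Map ibdev names (e.g. rocep1s0) to their /dev/infiniband/uverbsN nodes.

    Hypothetical helper: name and signature are illustrative only.
    """
    wanted = set(hcas)
    found = {}
    for entry in sorted(Path(sysfs_root).glob("uverbs*")):
        ibdev_file = entry / "ibdev"
        if not ibdev_file.is_file():
            continue
        name = ibdev_file.read_text().strip()
        if name in wanted:
            found[name] = f"{dev_root}/{entry.name}"
    missing = sorted(wanted - found.keys())
    if missing:
        raise LookupError(f"no uverbs node found for HCAs: {missing}")
    # Keep the caller's HCA order, then add the connection-manager node.
    devices = [found[h] for h in hcas]
    devices.append(f"{dev_root}/rdma_cm")
    return devices


def docker_device_args(devices):
    """Turn device paths into repeated `--device <path>` arguments."""
    return [arg for dev in devices for arg in ("--device", dev)]
```

The resulting argument list would be passed in place of --privileged, so a device that is not listed never appears inside the container.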

2. Benchmark harness fixes (pre-existing bugs, unrelated to the workload)

  • gsm8k + bench_serv targeted 0.0.0.0 (localhost on the benchmark node) instead of the router node, so connections were refused; both now target proxy_router_node.
  • bench_serv: PYTHONPATH=/sgl-workspace/sglang/python so sglang.bench_serving imports on images whose editable-install finder is stale.
  • bench_serv: optional --dataset-path (config bench_dataset_path) for a pre-staged corpus under HF_HUB_OFFLINE; optional --max-concurrency (config max_concurrency) so it doesn't flood the deployment.
  • exec_nic_setup_scripts (thor2/broadcom): run ibv_devinfo after copying the bnxt_re driver and verify devices enumerate, instead of matching a fixed bnxt_ name (HCAs may enumerate as rocepXXs0).
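As an illustration of the enumeration check in the last bullet, the sketch below accepts any device that ibv_devinfo reports rather than requiring a bnxt_ prefix. The function names and the optional expected list are assumptions made for this example; the only format relied on is the hca_id: line that ibv_devinfo prints per device.

```python
import re


def enumerated_hcas(ibv_devinfo_output):
    """Return the device names from the `hca_id:` lines of ibv_devinfo output."""
    return re.findall(r"^\s*hca_id:\s*(\S+)", ibv_devinfo_output, flags=re.MULTILINE)


def verify_devices_enumerate(ibv_devinfo_output, expected=None):
    """Fail if no RDMA device enumerates, whatever naming scheme is in use.

    `expected` is a hypothetical optional list of names that must be present.
    """
    found = enumerated_hcas(ibv_devinfo_output)
    if not found:
        raise RuntimeError("ibv_devinfo reported no RDMA devices")
    if expected:
        missing = [h for h in expected if h not in found]
        if missing:
            raise RuntimeError(f"expected HCAs did not enumerate: {missing}")
    return found
```

Checking for presence instead of a name prefix means the same check passes whether the HCAs appear as bnxt_re0 or as rocepXXs0.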

3. Structured metrics display

  • parse_bench_serv_metrics parses the full Serving Benchmark Result block, fixing the unescaped-paren guards (median/p99 TTFT/TPOT) and the E2EL vs E2E Latency mismatch, and adding previously-unparsed fields (~28 total).
  • gsm8k parses accuracy/invalid/latency/tokens_per_sec into the results dict.
  • Uniform per-node metrics table with per-threshold PASS/FAIL verdicts.
  • Per-item detail: gsm8k per-question (--raw-result-file) and bench_serv per-request (--output-file/--output-details), logged as a compact table.
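The unescaped-paren problem mentioned above is easy to reproduce. In the sketch below, the sample block, its values, the function names, and the "at least minimum" threshold rule are all made up for illustration; it is not the parser in this PR. A label such as Median TTFT (ms) has to be escaped before it goes into a regex, otherwise (ms) is read as a capture group and the pattern never matches the printed line.

```python
import re

# Made-up sample in the general shape of a benchmark result block.
SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     10
Median TTFT (ms):                        120.5
Mean E2E Latency (ms):                   2000.0
==================================================
"""


def parse_metric(output, label):
    """Return the numeric value printed after `label:`, or None if absent."""
    # re.escape keeps "(ms)" literal instead of treating it as a group.
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r":\s*([-+]?\d+(?:\.\d+)?)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(output)
    return float(match.group(1)) if match else None


def threshold_verdict(value, minimum):
    """PASS when the metric was parsed and meets a hypothetical minimum."""
    return "PASS" if value is not None and value >= minimum else "FAIL"


if __name__ == "__main__":
    for label in ("Successful requests", "Median TTFT (ms)", "P99 TPOT (ms)"):
        print(f"{label:<24} {parse_metric(SAMPLE, label)}")
```

Returning None for a missing field, rather than raising, lets the table show a FAIL verdict for that row without aborting the rest of the report.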

Out of scope

  • The core42 cluster/config JSON used to validate (lives in a separate config repo).
  • Copying the per-item JSONL artifacts into the devbox run dir/bundle.
  • Per-percentile threshold gating.

Test plan

  • Local gate: make fmt-check && make lint && make test -- all pass.
  • On-cluster: full sglang_llama_70b_distributed suite, 2-node disagg (1 prefill/router + 1 decode/bench), unprivileged restricted-RDMA containers -- green (10/10). ibv_devinfo inside the container shows exactly the 8 configured HCAs (4 excluded); gsm8k 1017 tok/s (941/1000 correct); bench_serv 300/300 successful, 883 tok/s, with full per-item tables.
  • Parsers additionally unit-validated offline against real captured benchmark output.

…ured HCAs only)

A privileged container exposes every host RDMA device under /dev/infiniband
regardless of --device, so NCCL_IB_HCA only limits usage, not discovery. Add an
opt-in path that launches the inference container unprivileged with only the
configured HCAs' /dev/infiniband/uverbsN nodes exposed, so ibv_devinfo inside
the container is restricted to that set.

- docker_lib.launch_docker_container: add backward-compatible privileged=True
  and extra_run_args='' kwargs; defaults preserve prior behavior for all suites.
- linux_utils.get_uverbs_devices_for_hcas: resolve ibdev names to per-node
  /dev/infiniband/uverbsN (+ rdma_cm) via /sys/class/infiniband_verbs.
- sglang_llama_70b_distributed: when config restrict_rdma_devices is set, resolve
  the configured HCAs per node and launch unprivileged with an explicit device list.

Signed-off-by: Atul Nair <Atul.Nair@amd.com>

Benchmark harness fixes (gsm8k + bench_serv), independent of the RDMA work:
- target proxy_router_node instead of 0.0.0.0 (the benchmark client runs on the
  benchmark node while the router runs on another node, so localhost refused).
- bench_serv: PYTHONPATH=/sgl-workspace/sglang/python so sglang.bench_serving
  imports on images whose editable-install finder is stale.
- bench_serv: optional --dataset-path (config bench_dataset_path) for a
  pre-staged corpus under HF_HUB_OFFLINE; optional --max-concurrency
  (config max_concurrency) so it does not flood the deployment.
- exec_nic_setup_scripts (thor2/broadcom): run ibv_devinfo after copying the
  bnxt_re driver and verify devices enumerate, instead of matching a fixed
  bnxt_ name prefix (HCAs may enumerate as rocepXXs0).

Structured metrics display:
- parse_bench_serv_metrics parses the full Serving Benchmark Result block,
  fixing the unescaped-paren guards (median/p99 TTFT/TPOT) and the
  E2EL vs E2E Latency mismatch and adding previously-unparsed fields.
- gsm8k parses accuracy/invalid/latency/tokens_per_sec into the results dict.
- log a uniform per-node metrics table with per-threshold PASS/FAIL verdicts.
- capture per-item detail (gsm8k per-question via --raw-result-file, bench_serv
  per-request via --output-file/--output-details) and log a compact table.
- docs/specs/inference_metrics_display.md documents the design.

Signed-off-by: Atul Nair <Atul.Nair@amd.com>

Drop docs/specs/inference_metrics_display.md; the metrics parsing/display
behavior is described in the code and the PR description.

Signed-off-by: Atul Nair <Atul.Nair@amd.com>