[graph_trainer] Fix H100 CI: pre-install spmd_types (torch nightly hard-dep) by SherlockNoMad · Pull Request #3588 · pytorch/torchtitan

SherlockNoMad · 2026-06-09T07:37:19Z

Problem

The GraphTrainer 8 GPU H100 Integration Tests workflow has been failing on main for a while. The current, recurring failure occurs during dependency install:

Collecting spmd-types==0.2.0 (from torch)
  WARNING: Retrying ... ConnectionResetError(104, 'Connection reset by peer'): /packages/.../spmd_types-0.2.0-py3-none-any.whl.metadata
ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded

Root cause

The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The Meta-internal H100 runner can reach download.pytorch.org but not the public PyPI file host (files.pythonhosted.org). The torch install step deliberately clears PIP_EXTRA_INDEX_URL (the reachable in-cluster cache) "so the default cpu index can't supply a +cpu torch" — so when pip resolves torch's spmd_types dependency it falls back to public PyPI, which the runner cannot reach, and the install fails on every run.

(The docker image bakes spmd_types==0.2.1 from .ci/docker/requirements.txt, which does not satisfy torch's exact ==0.2.0 pin, so pip tries to fetch 0.2.0 regardless.)

Fix

Pre-install spmd_types from the in-cluster cache alongside torch's other pure-python deps (the workflow already does this for filelock, sympy, etc. for exactly this reason). Once it is already satisfied, the subsequent torch install does not try to fetch it from public PyPI. The pin matches the version torch nightly requires (0.2.0).

- python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy
+ python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy spmd_types==0.2.0
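
Whether a pin is already satisfied can be checked before the torch install runs. The following is a minimal sketch, not part of this PR: `unmet_pins` and its `lookup` parameter are hypothetical names, and plain string equality stands in for pip's full PEP 440 version matching. It reports every pin the current environment does not already meet, which is the case in which pip would go back to an index.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def unmet_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> Dict[str, Optional[str]]:
    """Map each unsatisfied pin to the version found (None if not installed).

    Hypothetical helper: string equality is a simplification of pip's
    PEP 440 comparison, good enough for plain pins such as "0.2.0".
    """
    unmet: Dict[str, Optional[str]] = {}
    for name, wanted in pins.items():
        try:
            found: Optional[str] = lookup(name)
        except metadata.PackageNotFoundError:
            # Not installed at all, so pip would have to fetch it.
            found = None
        if found != wanted:
            unmet[name] = found
    return unmet
```

Called as `unmet_pins({"spmd_types": "0.2.0"})`, an empty result would mean the torch install has nothing left to fetch for that dependency. The `lookup` argument exists only so the check can be exercised without touching the real environment.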

Notes

The shared .github/scripts/run_8xgpu_integration_tests.sh (main H100 workflow) and the RL H100 workflow have the same latent issue and likely want the same one-line fix; left out here to keep this PR scoped to the graph_trainer H100 CI.
This unblocks the install so the test suite can actually run again; any test-level failures surfaced afterward (e.g. nightly hash drift) will be addressed on top.

Test

Triggered via ciflow/h100.8 on this PR.

…rd-dep) The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The H100 runner can reach download.pytorch.org but not the public PyPI (files.pythonhosted.org), and the torch install clears PIP_EXTRA_INDEX_URL (the reachable in-cluster cache), so resolving spmd_types fails every run with 'Connection reset by peer'. Pre-install spmd_types from the in-cluster cache alongside torch's other pure-python deps so the torch install finds it already satisfied. Pin matches the version the torch nightly requires (0.2.0).

tianyu-l · 2026-06-09T08:15:45Z

+        # which clears PIP_EXTRA_INDEX_URL -- does not try to fetch it from the public
+        # PyPI (files.pythonhosted.org), which this runner cannot reach. Pin matches
+        # the version torch nightly requires.
+        python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy spmd_types==0.2.0


IIUC, right now graph trainer doesn't require spmd_types package, can we skip for now?

It's not torchtitan/graph_trainer pulling it — spmd_types==0.2.0 is a hard dependency of the torch nightly wheel itself (Collecting spmd-types==0.2.0 (from torch)), so pip install torch resolves it regardless of what graph_trainer needs. The H100 runner can reach download.pytorch.org but not public PyPI (files.pythonhosted.org), and we clear PIP_EXTRA_INDEX_URL for the torch install, so the fetch dies with "Connection reset". There's no clean way to skip a single torch hard-dep via pip, so I pre-install it from the reachable in-cluster cache — the same line already does this for torch's other pure-python deps (filelock/sympy/…). Confirmed working in the latest run (it resolved from the cache, no connection reset).

The image bakes spmd_types==0.2.1 (from .ci/docker/requirements.txt) which doesn't satisfy torch's exact ==0.2.0, which is why pip still tries to re-fetch. If you'd rather keep the workflow untouched, the alternative is aligning the .ci/docker pin to 0.2.0 — happy to switch to that. Either way it can't be fully skipped while the nightly hard-deps it.

@pianpwk -- is it expected for spmd_types to be a hard dependency of torch?

~~I think this happens for CI wheels? I don't think running generic pip install torch gets it.~~ never mind, it gets installed with nightlies

It sounds like we could update pytorch's dependency to 0.2.1? I think it should stay pinned there for a while (that version should be sufficient for titan + spmd types), while we figure this out

pytorch/pytorch#186803

thanks, but is it expected for spmd_types to be a hard dependency for the nightly? i thought it is an optional dependency

Let me figure this out.. I think many things are true right now:

pip install torch getting spmd_types is desirable for spmd_types' discoverability.

technically pytorch will run without spmd_types installed (e.g. building from source shouldn't install it), though with the nightly situation this is rarer. So technically it's not a hard requirement.

But titan CI, via torch nightly, needs it to work, so in some sense it is a requirement?

My expectation was that if someone wanted the spmd_types-parts of torch to work, they would have to install it directly or via torchannex (for which spmd_types can be a hard dependency). It's possible I am mistaken...

TorchTitan has a hard dependency on spmd_types now because the core components, e.g., sharding.py, import spmd_types. So yes, graph_trainer depends on spmd_types.

Yes I expect torchtitan will have a hard dependency on spmd_types. I was surprised by torch's

tianyu-l · 2026-06-09T08:15:55Z

        assert_expected_inline(
            model_hash,
-            """d8c4495bc41d103e3864433002d31be0823567938729396c44eb2f2782a47a23""",
+            """7b0175b74697dd569a152d1180eca4aafe49fee555844501894888ce79c9e8d9""",


Is numerics change expected?

Yes — expected upstream cu130-nightly drift, not a torchtitan change. In the same CI run the trace path still bitwise-matches eager (test_*_aot_fx_trace*_vs_eager pass), so eager and traced drifted together; only the hardcoded eager self-deterministic baselines needed refreshing. Same recurring nightly drift re-baselined before in #3482 / #3484. (DSv3's loss moved too, hence the loss update on the DSv3 classes.)
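
The reasoning above compares digests rather than tolerances. As a rough illustration only (this is not the code in test_bitwise_deterministic.py; `state_hash` is a hypothetical helper, and SHA-256 over packed doubles is an assumption), the sketch below shows how a stored baseline can go stale while two execution paths still agree bit for bit:

```python
import hashlib
import struct
from typing import Iterable


def state_hash(values: Iterable[float]) -> str:
    """Hash the exact bit pattern of each value (little-endian float64)."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack("<d", value))
    return digest.hexdigest()


# Stand-ins for model state: 0.1 + 0.2 is close to 0.3 but not bit-identical.
stored_baseline = state_hash([0.3, 1.0])
eager = state_hash([0.1 + 0.2, 1.0])
traced = state_hash([0.1 + 0.2, 1.0])

baseline_is_stale = eager != stored_baseline  # numerics moved upstream
paths_agree = eager == traced                 # both paths moved together
```

In this toy setup both flags are true, which mirrors the situation described: the hardcoded baseline needs refreshing, yet the trace-versus-eager comparison still holds.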

… nightly The hardcoded model_hash/grad_hash (and DSv3 loss) baselines in test_bitwise_deterministic.py drifted on the current cu130 torch nightly, failing test_eager_self_deterministic for all 6 model classes. This is upstream nightly numeric drift, not a torchtitan regression: test_*_aot_fx_trace*_vs_eager still passes (the trace path bitwise-matches eager), so eager and traced drifted together. Values taken from the H100 CI run (authoritative cu130 environment).

trying to unblock pytorch/torchtitan#3588 (comment)
Pull Request resolved: #186803
Approved by: https://github.com/Skylion007, https://github.com/aditvenk, https://github.com/malfet

SherlockNoMad requested review from fegin, tianyu-l, wconstab and wwwjn as code owners June 9, 2026 07:37

pytorch-bot Bot added the ciflow/8gpu label Jun 9, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 9, 2026

SherlockNoMad added the ciflow/h100.8 Trigger H100.8 CI label Jun 9, 2026

SherlockNoMad force-pushed the fix-graph-trainer-h100-spmd-types branch from 379c20b to a363002 Compare June 9, 2026 07:39

SherlockNoMad requested review from IvanKobzarev, aditvenk, sanketpurandare and xmfan as code owners June 9, 2026 08:13

tianyu-l reviewed Jun 9, 2026

SherlockNoMad force-pushed the fix-graph-trainer-h100-spmd-types branch from f2fc60b to d21fd0c Compare June 9, 2026 14:05

pianpwk mentioned this pull request Jun 9, 2026

update spmd_types to 0.2.1 pytorch/pytorch#186803

Closed

mlazos approved these changes Jun 10, 2026

SherlockNoMad wants to merge 2 commits into main from fix-graph-trainer-h100-spmd-types

6 participants
