[graph_trainer] Fix H100 CI: pre-install spmd_types (torch nightly hard-dep) #3588

Open
SherlockNoMad wants to merge 2 commits into main from fix-graph-trainer-h100-spmd-types

@SherlockNoMad

Contributor

Problem

The GraphTrainer 8 GPU H100 Integration Tests workflow has been failing on main for a while. The current, and most recently recurring, failure occurs during dependency install:

Collecting spmd-types==0.2.0 (from torch)
  WARNING: Retrying ... ConnectionResetError(104, 'Connection reset by peer'): /packages/.../spmd_types-0.2.0-py3-none-any.whl.metadata
ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded

Root cause

The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The Meta-internal H100 runner can reach download.pytorch.org but not the public PyPI file host (files.pythonhosted.org). The torch install step deliberately clears PIP_EXTRA_INDEX_URL (the reachable in-cluster cache) "so the default cpu index can't supply a +cpu torch" — so when pip resolves torch's spmd_types dependency it falls back to public PyPI, which the runner cannot reach, and the install fails on every run.

(The docker image bakes spmd_types==0.2.1 from .ci/docker/requirements.txt, which does not satisfy torch's exact ==0.2.0 pin, so pip tries to fetch 0.2.0 regardless.)
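
To make the version mismatch concrete, here is a minimal Python sketch of an exact-pin check. It is an illustration only, not pip's resolver: the helper name `satisfies_exact_pin` is hypothetical, and it handles only plain `==` pins with simple dotted numeric versions (no wildcards, pre-releases, or local version labels).

```python
def satisfies_exact_pin(installed: str, pinned: str) -> bool:
    """Return True when an installed version satisfies an exact `==` pin.

    Simplified: compares dotted numeric release segments and ignores
    trailing zeros, so "0.2" and "0.2.0" are treated as equal.
    """
    def release(version: str) -> tuple:
        parts = [int(p) for p in version.strip().split(".")]
        while len(parts) > 1 and parts[-1] == 0:
            parts.pop()
        return tuple(parts)

    return release(installed) == release(pinned)


# The image bakes 0.2.1, but the torch nightly pins ==0.2.0.
print(satisfies_exact_pin("0.2.1", "0.2.0"))  # False: pip has to fetch 0.2.0
print(satisfies_exact_pin("0.2.0", "0.2.0"))  # True: nothing left to download
```

A newer baked version does not help here, because an exact pin accepts only the one version it names.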

Fix

Pre-install spmd_types from the in-cluster cache alongside torch's other pure-python deps (the workflow already does this for filelock, sympy, etc. for exactly this reason). Once it is already satisfied, the subsequent torch install does not try to fetch it from public PyPI. The pin matches the version torch nightly requires (0.2.0).

- python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy
+ python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy spmd_types==0.2.0
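
One possible way to spot such pins before they break an install is to list the exact `==` requirements a package declares. The sketch below extracts them from a list of requirement strings. In a real environment that list could come from `importlib.metadata.requires("torch")`; here a hand-written sample is used. The helper name `exact_pins` and every sample entry other than `spmd-types==0.2.0` are assumptions made for illustration.

```python
import re

# Matches "name==version", optionally with extras, e.g. "pkg[extra]==1.2.3".
_PIN = re.compile(
    r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*==\s*([^\s,=]+)$"
)


def exact_pins(requirements):
    """Map package name -> version for requirements pinned with a single `==`."""
    pins = {}
    for req in requirements:
        spec = req.split(";", 1)[0].strip()  # drop environment markers
        match = _PIN.match(spec)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


# Hypothetical sample; only the spmd-types pin is taken from the CI log.
sample = [
    "filelock",
    "sympy>=1.0",
    "spmd-types==0.2.0",
    'example-pkg==1.0; platform_system == "Linux"',
]
print(exact_pins(sample))
```

Any pin found this way would need to be pre-installed at exactly that version from the reachable cache for the later torch install to treat it as already satisfied.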

Notes

  • The shared .github/scripts/run_8xgpu_integration_tests.sh (main H100 workflow) and the RL H100 workflow have the same latent issue and likely want the same one-line fix; left out here to keep this PR scoped to the graph_trainer H100 CI.
  • This unblocks the install so the test suite can actually run again; any test-level failures surfaced afterward (e.g. nightly hash drift) will be addressed on top.

Test

Triggered via ciflow/h100.8 on this PR.

@meta-cla (bot) added the CLA Signed label on Jun 9, 2026
@SherlockNoMad added the ciflow/h100.8 label on Jun 9, 2026
…rd-dep)

The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The
H100 runner can reach download.pytorch.org but not the public PyPI
(files.pythonhosted.org), and the torch install clears PIP_EXTRA_INDEX_URL (the
reachable in-cluster cache), so resolving spmd_types fails every run with
'Connection reset by peer'.

Pre-install spmd_types from the in-cluster cache alongside torch's other
pure-python deps so the torch install finds it already satisfied. Pin matches
the version the torch nightly requires (0.2.0).
# which clears PIP_EXTRA_INDEX_URL -- does not try to fetch it from the public
# PyPI (files.pythonhosted.org), which this runner cannot reach. Pin matches
# the version torch nightly requires.
python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy spmd_types==0.2.0

Contributor

IIUC, right now graph trainer doesn't require spmd_types package, can we skip for now?

Contributor Author

It's not torchtitan/graph_trainer pulling it — spmd_types==0.2.0 is a hard dependency of the torch nightly wheel itself (Collecting spmd-types==0.2.0 (from torch)), so pip install torch resolves it regardless of what graph_trainer needs. The H100 runner can reach download.pytorch.org but not public PyPI (files.pythonhosted.org), and we clear PIP_EXTRA_INDEX_URL for the torch install, so the fetch dies with "Connection reset". There's no clean way to skip a single torch hard-dep via pip, so I pre-install it from the reachable in-cluster cache — the same line already does this for torch's other pure-python deps (filelock/sympy/…). Confirmed working in the latest run (it resolved from the cache, no connection reset).

The image bakes spmd_types==0.2.1 (from .ci/docker/requirements.txt) which doesn't satisfy torch's exact ==0.2.0, which is why pip still tries to re-fetch. If you'd rather keep the workflow untouched, the alternative is aligning the .ci/docker pin to 0.2.0 — happy to switch to that. Either way it can't be fully skipped while the nightly hard-deps it.

Contributor

@pianpwk -- is it expected for spmd_types to be a hard dependency of torch?

@pianpwk pianpwk Jun 9, 2026

Contributor

I think this happens for CI wheels? I don't think running a generic pip install torch gets it. Never mind, it does get installed with nightlies.

It sounds like we could update pytorch's dependency to 0.2.1? I think it should stay pinned there for a while (that version should be sufficient for titan + spmd types), while we figure this out

pytorch/pytorch#186803

Contributor

thanks, but is it expected for spmd_types to be a hard dependency for the nightly? i thought it is an optional dependency

Contributor

Let me figure this out.. I think many things are true right now:

  • pip install torch getting spmd_types is desirable for spmd_types' discoverability.
  • technically pytorch will run without spmd_types installed (e.g. building from source shouldn't install it), though with the nightly situation this is rarer. So technically it's not a hard requirement.
  • But titan CI, via torch nightly, needs it to work, so in some sense it is a requirement?

Contributor

My expectation was that if someone wanted the spmd_types-parts of torch to work, they would have to install it directly or via torchannex (for which spmd_types can be a hard dependency). It's possible I am mistaken...

Contributor

TorchTitan has a hard dependency on spmd_types now because core components (e.g., sharding.py) import spmd_types. So yes, graph_trainer depends on spmd_types.

Contributor

Yes I expect torchtitan will have a hard dependency on spmd_types. I was surprised by torch's

  assert_expected_inline(
  model_hash,
- """d8c4495bc41d103e3864433002d31be0823567938729396c44eb2f2782a47a23""",
+ """7b0175b74697dd569a152d1180eca4aafe49fee555844501894888ce79c9e8d9""",

Contributor

Is the numerics change expected?

Contributor Author

Yes — expected upstream cu130-nightly drift, not a torchtitan change. In the same CI run the trace path still bitwise-matches eager (test_*_aot_fx_trace*_vs_eager pass), so eager and traced drifted together; only the hardcoded eager self-deterministic baselines needed refreshing. Same recurring nightly drift re-baselined before in #3482 / #3484. (DSv3's loss moved too, hence the loss update on the DSv3 classes.)
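
The reasoning above can be illustrated with a small sketch. It assumes the baselines are SHA-256 digests over the raw bytes of the model state. The 64-hex-character values are consistent with that, but the actual hashing in test_bitwise_deterministic.py is not shown in this thread, so `state_hash` and the toy values below are hypothetical.

```python
import hashlib
import struct


def state_hash(state):
    """SHA-256 over parameter names and float32 values, in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(state):
        values = state[name]
        digest.update(name.encode("utf-8"))
        digest.update(struct.pack(f"<{len(values)}f", *values))
    return digest.hexdigest()


# Toy stand-ins for the model state after one training step.
old_nightly = {"w": [0.5, 0.25], "b": [1.0]}
new_eager = {"w": [0.5, 0.2500001], "b": [1.0]}   # upstream change nudged one value
new_traced = {"w": [0.5, 0.2500001], "b": [1.0]}  # traced path follows eager exactly

# Eager and traced still agree bitwise, so the *_vs_eager tests pass ...
print(state_hash(new_eager) == state_hash(new_traced))   # True
# ... while the hardcoded baseline from the old nightly no longer matches.
print(state_hash(new_eager) == state_hash(old_nightly))  # False
```

Under these assumptions, a matching traced-vs-eager hash alongside a stale baseline points to the environment as the source of the change, not the tracing path.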

… nightly

The hardcoded model_hash/grad_hash (and DSv3 loss) baselines in
test_bitwise_deterministic.py drifted on the current cu130 torch nightly, failing
test_eager_self_deterministic for all 6 model classes. This is upstream nightly
numeric drift, not a torchtitan regression: test_*_aot_fx_trace*_vs_eager still
passes (the trace path bitwise-matches eager), so eager and traced drifted
together. Values taken from the H100 CI run (authoritative cu130 environment).
@SherlockNoMad force-pushed the fix-graph-trainer-h100-spmd-types branch from f2fc60b to d21fd0c on June 9, 2026 14:05
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 9, 2026
Labels

ciflow/h100.8, ciflow/8gpu, CLA Signed

6 participants