[graph_trainer] Fix H100 CI: pre-install spmd_types (torch nightly hard-dep)#3588
SherlockNoMad wants to merge 2 commits
…rd-dep) The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The H100 runner can reach download.pytorch.org but not the public PyPI (files.pythonhosted.org), and the torch install clears PIP_EXTRA_INDEX_URL (the reachable in-cluster cache), so resolving spmd_types fails every run with 'Connection reset by peer'. Pre-install spmd_types from the in-cluster cache alongside torch's other pure-python deps so the torch install finds it already satisfied. Pin matches the version the torch nightly requires (0.2.0).
379c20b to
a363002
# which clears PIP_EXTRA_INDEX_URL -- does not try to fetch it from the public
# PyPI (files.pythonhosted.org), which this runner cannot reach. Pin matches
# the version torch nightly requires.
python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy spmd_types==0.2.0
IIUC, right now graph trainer doesn't require spmd_types package, can we skip for now?
It's not torchtitan/graph_trainer pulling it — spmd_types==0.2.0 is a hard dependency of the torch nightly wheel itself (Collecting spmd-types==0.2.0 (from torch)), so pip install torch resolves it regardless of what graph_trainer needs. The H100 runner can reach download.pytorch.org but not public PyPI (files.pythonhosted.org), and we clear PIP_EXTRA_INDEX_URL for the torch install, so the fetch dies with "Connection reset". There's no clean way to skip a single torch hard-dep via pip, so I pre-install it from the reachable in-cluster cache — the same line already does this for torch's other pure-python deps (filelock/sympy/…). Confirmed working in the latest run (it resolved from the cache, no connection reset).
The image bakes spmd_types==0.2.1 (from .ci/docker/requirements.txt) which doesn't satisfy torch's exact ==0.2.0, which is why pip still tries to re-fetch. If you'd rather keep the workflow untouched, the alternative is aligning the .ci/docker pin to 0.2.0 — happy to switch to that. Either way it can't be fully skipped while the nightly hard-deps it.
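The pin mismatch described above can be illustrated with a minimal sketch. It assumes plain numeric release versions only (no pre-release, post-release, or local tags), so it is a simplification of what pip actually implements; the helper names `parse_exact_pin` and `satisfies_exact_pin` are hypothetical and exist only for this illustration.

```python
import re


def parse_exact_pin(requirement: str) -> tuple[str, str]:
    """Split 'name==version' into (normalized name, version)."""
    name, sep, version = requirement.partition("==")
    if not sep or not version.strip():
        raise ValueError(f"not an exact pin: {requirement!r}")
    # Package names compare case-insensitively, with '-', '_' and '.' equivalent.
    normalized = re.sub(r"[-_.]+", "-", name.strip()).lower()
    return normalized, version.strip()


def release_tuple(version: str) -> tuple[int, ...]:
    """Numeric release segments with trailing zeros dropped (so 0.2 equals 0.2.0)."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies_exact_pin(installed_version: str, requirement: str) -> bool:
    """True only if the installed version equals the pinned version."""
    _, pinned = parse_exact_pin(requirement)
    return release_tuple(installed_version) == release_tuple(pinned)


# The image's baked 0.2.1 does not satisfy torch's exact pin, so pip re-fetches.
print(satisfies_exact_pin("0.2.1", "spmd_types==0.2.0"))
# A pre-installed 0.2.0 does satisfy it, so no fetch is needed.
print(satisfies_exact_pin("0.2.0", "spmd-types==0.2.0"))
```

An exact `==` pin leaves no room for a newer patch release, which is why either the workflow pre-install or the `.ci/docker` pin has to name 0.2.0 specifically.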
@pianpwk -- is it expected for spmd_types to be a hard dependency of torch?
I think this happens for CI wheels? I don't think running generic pip install torch gets it. Never mind, it gets installed with nightlies.
It sounds like we could update pytorch's dependency to 0.2.1? I think it should stay pinned there for a while (that version should be sufficient for titan + spmd types), while we figure this out.
thanks, but is it expected for spmd_types to be a hard dependency for the nightly? i thought it is an optional dependency
Let me figure this out.. I think many things are true right now:
- pip install torch getting spmd_types is desirable for spmd_types' discoverability.
- technically pytorch will run without spmd_types installed (e.g. building from source shouldn't install it), though with the nightly situation this is rarer. So technically it's not a hard requirement.
- But titan CI, via torch nightly, needs it to work, so in some sense it is a requirement?
My expectation was that if someone wanted the spmd_types-parts of torch to work, they would have to install it directly or via torchannex (for which spmd_types can be a hard dependency). It's possible I am mistaken...
TorchTitan has a hard dependency on spmd_types now because the core components, e.g., sharding.py import spmd_types. So yes, graph_trainer depends on spmd_types.
Yes I expect torchtitan will have a hard dependency on spmd_types. I was surprised by torch's
assert_expected_inline(
    model_hash,
    """d8c4495bc41d103e3864433002d31be0823567938729396c44eb2f2782a47a23""",
    """7b0175b74697dd569a152d1180eca4aafe49fee555844501894888ce79c9e8d9""",
Is numerics change expected?
Yes — expected upstream cu130-nightly drift, not a torchtitan change. In the same CI run the trace path still bitwise-matches eager (test_*_aot_fx_trace*_vs_eager pass), so eager and traced drifted together; only the hardcoded eager self-deterministic baselines needed refreshing. Same recurring nightly drift re-baselined before in #3482 / #3484. (DSv3's loss moved too, hence the loss update on the DSv3 classes.)
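The reasoning above (eager and traced still agree with each other, while the hardcoded baseline no longer matches) can be sketched as follows. This is only an illustration: how torchtitan actually computes model_hash is not shown in this thread, so the sketch assumes a SHA-256 over parameter names and float32 bit patterns, and `state_hash` and the toy parameter values are hypothetical.

```python
import hashlib
import struct


def state_hash(params: dict[str, list[float]]) -> str:
    """SHA-256 over parameter names and float32 bit patterns, in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(params):
        values = params[name]
        digest.update(name.encode("utf-8"))
        # Pack as little-endian float32 so the hash depends on exact bit patterns.
        digest.update(struct.pack(f"<{len(values)}f", *values))
    return digest.hexdigest()


# Toy stand-ins: eager and traced runs on the current nightly produce identical bits.
eager = {"w": [0.5, -1.25], "b": [0.125]}
traced = {"w": [0.5, -1.25], "b": [0.125]}

# A baseline recorded on an older nightly, differing by a tiny amount in one value.
old_baseline = state_hash({"w": [0.5, -1.25], "b": [0.125 + 2**-20]})

# Eager and traced agree bit for bit, so the trace path itself has not regressed.
print(state_hash(eager) == state_hash(traced))
# The stored baseline no longer matches, so only the baseline needs refreshing.
print(state_hash(eager) == old_baseline)
```

Because any single-bit difference changes the digest, a hash mismatch alone cannot say which side moved; the eager-versus-traced comparison is what localizes the drift to the nightly.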
… nightly The hardcoded model_hash/grad_hash (and DSv3 loss) baselines in test_bitwise_deterministic.py drifted on the current cu130 torch nightly, failing test_eager_self_deterministic for all 6 model classes. This is upstream nightly numeric drift, not a torchtitan regression: test_*_aot_fx_trace*_vs_eager still passes (the trace path bitwise-matches eager), so eager and traced drifted together. Values taken from the H100 CI run (authoritative cu130 environment).
f2fc60b to
d21fd0c
trying to unblock pytorch/torchtitan#3588 (comment) Pull Request resolved: #186803 Approved by: https://github.com/Skylion007, https://github.com/aditvenk, https://github.com/malfet
Problem

The GraphTrainer 8 GPU H100 Integration Tests workflow has been failing on main for a while. The current (and recurring most-recently) failure is during dependency install:

Root cause

The nightly torch wheel declares spmd_types==0.2.0 as a hard dependency. The Meta-internal H100 runner can reach download.pytorch.org but not the public PyPI file host (files.pythonhosted.org). The torch install step deliberately clears PIP_EXTRA_INDEX_URL (the reachable in-cluster cache) "so the default cpu index can't supply a +cpu torch", so when pip resolves torch's spmd_types dependency it falls back to public PyPI, which the runner cannot reach, and the install fails on every run.

(The docker image bakes spmd_types==0.2.1 from .ci/docker/requirements.txt, which does not satisfy torch's exact ==0.2.0 pin, so pip tries to fetch 0.2.0 regardless.)

Fix

Pre-install spmd_types from the in-cluster cache alongside torch's other pure-python deps (the workflow already does this for filelock, sympy, etc. for exactly this reason). Once it is already satisfied, the subsequent torch install does not try to fetch it from public PyPI. The pin matches the version torch nightly requires (0.2.0).

Notes

.github/scripts/run_8xgpu_integration_tests.sh (main H100 workflow) and the RL H100 workflow have the same latent issue and likely want the same one-line fix; left out here to keep this PR scoped to the graph_trainer H100 CI.

Test

Triggered via ciflow/h100.8 on this PR.
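As a rough way to check the fix locally, the sketch below lists which exact pins are not yet satisfied in the current environment, that is, which ones a later torch install would still have to fetch from an index. It is not part of the workflow; `pins_needing_fetch` is a hypothetical helper, and it compares version strings literally rather than applying pip's full version rules.

```python
from importlib import metadata


def pins_needing_fetch(pins: dict[str, str]) -> list[str]:
    """Return 'name==version' for each pin not already satisfied in this environment."""
    unsatisfied = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Literal string comparison: a simplification of pip's version matching.
        if installed != wanted:
            unsatisfied.append(f"{name}=={wanted}")
    return unsatisfied


# After the pre-install step, the pin torch nightly requires should no longer be
# listed; before it (for example with 0.2.1 baked into the image), it would be.
print(pins_needing_fetch({"spmd_types": "0.2.0"}))
```

An empty list after the pre-install step would suggest the torch install can proceed without contacting public PyPI for this dependency.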