Parallel test with pytest-xdist #518
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/518
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 4c3c1fd with merge base 4f53882. This comment was automatically generated by Dr. CI and updates every 15 minutes.
What I had tried messing around with before was using https://docs.pytest.org/en/stable/how-to/mark.html, which is useful for marking the 2-GPU and 1-GPU tests independently; in the GitHub Actions file you can then have two pytest commands to run them one after the other.
Yup, that's what I have in mind too. Just want to make sure only those 2 tests use 2 GPUs.
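For concreteness, here is a minimal sketch of the marker-based split being discussed; the `multi_gpu` marker name and the test name are hypothetical, not something defined by this PR:

```python
import pytest
import torch


# conftest.py (hypothetical): register the custom marker so pytest
# does not warn about an unknown mark.
def pytest_configure(config):
    config.addinivalue_line("markers", "multi_gpu: test needs at least 2 GPUs")


# In a test file: tag only the few tests that really need 2 GPUs.
@pytest.mark.multi_gpu
def test_fsdp_roundtrip():  # hypothetical test name
    assert torch.cuda.device_count() >= 2


# The GitHub Actions job could then run the two groups back to back:
#   pytest -m "not multi_gpu" test/
#   pytest -m multi_gpu test/
```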
We can merge this, but let's keep an eye on whether this makes test failures trickier to debug in the future.
@msaroufim I keep seeing this test failing
Not exactly clear what's causing it. Should we revert or try to fix it?
Revert and ask questions later
This reverts commit cbaff6c.
* try pytest-xdist
* run multi_gpu tests separately
* each worker runs 1 file
* use loadscope to balance load better (for test_integration.py)
* change back to loadfile
* per-test schedule
* per-test sharding; avoid name collision
Currently this PR creates 2 workers, each with 2 GPUs. I think only the FSDP tests require 2 GPUs:
* ao/test/dtypes/test_nf4.py, lines 484 to 494 (at 4f53882)
* ao/test/prototype/test_low_bit_optim.py, lines 161 to 168 (at 4f53882)
Thus, CI runtime can be further improved by running most tests across more 1-GPU workers and keeping only the 2-GPU FSDP tests in a separate, sequential run.
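As an aside, here is a minimal, hedged sketch of the kind of 2-GPU guard such multi-GPU tests typically carry (the decorator arguments and test name are assumptions, not copied from the files above); a guard like this is also what makes it easy to skip these tests on 1-GPU workers:

```python
import pytest
import torch


# Skip unless at least 2 CUDA devices are visible, so the test runs on a
# 2-GPU worker and is skipped cleanly everywhere else.
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_fsdp_state_dict():  # hypothetical test name
    ...
```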
UPDATE: parallel tests with four 1-GPU workers plus sequential multi-GPU tests are implemented now. Some problems I encountered along the way (for future reference and improvement): pytest-xdist provides multiple methods to distribute work across workers (refer to the pytest-xdist documentation for the full list). I have tried the following:
* `--dist load` (send each test to the next available worker): tests in `TestSaveLoadMeta` use the same filename `test.pth`, which is problematic when multiple tests write/read the same file. I made a unique name for each test to solve it for now, though the better way is to use pytest's `tmp_path` fixture (not compatible with `unittest.TestCase`); see the sketch after this list. Also, `test_quant_llm_quantize_ebits_2_mbits_2` gets a strange torch.compile warning in PyTorch 2.3 only: `torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py:436] [0/48] failed to eagerly compile backwards for dynamic, suppressing in case backwards not needed`. Looks harmless though.
* `--dist loadscope` (send a TestClass to the next available worker): `TestSaveLoadMeta` gets a strange error with no traceback: `RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent forward run. Stack trace: [Could not find stack trace]. To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.` Since tests within a class are run by the same worker, there shouldn't be a filename conflict on `test.pth`. Perhaps some strange interaction between how pytest-xdist sends tests to the subprocess worker and CUDA graphs + torch.compile?
* `--dist loadfile` (send each file to the next available worker): this should be the least error-prone. However, there is little speedup (24 min vs 35 min) since the `test_integration.py` file is huge -> load imbalance (it's probably a good idea to split it into smaller files anyway).
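Here is a minimal sketch of the two filename strategies from the first bullet above; the test names are hypothetical and the bodies are trimmed down to the save/load pattern that causes the collision:

```python
import torch


# Option 1 (what this PR does): derive a unique checkpoint name per test so
# that tests running on different xdist workers never touch the same path.
def test_save_load_fp16():
    path = "test_save_load_fp16.pth"  # unique per test instead of a shared "test.pth"
    torch.save({"w": torch.zeros(2)}, path)
    assert "w" in torch.load(path)


# Option 2 (cleaner, but the fixture cannot be injected into
# unittest.TestCase methods): pytest's built-in tmp_path fixture gives each
# test its own temporary directory, so the file name no longer matters.
def test_save_load_bf16(tmp_path):
    path = tmp_path / "test.pth"  # same name is fine; the directory is unique
    torch.save({"w": torch.zeros(2)}, path)
    assert "w" in torch.load(path)
```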
The current approach of this PR is `--dist load`. To summarize the cost breakdown: it takes about 3 min to pull the Docker image and another 3-4 min to download and install packages, so two 2-GPU workers scale pretty well (7 + 14 min vs 7 + 28 min), while four 1-GPU workers are not much of a further improvement (7 + 12 min vs 7 + 28 min). I guess the 2-GPU FSDP tests could be run in parallel too in the latest configuration.
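Finally, a rough sketch of the kind of invocation the current approach amounts to, written with pytest's programmatic entry point; the worker count, test path, and the idea of a separate call for the FSDP tests are illustrative assumptions rather than the exact contents of the workflow file:

```python
import pytest

if __name__ == "__main__":
    # Single-GPU tests: four xdist workers with per-test scheduling (--dist load).
    rc = pytest.main(["-n", "4", "--dist", "load", "test/"])
    # The 2-GPU FSDP tests would then run in a separate, sequential pytest
    # invocation (no -n / --dist flags) so they keep both GPUs to themselves.
    raise SystemExit(rc)
```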