
Parallel test with pytest-xdist #518

Merged
8 commits merged into pytorch:main from the improve_ci branch on Jul 18, 2024

Conversation

gau-nernst (Collaborator) commented on Jul 17, 2024

Currently this PR creates 2 workers, each with 2 GPUs:

pytest test --verbose -s --dist load --tx popen//env:CUDA_VISIBLE_DEVICES=0,1 --tx popen//env:CUDA_VISIBLE_DEVICES=2,3

I think only the FSDP tests require 2 GPUs:

ao/test/dtypes/test_nf4.py (lines 484 to 494 in 4f53882):
class TestQLoRA(FSDPTest):
    @property
    def world_size(self) -> int:
        return 2

    @pytest.mark.skipif(
        version.parse(torch.__version__).base_version < "2.4.0",
        reason="torch >= 2.4 required",
    )
    @skip_if_lt_x_gpu(2)
    def test_qlora_fsdp2(self):
        ...


class TestFSDP2(FSDPTest):
    @property
    def world_size(self) -> int:
        return 2

    @pytest.mark.skipif(not TORCH_VERSION_AFTER_2_4, reason="torch >= 2.4 required")
    @skip_if_lt_x_gpu(2)
    def test_fsdp2(self):
        ...

Thus, CI runtime can be further improved by:

  1. Mark these tests as "require 2 GPUs" (see the marker sketch after this list).
  2. Run the remaining tests with 4 workers, each with 1 GPU.
  3. Run the 2-GPU tests in a separate step. This can either be 2 workers with 2 GPUs each, or plain non-parallel pytest, since there are not many 2-GPU tests right now.
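
One way to do step 1 is a custom pytest marker, in line with the suggestion further down in this thread. This is a minimal sketch only: the marker name multi_gpu, the conftest.py registration, and the example CI commands in the comments are my own illustration, not the code in this PR.

# conftest.py (hypothetical): register the custom marker so pytest does not warn about it
def pytest_configure(config):
    config.addinivalue_line("markers", "multi_gpu: test requires at least 2 GPUs")


# in the test file: tag the 2-GPU FSDP test classes so they can be selected with -m
import pytest

@pytest.mark.multi_gpu
class TestFSDP2(FSDPTest):
    ...

# The CI job could then run two pytest commands one after the other, e.g.:
#   pytest test -m "not multi_gpu" --dist load \
#       --tx popen//env:CUDA_VISIBLE_DEVICES=0 --tx popen//env:CUDA_VISIBLE_DEVICES=1 \
#       --tx popen//env:CUDA_VISIBLE_DEVICES=2 --tx popen//env:CUDA_VISIBLE_DEVICES=3
#   pytest test -m multi_gpu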

UPDATE: parallel tests with four 1-GPU workers plus sequential multi-GPU tests are implemented now. Some problems I encountered along the way (for future reference and improvement): pytest-xdist provides multiple modes for distributing work across workers (refer to the pytest-xdist docs for the full list). I have tried the following:

  • --dist load (send each test to the next available worker):
    • TestSaveLoadMeta uses the same filename test.pth, which is problematic when multiple tests write/read the same file. For now I gave each test a unique filename (see the sketch after this list); the better fix would be pytest's tmp_path fixture, but it is not compatible with unittest.TestCase.
    • test_quant_llm_quantize_ebits_2_mbits_2 gets a strange torch.compile warning, in PyTorch 2.3 only: torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py:436] [0/48] failed to eagerly compile backwards for dynamic, suppressing in case backwards not needed. It looks harmless though.
  • --dist loadscope (send a TestClass to the next available worker):
    • TestSaveLoadMeta gets a strange error with no traceback: RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent forward run. Stack trace: [Could not find stack trace]. To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation. Since tests within a class run on the same worker, there shouldn't be a filename conflict on test.pth. Perhaps some strange interaction between how pytest-xdist sends tests to the subprocess worker and CUDA graphs + torch.compile?
  • --dist loadfile (send each file to the next available worker): this should be the least error-prone, but there is little speedup (24 min vs 35 min) because test_integration.py is huge, which causes load imbalance (it is probably a good idea to split this file into smaller ones anyway).
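
A minimal sketch of the per-test unique filename workaround for a unittest.TestCase-style test, assuming the test only needs any collision-free path. The class name and naming scheme below are illustrative, not the exact change in this PR.

import os
import tempfile
import unittest

import torch


class TestSaveLoadExample(unittest.TestCase):
    def setUp(self):
        # one unique file per test method, so parallel xdist workers never
        # read/write the same path (unlike a shared hard-coded "test.pth")
        fd, self.ckpt_path = tempfile.mkstemp(prefix=f"{self.id()}_", suffix=".pth")
        os.close(fd)

    def tearDown(self):
        if os.path.exists(self.ckpt_path):
            os.remove(self.ckpt_path)

    def test_save_load(self):
        torch.save({"w": torch.randn(4)}, self.ckpt_path)
        state = torch.load(self.ckpt_path)
        self.assertIn("w", state)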

The current approach of this PR uses --dist load. To summarize:

Approach | CUDA nightly CI time
pytest sequential (baseline) | 35 min
pytest-xdist with two 2-GPU workers (this PR, original) | 21 min
pytest-xdist with four 1-GPU workers + sequential multi-GPU tests (this PR, latest) | 19 min

I also noticed that it takes about 3 min to pull the Docker image and another 3-4 min to download and install packages, i.e. roughly 7 min of fixed overhead per run. Against the 28 min of test time in the sequential baseline (7 + 28), two 2-GPU workers scale pretty well (7 + 14), while four 1-GPU workers are not much of a further improvement (7 + 12). I guess the 2-GPU FSDP tests could be run in parallel too in the latest configuration.


pytorch-bot bot commented Jul 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/518

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4c3c1fd with merge base 4f53882:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Jul 17, 2024
gau-nernst marked this pull request as ready for review on Jul 18, 2024 at 02:29
msaroufim (Member)

What I had tried messing around with before was using https://docs.pytest.org/en/stable/how-to/mark.html

which is useful for marking the 2-GPU and 1-GPU tests independently; in the GitHub Actions file you can then have 2 pytest commands to run them one after the other.

gau-nernst (Collaborator, Author)

Yup, that's what I have in mind too. Just want to make sure only those 2 tests use 2 GPUs.

msaroufim (Member)

We can merge this, but let's keep an eye on whether this makes test failures trickier to debug in the future.

msaroufim mentioned this pull request on Jul 18, 2024
msaroufim merged commit cbaff6c into pytorch:main on Jul 18, 2024
13 checks passed
gau-nernst deleted the improve_ci branch on Jul 18, 2024 at 19:36
gau-nernst (Collaborator, Author)

@msaroufim I keep seeing this test failing

test/integration/test_integration.py::TestSaveLoadMeta::test_save_load_int8woqtensors

Not exactly clear what's causing it. Should we revert or try to fix it?

msaroufim (Member) commented on Jul 19, 2024

Revert and ask questions later

msaroufim added a commit that referenced this pull request Jul 19, 2024
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* try pytest-xdist

* run multi_gpu tests separately

* each worker run 1 file

* use loadscope to balance load better (for test_integration.py)

* change back to loadfile

* per-test schedule

* per-test sharding. avoid name collision