[Prototype] Run tests in parallel #273

jlamypoirier · 2025-05-16T23:55:37Z

✨ Description

Allows running tests in parallel across all available GPUs, so we can run lots of tests fast. pytest-xdist already works relatively well, but it puts everything on the first GPU(s), which risks OOMs, port conflicts and other issues. I made a simple allocation and locking mechanism, adapted from pytest-xdist-lock, to prevent such issues.

The system comes in a few steps:

1. Tests request a certain number of GPUs, amount of GPU memory and ports through the get_test_resources mark or a specialized decorator such as requires_cuda.
2. The lock adapter safely allocates the GPU(s). It sets the default device to the first allocated one and restricts GPU usage through set_per_process_memory_fraction (5 GB by default for requested devices, 0 for other GPUs), which is good enough for many tests.
3. For simple tests nothing more is needed, but more complex ones need to know the allocated GPUs and ports, which they get through the get_test_resources fixture. This includes fast-llm runs and distributed configs, for which I added config options and the get_distributed_config fixture, and Megatron runs, which use CUDA_VISIBLE_DEVICES.
4. Once the test is done, the lock adapter checks that the allocation was respected, ensures that the GPU memory is de-allocated, and unlocks the resources for other tests.
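
The allocate-then-unlock cycle in the steps above can be illustrated with a minimal sketch. It is not the lock adapter from this PR: it assumes one lock file per GPU slot, uses `fcntl.flock` (POSIX only) for cross-process exclusion, and fails immediately instead of waiting when too few devices are free. The names `gpu_lock_sketch`, `total_gpus` and `lock_dir` are hypothetical.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def gpu_lock_sketch(num_gpus, total_gpus, lock_dir):
    """Hold exclusive file locks on `num_gpus` of `total_gpus` device slots.

    Hypothetical helper: yields the sorted list of allocated device indices.
    """
    held = {}  # device index -> open file descriptor holding the lock
    try:
        for index in range(total_gpus):
            if len(held) == num_gpus:
                break
            lock_path = os.path.join(lock_dir, f"gpu_{index}.lock")
            fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
            try:
                # Non-blocking: skip this slot if another holder has it.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                os.close(fd)
                continue
            held[index] = fd
        if len(held) < num_gpus:
            raise RuntimeError(f"only {len(held)} of {num_gpus} requested GPUs are free")
        yield sorted(held)
    finally:
        # Closing a descriptor releases its flock, freeing the slot.
        for fd in held.values():
            os.close(fd)
```

With four slots, a worker asking for two devices would get `[0, 1]`, and a second worker entering the same context while the first is still inside would get `[2, 3]`. A real adapter would likely block or retry rather than raise, and would also track memory and ports.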

What remains is to ensure that dependencies between tests are respected (i.e. that pytest-xdist and pytest-depends are compatible enough), and that shared resource files (e.g. the test dataset) are parallel-safe.
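
One common way to make a shared resource file parallel-safe is to build it once under an exclusive file lock and publish it with an atomic rename. The sketch below illustrates that idea only; `ensure_shared_file` and its `build` callback are hypothetical names, and it assumes a POSIX system where `fcntl.flock` is available and all workers share one filesystem.

```python
import fcntl
import os


def ensure_shared_file(path, build):
    """Create `path` exactly once, even if many workers call this concurrently.

    `build(tmp_path)` is a hypothetical callback that writes the file contents.
    """
    with open(f"{path}.lock", "w") as lock_file:
        # Blocks until any other worker holding the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        if not os.path.exists(path):
            tmp_path = f"{path}.tmp"
            build(tmp_path)  # Write to a temporary file first.
            os.replace(tmp_path, path)  # Atomic rename on the same filesystem.
    # Leaving the `with` block closes the lock file, which releases the lock.
    return path
```

Workers that arrive while the file is being built wait on the lock, then see that the file exists and skip the build.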

I got things to a relatively stable state up to ~20 workers, but things start to break beyond that. It's still enough to reduce the slow tests from 8 minutes to ~2 minutes, most of which comes from parallel overhead (~1 minute) and the slowest test (~40 s), so it leaves room for lots of extra tests.
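
The numbers above can be checked with a quick back-of-the-envelope calculation. It assumes, as a simplification, that the parallel overhead and the slowest test do not overlap, so their sum is a floor on the wall time no matter how many workers are added. All values are the approximate figures quoted above.

```python
serial_s = 8 * 60      # slow tests run serially: 8 minutes
parallel_s = 2 * 60    # parallel run: ~2 minutes
overhead_s = 60        # parallel overhead: ~1 minute
slowest_test_s = 40    # slowest single test: ~40 s

speedup = serial_s / parallel_s        # 4.0
floor_s = overhead_s + slowest_test_s  # 100 s
floor_share = floor_s / parallel_s     # fraction of the wall time spent on the floor

print(f"speedup ~{speedup:.0f}x, floor {floor_s} s ({floor_share:.0%} of wall time)")
```

Under that assumption only about 20 s of the parallel run is spent on everything else, which is why additional tests should fit without lengthening the run much.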

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier · 2025-05-16T23:58:11Z

tests/test_simple.py

-        num_gpus=2,
-        compare=f"test_{TEST_MODEL}",
-    )
+    with gpu_lock(num_gpus=2):


Ideally I'd use a mark or a better fixture to avoid this. Does any pytest expert here know how to do it? (@bigximik @tscholak)

How have you solved this?

I did this, and it seems to work: https://github.com/ServiceNow/Fast-LLM/pull/273/files#diff-e52e4ddd58b7ef887ab03c04116e676f6280b824ab7469d5d3080e5cba4f2128R350
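
For readers wondering how a mark can replace a `with gpu_lock(...)` block in general, here is a minimal sketch of the usual pytest pattern: an autouse fixture reads a marker from the test item. This is not the code behind the link above. The marker and fixture name `gpu_resources` and the helper `requested_gpus` are hypothetical, and the fixture only reports the request where a real one would acquire and release the lock.

```python
# conftest.py (sketch)
import pytest


def pytest_configure(config):
    # Register the hypothetical marker so pytest does not warn about it.
    config.addinivalue_line("markers", "gpu_resources(num_gpus): GPUs needed by the test")


def requested_gpus(node, default=0):
    """Return the GPU count requested by the closest `gpu_resources` marker."""
    marker = node.get_closest_marker("gpu_resources")
    if marker is None:
        return default
    return marker.kwargs.get("num_gpus", 1)


@pytest.fixture(autouse=True)
def gpu_resources(request):
    num_gpus = requested_gpus(request.node)
    # A real fixture would acquire the lock here ...
    yield num_gpus
    # ... and release it here, after the test has finished.


@pytest.mark.gpu_resources(num_gpus=2)
def test_two_gpus(gpu_resources):
    assert gpu_resources == 2
```

Because the fixture is autouse, unmarked tests go through it as well and simply request zero GPUs.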

5f4f2eb Parallel tests

jlamypoirier commented May 16, 2025

jlamypoirier added 6 commits May 21, 2025 16:52

f1205be wip
1031060 misc
b975362 Merge remote-tracking branch 'origin/main' into parallel_tests
f1f5ec7 stuff
177acb3 fix
8aab483 stuff

jlamypoirier mentioned this pull request May 28, 2025

Parallel tests v2 #276

Merged

8 tasks

Comments: jlamypoirier (May 16, 2025), bigximik (May 27, 2025), jlamypoirier (May 27, 2025)
