[CI/Build][ROCm] Enabling LoRA tests on ROCm #7369
Merged: simon-mo merged 32 commits into vllm-project:main from alexeykondrat:lora_test_enablement on Sep 4, 2024
31655b7 Modifying test_quant_model.py, AWQ is not supported on ROCm (alexeykondrat)
26d8429 xfailing Gemma test for further investigation (alexeykondrat)
670e3c6 Enabling LoRA tests for AMD in Buildkite (alexeykondrat)
a58f019 Adding reason for Gemma test xfail (alexeykondrat)
39f9bec Fixing MODELS re-definition (alexeykondrat)
d8207ed Removing - csrc/punica for LoRA dependancies (alexeykondrat)
df0efb1 Sorting imports (alexeykondrat)
a4f78e5 Update alignment (alexeykondrat)
91ce9c4 Make yapf happy (alexeykondrat)
32f3a10 Update test_quant_model.py (alexeykondrat)
a103778 Update test_quant_model.py (alexeykondrat)
1d35abb Make yapf(3.11) happy (alexeykondrat)
1d1a86c Removing csrc/punica dependency for LoRA long context test test-pipe… (alexeykondrat)
85f76e9 Exposing single GPU to the container (alexeykondrat)
23017cc Passing Bildkite env vars to container for pytest (alexeykondrat)
d7cde25 Explicitly setting number of parallel jobs(shards in pytest) to 1 (alexeykondrat)
f7794a6 Removing unused arguments in test shell script (alexeykondrat)
d371265 Placing the quotes around the test commands (alexeykondrat)
f853c58 Remove single quotes (alexeykondrat)
fa9f388 Automatically replacing CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICE… (alexeykondrat)
64f12e4 trying to run four instances in parallel (alexeykondrat)
dfb0e4c adding GPU number to container name (alexeykondrat)
27f6e00 Re-enabling LoRA tests (alexeykondrat)
bbe7696 Running four docker processes in parallel (alexeykondrat)
46682e3 Using pipefail option to propagate the error code (alexeykondrat)
b3fc101 Run 8 parallel jobs (alexeykondrat)
dd12ded Fixing pipe operator (alexeykondrat)
6c43c65 Removing comment (alexeykondrat)
5c6cdc3 Merge remote-tracking branch 'upstream/main' into lora_test_enablement (alexeykondrat)
c48c569 Resolving conflict in run-amd-test.sh and merging with main (alexeykondrat)
16ea3cf Update comment in .buildkite/run-amd-test.sh (alexeykondrat)
1de52bd Removed commented out string run-amd-test.sh (alexeykondrat)
Changes from commit 64f12e43cd648ffc12d3d48804ec0b0071bcd358: trying to run four instances in parallel
This is an incorrect implementation of the sharding. Buildkite should already have started X jobs under the same name. Each run script should just receive the environment variable and pass it along to the command. The current implementation is trying to run all shards in the same command.
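For illustration, the approach the reviewer describes can be sketched as follows: each Buildkite job reads its own shard index from the environment and appends it to the test command. This is a minimal sketch, not the project's run script. `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` are the variables Buildkite sets for parallel jobs; the `--num-shards` flag (paired with the `--shard-id` flag mentioned later in the thread) and the helper names are assumptions.

```python
import os
import shlex


def shard_args_from_buildkite(env=None):
    """Derive pytest sharding flags from Buildkite's parallel-job variables."""
    env = os.environ if env is None else env
    # Buildkite sets these on every job of a step that declares parallelism.
    # Defaults describe an unsharded run (assumption for local use).
    shard_id = int(env.get("BUILDKITE_PARALLEL_JOB", "0"))
    num_shards = int(env.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    if not 0 <= shard_id < num_shards:
        raise ValueError(f"shard {shard_id} out of range for {num_shards} shards")
    # --num-shards is an assumed flag name; --shard-id appears in the thread.
    return [f"--shard-id={shard_id}", f"--num-shards={num_shards}"]


def build_test_command(base_command, env=None):
    """Append this job's shard flags to a base test command (hypothetical helper)."""
    return shlex.split(base_command) + shard_args_from_buildkite(env)
```

With this shape, every job runs exactly one shard, and the run script never needs to know how many jobs exist beyond what the environment tells it.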
In more detail, it looks like they are indeed running in parallel within the same job:
https://buildkite.com/vllm/ci-aws/builds/7784#01919a58-e1eb-48b3-9fd5-872f0328e913
This might break more often than we want?
This invocation is compatible with the general way we launch tests. IMHO unless there is a problem with execution of the "payload" tests, we shouldn't be restricted in the way we implement the invocation logic.
Our engineering choices are dictated by the specific nature of our HW infrastructure and its initialization/decoupling.
We cannot use the Docker Buildkite plugin, so we have to parallelize the jobs ourselves. Our shell script receives the command with an empty "--shard-id=" argument, so we have to fill it in and run the shards as background jobs while exposing one GPU to each job.
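The substitution described here can be sketched roughly as below. This is an illustrative Python rendering under stated assumptions, not the contents of run-amd-test.sh: the helper names are hypothetical, the mapping of shard i to GPU i is an assumption, and `HIP_VISIBLE_DEVICES` is used as the ROCm variable that restricts which GPUs a process sees. The real script launches Docker containers; this sketch only launches plain subprocesses.

```python
import os
import re
import shlex
import subprocess

# Matches "--shard-id=" only when no value follows it.
EMPTY_SHARD_ID = re.compile(r"--shard-id=(?=\s|$)")


def plan_shards(command, num_shards):
    """Return one (command, extra_env) pair per shard, each pinned to one GPU."""
    if not EMPTY_SHARD_ID.search(command):
        raise ValueError("command has no empty --shard-id= placeholder")
    plans = []
    for shard in range(num_shards):
        shard_cmd = EMPTY_SHARD_ID.sub(f"--shard-id={shard}", command)
        # Assumption: shard i is given GPU i and nothing else.
        plans.append((shard_cmd, {"HIP_VISIBLE_DEVICES": str(shard)}))
    return plans


def run_parallel(plans, base_env=None):
    """Start every shard at once, wait for all, and fail if any shard failed."""
    base_env = dict(os.environ) if base_env is None else base_env
    procs = [
        subprocess.Popen(shlex.split(cmd), env={**base_env, **gpu_env})
        for cmd, gpu_env in plans
    ]
    exit_codes = [proc.wait() for proc in procs]
    return 1 if any(code != 0 for code in exit_codes) else 0
```

Collecting every exit code before deciding the result keeps a single failing shard from being masked by the others, which is the same concern the "pipefail" commit in this PR addresses on the shell side.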
Sorry, how many GPUs are there on each node? You can run multiple Buildkite agents on the host and pin each one to a GPU using an environment variable. This can drastically accelerate the tests.
My main concern with this approach is that sharding is now implemented differently here, which can cause issues when developers are debugging test failures on AMD devices.
Each node has 8 GPUs. The restarting procedure is indiscriminate, though: we restart all GPUs on a given node at once. This strategy has the advantage of completely decoupling tests from one another. The unfortunate downside is that we can't rely on multiple Buildkite agents running on the same host.
We achieved the current level of HW stability with this approach.
Got it. Let's refine this PR a bit and then we can merge it.