[CI/Build][ROCm] Enabling LoRA tests on ROCm #7369

Merged Sep 4, 2024 (32 commits)

Commits
31655b7
Modifying test_quant_model.py, AWQ is not supported on ROCm
alexeykondrat Aug 9, 2024
26d8429
xfailing Gemma test for further investigation
alexeykondrat Aug 9, 2024
670e3c6
Enabling LoRA tests for AMD in Buildkite
alexeykondrat Aug 9, 2024
a58f019
Adding reason for Gemma test xfail
alexeykondrat Aug 9, 2024
39f9bec
Fixing MODELS re-definition
alexeykondrat Aug 9, 2024
d8207ed
Removing - csrc/punica for LoRA dependancies
alexeykondrat Aug 12, 2024
df0efb1
Sorting imports
alexeykondrat Aug 14, 2024
a4f78e5
Update alignment
alexeykondrat Aug 14, 2024
91ce9c4
Make yapf happy
alexeykondrat Aug 14, 2024
32f3a10
Update test_quant_model.py
alexeykondrat Aug 14, 2024
a103778
Update test_quant_model.py
alexeykondrat Aug 14, 2024
1d35abb
Make yapf(3.11) happy
alexeykondrat Aug 14, 2024
1d1a86c
Removing csrc/punica dependency for LoRA long context test test-pipe…
alexeykondrat Aug 14, 2024
85f76e9
Exposing single GPU to the container
alexeykondrat Aug 14, 2024
23017cc
Passing Bildkite env vars to container for pytest
alexeykondrat Aug 14, 2024
d7cde25
Explicitly setting number of parallel jobs(shards in pytest) to 1
alexeykondrat Aug 15, 2024
f7794a6
Removing unused arguments in test shell script
alexeykondrat Aug 16, 2024
d371265
Placing the quotes around the test commands
alexeykondrat Aug 16, 2024
f853c58
Remove single quotes
alexeykondrat Aug 16, 2024
fa9f388
Automatically replacing CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICE…
alexeykondrat Aug 16, 2024
64f12e4
trying to run four instances in parallel
alexeykondrat Aug 17, 2024
dfb0e4c
adding GPU number to container name
alexeykondrat Aug 17, 2024
27f6e00
Re-enabling LoRA tests
alexeykondrat Aug 17, 2024
bbe7696
Running four docker processes in parallel
alexeykondrat Aug 19, 2024
46682e3
Using pipefail option to propagate the error code
alexeykondrat Aug 19, 2024
b3fc101
Run 8 parallel jobs
alexeykondrat Aug 19, 2024
dd12ded
Fixing pipe operator
alexeykondrat Aug 19, 2024
6c43c65
Removing comment
alexeykondrat Aug 19, 2024
5c6cdc3
Merge remote-tracking branch 'upstream/main' into lora_test_enablement
alexeykondrat Aug 23, 2024
c48c569
Resolving conflict in run-amd-test.sh and merging with main
alexeykondrat Aug 28, 2024
16ea3cf
Update comment in .buildkite/run-amd-test.sh
alexeykondrat Sep 3, 2024
1de52bd
Removed commented out string run-amd-test.sh
alexeykondrat Sep 3, 2024
trying to run four instances in parallel
alexeykondrat committed Aug 22, 2024
commit 64f12e43cd648ffc12d3d48804ec0b0071bcd358
31 changes: 24 additions & 7 deletions .buildkite/run-amd-test.sh
@@ -70,20 +70,37 @@ HF_CACHE="$(realpath ~)/huggingface"
mkdir -p ${HF_CACHE}
HF_MOUNT="/root/.cache/huggingface"

commands=${@//"--shard-id= "/}
commands=${commands//"--num-shards= "/}
commands=${commands//CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES}

docker run \
commands=$@
PARALLEL_JOB_COUNT=4
if [[ $commands == *"--shard-id="* ]]; then
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
Collaborator

This is an incorrect implementation of sharding. Buildkite should have already started X jobs under the same name. Each run script should just receive the environment variable and pass it along to the command.

The current implementation is trying to run all shards in the same command.
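
A minimal sketch of the per-job approach described in this comment, assuming the pipeline step uses Buildkite's `parallelism` setting so that each job sees `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT`. The helper name `shard_args` and the pytest target are hypothetical and not part of this PR; the `--shard-id` and `--num-shards` flags are the ones that appear in the diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the shard flags for this one job from the
# variables Buildkite exports for parallel steps.
# Falls back to a single shard when run outside Buildkite.
shard_args() {
  local shard_id="${BUILDKITE_PARALLEL_JOB:-0}"
  local num_shards="${BUILDKITE_PARALLEL_JOB_COUNT:-1}"
  echo "--shard-id=${shard_id} --num-shards=${num_shards}"
}

# Each job would run exactly one shard; the test path is illustrative only.
echo "pytest -v -s lora $(shard_args)"
```

With this shape the script never loops over shards itself; the number of jobs is decided entirely by the pipeline definition.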

Collaborator

In more detail, it looks like they are indeed running in parallel within the same job:
https://buildkite.com/vllm/ci-aws/builds/7784#01919a58-e1eb-48b3-9fd5-872f0328e913

This might break more often than we want?

Contributor

This invocation is compatible with the general way we launch tests. IMHO unless there is a problem with execution of the "payload" tests, we shouldn't be restricted in the way we implement the invocation logic.

Contributor

Our engineering choices are dictated by the specific nature of our HW infrastructure and its initialization/decoupling.

Contributor Author

We cannot use the Docker Buildkite plugin, so we have to parallelize the jobs ourselves. Our shell script receives the command with an empty "--shard-id=" argument, so we have to substitute it and run the shards as background jobs while exposing one GPU to each job.
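
The substitution step described above can be sketched as follows. This is an illustration under assumptions, not the script from the PR: the function name `fill_shard_args` and the `template` command are hypothetical, and the loop only prints the per-GPU command lines instead of launching containers.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Replace the empty shard flags in a command string with concrete values.
# The trailing space in each pattern matches how the empty flags appear
# in the command string handed to the script.
fill_shard_args() {
  local cmd="$1" shard_id="$2" num_shards="$3"
  cmd=${cmd//"--shard-id= "/"--shard-id=${shard_id} "}
  cmd=${cmd//"--num-shards= "/"--num-shards=${num_shards} "}
  echo "${cmd}"
}

PARALLEL_JOB_COUNT=4   # value used in this commit
# Illustrative command only; the real one comes from the pipeline.
template="pytest -v -s lora --shard-id= --num-shards= "

for GPU in $(seq 0 $((PARALLEL_JOB_COUNT - 1))); do
  # Each shard would see exactly one GPU through HIP_VISIBLE_DEVICES.
  echo "HIP_VISIBLE_DEVICES=${GPU} $(fill_shard_args "${template}" "${GPU}" "${PARALLEL_JOB_COUNT}")"
done
```

Starting each substitution from the unmodified template matters: once the empty flag has been filled in for one GPU, the pattern no longer matches on later iterations.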

Collaborator

Sorry, how many GPUs are there on each node? You can run multiple Buildkite agents on the host and pin each one to a GPU using an environment variable. This can drastically accelerate the tests.

Collaborator

My main concern with this approach is that sharding is now implemented differently here, which can cause issues when developers are debugging test failures on AMD devices.

Collaborator

> We cannot use docker Buildkite plugin

Can I ask what the reason is? For sharding, it is not necessary to use the plugin. Sharding is a native option; see https://buildkite.com/docs/tutorials/parallel-builds

Contributor

Each node has 8 GPUs. The restart procedure is indiscriminate, though: we restart all GPUs on a given node at once. This strategy has the advantage of completely decoupling tests from one another. The unfortunate downside is that we can't rely on multiple Buildkite agents running on the same host.

We achieved the current level of HW stability with this approach.

Collaborator

Got it. Let's refine this PR a bit and then we can merge it in.

#replace shard arguments
commands=${@//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HIP_VISIBLE_DEVICES=${GPU} \
-e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \
-e HF_HOME=${HF_MOUNT} \
--name ${container_name} \
${image_name} \
/bin/bash -c "${commands}"

done
else
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \
-e HF_HOME=${HF_MOUNT} \
--name ${container_name} \
${image_name} \
/bin/bash -c "${commands}"
fi
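
As captured in this commit, the loop calls `docker run` in the foreground and reuses `${container_name}` for every shard; later commits in the list above mention adding the GPU number to the container name and propagating the error code. A hedged sketch of that general pattern follows, with a hypothetical `run_one_shard` stand-in that only echoes what it would run, so the block can be executed without Docker. The function names and the `rocm_demo` base name are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical stand-in for `docker run ... --name "${name}" ...`.
# Replace the echo with the real invocation.
run_one_shard() {
  local gpu="$1" name="$2" cmd="$3"
  echo "[${name}] HIP_VISIBLE_DEVICES=${gpu} ${cmd}"
}

# Launch one background job per GPU, then wait for each one and
# report failure if any shard exits non-zero.
run_all_shards() {
  local count="$1" base_name="$2" cmd="$3"
  local pids=() status=0 gpu pid
  for gpu in $(seq 0 $((count - 1))); do
    # A per-GPU suffix keeps the names of concurrent containers distinct.
    run_one_shard "${gpu}" "${base_name}_${gpu}" "${cmd}" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "${pid}" || status=1
  done
  return "${status}"
}

run_all_shards 4 "rocm_demo" "pytest -v -s lora"
```

Waiting on each PID individually, rather than using a bare `wait`, is what lets the script surface a failing shard as a non-zero exit status.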