[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. #4642

Merged
merged 9 commits into from
May 7, 2024

Conversation

Contributor

@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD commented May 7, 2024

This PR achieves the following goals:

  1. Corrects docker run interface to launch containers properly;

  2. Trims the number of AMD tests.

@@ -26,7 +26,7 @@ steps:
   - label: "AMD: {{ step.label }}"
     agents:
       queue: amd
-    command: bash .buildkite/run-amd-test.sh "'cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}'"
+    command: bash .buildkite/run-amd-test.sh "cd {{ (step.working_dir or default_working_dir) | safe }} ; {{ step.command or (step.commands | join(" ; ")) | safe }}"
Collaborator

This might not propagate a failure from the bash command inside: if a test fails, the whole command will not exit with status 1.

Contributor Author

@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD May 7, 2024

I've checked (https://buildkite.com/vllm/ci/builds/6722)
It does fail on the failing test inside.

Even on the partially failing test, it still fails. See e.g. "AMD: Speculative decoding tests" or "AMD: Models Test" or "AMD: Engine Test" in the above build.
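The concern raised in this review thread can be illustrated with a minimal sketch of generic bash behaviour. It does not reproduce what .buildkite/run-amd-test.sh does with its argument; the variable names are illustrative only. A `&&` chain stops at the first failure, while a `;` chain reports the status of its last command unless errexit (`set -e`) is in effect:

```shell
#!/usr/bin/env bash
# Generic bash semantics only; this is NOT the actual CI script.

# '&&' chain: stops at the first failing command and returns its status.
status_and=0
bash -c 'false && true' || status_and=$?

# ';' chain: every command runs; the status is that of the LAST command.
status_semi=0
bash -c 'false ; true' || status_semi=$?

# ';' chain whose last command fails: the failure is reported.
status_semi_last=0
bash -c 'true ; false' || status_semi_last=$?

# ';' chain with errexit enabled: the first failure aborts the chain.
status_errexit=0
bash -c 'set -e ; false ; true' || status_errexit=$?

echo "and=$status_and semi=$status_semi semi_last=$status_semi_last errexit=$status_errexit"
```

Under these semantics, a `;`-joined list of test commands hides an early failure only when a later command succeeds and errexit is off, which is why checking an actual build, as done above, is the decisive test.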

@simon-mo simon-mo merged commit 478aed5 into vllm-project:main May 7, 2024
55 checks passed
Collaborator

comaniac commented May 7, 2024

Looks like the AMD CI is broken after this PR? I saw the same error message in many CI runs for AMD tests:

Unable to open /dev/kfd read-write: Invalid argument
Failed to get user name to check for render group membership
🚨 Error: The command exited with status 1

Contributor Author

Alexei-V-Ivanov-AMD commented May 7, 2024

> Looks like the AMD CI is broken after this PR? I saw the same error message in many CI runs for AMD tests:

No, this error has been shown to be unrelated to the present PR. We have definitely seen this issue before this PR.
The best current remedy is to request a "re-run" of the failed test. All CI nodes are affected at times, regardless of their HW or SW particularities. The error occurs in "bursts": if you see it, wait about 1 minute and request a re-run.
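The "wait and re-run" remedy described above is a manual step in the CI UI. For anyone reproducing the tests locally, the same idea could be sketched as a small retry helper. `retry_with_delay` is a hypothetical function written for this illustration and is not part of the vLLM CI scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the vLLM CI): run a command up to
# max_attempts times, sleeping delay_seconds between failed attempts.
retry_with_delay() {
  local max_attempts=$1 delay_seconds=$2
  shift 2
  local attempt=1 status=0
  while true; do
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      # Give up and report the status of the last attempt.
      return "$status"
    fi
    echo "attempt $attempt failed with status $status; retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}

# Example: up to 3 attempts, 60 seconds apart (the command is a placeholder).
# retry_with_delay 3 60 some_test_command --its-arguments
```

Note that this sketch retries on any non-zero status, so it would also repeat genuine test failures; it cannot distinguish them from the transient /dev/kfd error.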

Collaborator

comaniac commented May 7, 2024

I see. I'm also working on #4535, which changes the AMD kernels a bit, but I keep seeing compilation errors that I didn't see in the NVIDIA build, so I tried to find an existing successful build for reference. If you have any idea about that error (https://buildkite.com/vllm/ci/builds/6803#018f550a-e708-4a0a-a48c-97a5a4d85a40/1106-1856), please let me know.

Contributor Author

Alexei-V-Ivanov-AMD commented May 7, 2024

> I see. I'm also working on #4535, which changes the AMD kernels a bit, but I keep seeing compilation errors that I didn't see in the NVIDIA build, so I tried to find an existing successful build for reference. If you have any idea about that error (https://buildkite.com/vllm/ci/builds/6803#018f550a-e708-4a0a-a48c-97a5a4d85a40/1106-1856), please let me know.

The error you're referring to appears to be a CMake error during the container build. It apparently persists through multiple attempts across different AMD tests in the referenced build. It is definitely not related to PR #4642, though, as ROCm containers were being built before it.

CMake must have complained about something at some point above the final error message. To isolate and analyze the cause of the error in this CI build, make a fresh clone of the repo and then build a standard ROCm docker container:

...vllm$ docker build -t {container name} -f Dockerfile.rocm .

That's how it gets built in the CI anyway (see the echo "--- Building container" step).

The stdout dump will give you plenty of information about your issue.
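Since the useful CMake message is usually well above the final error line, one way to find it is to save the build output to a file and search for the first "CMake Error" line. The following is a minimal sketch: the helper name `first_cmake_error`, the image tag `vllm-rocm-debug`, and the log file name `build.log` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first line containing "CMake Error",
# with line numbers and up to 5 lines of context on either side.
# Returns non-zero if the log contains no such line.
first_cmake_error() {
  local log_file=$1
  grep -n -m 1 -B 5 -A 5 'CMake Error' "$log_file"
}

# Typical use from a fresh clone (image tag and log name are placeholders):
#   docker build -t vllm-rocm-debug -f Dockerfile.rocm . 2>&1 | tee build.log
#   first_cmake_error build.log
```

Piping through `tee` keeps the full output on screen while saving it, but the pipeline then reports the exit status of `tee` rather than `docker build` unless `set -o pipefail` is enabled.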

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 8, 2024