ci: AMD/ROCm builds for the rocm detector #9796

Merged
merged 1 commit into from
Feb 12, 2024


harakas
Contributor

@harakas harakas commented Feb 11, 2024

GitHub CI builds for the AMD/ROCm platform. This should create the following builds:

  • buildtag-rocm
  • buildtag-rocm-gfx900
  • buildtag-rocm-gfx1030
  • buildtag-rocm-gfx1100

I have no experience with GitHub continuous integration and copy-pasted existing and recommended lines. I have not tested it. Please review carefully. @NickM-27


netlify bot commented Feb 11, 2024

Deploy Preview for frigate-docs ready!

Name Link
🔨 Latest commit d5ee084
🔍 Latest deploy log https://app.netlify.com/sites/frigate-docs/deploys/65c8c7a6b12f7e00088b3bf9
😎 Deploy Preview https://deploy-preview-9796--frigate-docs.netlify.app

@NickM-27
Collaborator

So just to clarify: the Dockerfile uses the env variables that are set to decide what to build, is that correct? It looks like the same target is being built, just with different env vars.

@harakas
Contributor Author

harakas commented Feb 11, 2024

> So just to clarify: the Dockerfile uses the env variables that are set to decide what to build, is that correct? It looks like the same target is being built, just with different env vars.

No, the env is converted into Dockerfile arguments in docker/rocm/rocm.hcl. The hcl file reads variables from the environment and passes them on to the Dockerfile as build arguments (declared there with the ARG keyword).
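
One way to check this mapping without running a build is `docker buildx bake --print`, which resolves the bake file and prints the target definition, including the `args` handed to the Dockerfile. The following is only a sketch under assumptions: the bake target in docker/rocm/rocm.hcl is assumed to be named `rocm`, the `AMDGPU` values other than `gfx` are inferred from the build tags in the description, and `bake_print_cmd` is a hypothetical helper that merely assembles the command line.

```shell
#!/bin/sh
# Hypothetical helper: print the command that would show how bake turns the
# AMDGPU environment variable into Dockerfile build args.
# Assumption: the target in docker/rocm/rocm.hcl is called "rocm".
bake_print_cmd() {
    amdgpu="$1"
    echo "AMDGPU=${amdgpu} docker buildx bake --file docker/rocm/rocm.hcl --print rocm"
}

# One command per build from the PR description (values partly assumed).
for gpu in gfx gfx900 gfx1030 gfx1100; do
    bake_print_cmd "$gpu"
done
```

Running one of the printed commands from the repository root should show the resolved `args` block for that GPU family, which makes it visible that the same target is built with different arguments.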

@blakeblackshear blakeblackshear merged commit 09153a1 into blakeblackshear:dev Feb 12, 2024
10 checks passed
@harakas
Contributor Author

harakas commented Feb 12, 2024

The CI failed:

https://github.com/blakeblackshear/frigate/actions/runs/7871668878
https://github.com/blakeblackshear/frigate/actions/runs/7871668878/job/21475415538

"No space left on device"? But did that already happen during the tensorrt build, before the rocm build was reached? The first link does not say where the error happened, and the second shows where the build stopped but not the error. Strange.

Who can investigate and fix this? @NickM-27

@harakas
Contributor Author

harakas commented Feb 12, 2024

Thinking about it, TensorRT and CUDA are probably filling up the Docker cache. A possible solution would be to clean up the Docker cache between builds. It is a bit of a waste, but the easiest option would be to prune the system. Something like this:

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f2cdc91a..34274030 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -47,6 +47,9 @@ jobs:
             tensorrt.tags=${{ steps.setup.outputs.image-name }}-tensorrt
             *.cache-from=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64
             *.cache-to=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64,mode=max
+      - name: Prune docker cache
+        run: |
+          docker system prune -a -f
       - name: AMD/ROCm general build
         env:
           AMDGPU: gfx

@NickM-27
Collaborator

I don't see how the AMD flows would cause the tensorrt build to fail. Also, the AMD builds don't write to the cache, so they shouldn't make a difference. Most likely some old images just need to be cleared out.

@harakas
Contributor Author

harakas commented Feb 12, 2024

There are more logs now: https://github.com/blakeblackshear/frigate/actions/runs/7871668878/job/21479613695 -- the disk ran out in the general rocm build, after tensorrt had completed.

Dockerfile:50
--------------------
  48 |     ARG ROCM
  49 |     
  50 | >>> COPY --from=rocm /opt/rocm-$ROCM /opt/rocm-$ROCM
  51 |     RUN ln -s /opt/rocm-$ROCM /opt/rocm
  52 |     
--------------------
ERROR: failed to solve: failed to copy files: copy file range failed: no space left on device
Error: buildx bake failed with: ERROR: failed to solve: failed to copy files: copy file range failed: no space left on device

@NickM-27
Collaborator

Perhaps the single image is too large. Maybe try building only the specific models and not the general one.

@harakas
Contributor Author

harakas commented Feb 13, 2024

ROCm is quite bloated. The installation itself (from which libraries are copied to reduce the image size) is already 17GB.

$ du -sk /opt/rocm/
17593312	/opt/rocm/
$ docker image ls|grep rocm
frigate                           latest-rocm           c75fd9285b04   3 minutes ago    7.47GB
frigate                           latest-rocm-gfx1100   0e74277ba99e   3 minutes ago    3.64GB
frigate                           latest-rocm-gfx1030   66e424f74954   3 minutes ago    3.55GB
frigate                           latest-rocm-gfx900    ac68fe8d2510   4 minutes ago    3.48GB
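
To put the numbers above side by side, here is a small arithmetic sketch. It assumes `du -sk` reports 1024-byte blocks (the GNU default) and that the image sizes are the decimal GB values printed by `docker image ls`. The helper names `kib_to_gib` and `sum_gb` are made up for this example. Since the images likely share layers, the sum is an upper bound on the disk space they occupy together, not a measurement.

```shell
#!/bin/sh
# Convert a size in KiB (as printed by `du -sk`) to GiB with two decimals.
kib_to_gib() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1024 / 1024 }'
}

# Sum a list of sizes given in GB, one per argument.
sum_gb() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

kib_to_gib 17593312          # /opt/rocm from the du output; prints 16.78
sum_gb 7.47 3.64 3.55 3.48   # the four rocm images; prints 18.14
```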

Are you sure the local Docker layer cache is cleaned between steps? There is nothing about it in the logs. It might be worth trying the docker system prune -af I suggested. But I don't know much about how GitHub Actions or the build environment work -- I have no experience with them.
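
The question of whether the cache is cleaned between steps could be answered from the logs by printing disk usage around each build. The script below is a hedged sketch of such a diagnostic, not something taken from the existing workflow: `report_disk` is a hypothetical helper, and the prune itself is left commented out because it deletes all unused images. The body could be pasted into a `run:` step.

```shell
#!/bin/sh
# Hypothetical diagnostic: report free space on / and Docker's own usage.
report_disk() {
    label="$1"
    echo "== disk usage: ${label} =="
    df -h /
    if command -v docker >/dev/null 2>&1; then
        # Summarizes space used by images, containers, volumes and build cache.
        docker system df || echo "docker daemon not reachable"
        # Cache held by the current buildx builder, if one is in use.
        docker buildx du >/dev/null 2>&1 && docker buildx du | tail -n 3
    else
        echo "docker not installed, skipping docker checks"
    fi
}

report_disk "before prune"
# docker system prune -af   # the cleanup suggested above; destructive
report_disk "after prune"
```

Comparing the two reports in the job log would show whether pruning actually frees space on the runner, or whether the space is held somewhere else, such as a buildx builder's own cache.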
