ci: AMD/ROCm builds for the rocm detector #9796

harakas · 2024-02-11T13:12:03Z

GitHub CI builds for the AMD/ROCm platform. This should create the following builds:

buildtag-rocm
buildtag-rocm-gfx900
buildtag-rocm-gfx1030
buildtag-rocm-gfx1100

I have no experience with GitHub continuous integration, so I copy-pasted existing and recommended lines and have not tested them. Please review carefully. @NickM-27

netlify · 2024-02-11T13:12:07Z

✅ Deploy Preview for frigate-docs ready!

Name	Link
🔨 Latest commit	`d5ee084`
🔍 Latest deploy log	https://app.netlify.com/sites/frigate-docs/deploys/65c8c7a6b12f7e00088b3bf9
😎 Deploy Preview	https://deploy-preview-9796--frigate-docs.netlify.app

NickM-27 · 2024-02-11T13:17:59Z

so just to clarify, the docker file uses the env variables that are set to decide what to build, is that correct? looks like the same target is being built just with different env vars

harakas · 2024-02-11T14:11:34Z

> so just to clarify, the docker file uses the env variables that are set to decide what to build, is that correct? looks like the same target is being built just with different env vars

No, the env is converted into Dockerfile arguments in docker/rocm/rocm.hcl. The hcl reads variables from the environment and passes them on as arguments (described by ARG keyword) to the Dockerfile.
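
As a minimal sketch of that mechanism (not the actual contents of docker/rocm/rocm.hcl, which may differ): a bake `variable` block takes its value from an environment variable of the same name when one is set, and the target's `args` map hands that value to the matching ARG in the Dockerfile. The sketch file, its default value, the Dockerfile path and the target name below are all assumptions made for illustration.

```shell
#!/bin/sh
# Write a hypothetical, trimmed-down bake file to a scratch location.
# This is NOT the real docker/rocm/rocm.hcl; names and defaults are assumed.
sketch="${TMPDIR:-/tmp}/rocm-sketch.hcl"
cat > "$sketch" <<'EOF'
# Overridden by an environment variable named AMDGPU, if one is set.
variable "AMDGPU" {
  default = "gfx900"
}

target "rocm" {
  dockerfile = "docker/rocm/Dockerfile"
  # Passed to the Dockerfile, where "ARG AMDGPU" receives the value.
  args = {
    AMDGPU = AMDGPU
  }
}
EOF

# If Docker Buildx is available, --print shows the resolved configuration
# without building anything, so the effect of the env var can be inspected.
if command -v docker >/dev/null 2>&1; then
    AMDGPU=gfx1030 docker buildx bake --file "$sketch" --print rocm || true
fi
```

Under these assumptions, the same target is built each time and only the environment of the CI step changes which GPU family ends up in the image.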

harakas · 2024-02-12T14:05:22Z

The CI failed:

https://github.com/blakeblackshear/frigate/actions/runs/7871668878
https://github.com/blakeblackshear/frigate/actions/runs/7871668878/job/21475415538

"No space left on device"? But did that happen during the tensorrt build, before the rocm build was even reached? The first link does not say where the error happened, and the second shows where the job stopped but not the error. Strange.

Who can investigate and fix this? @NickM-27

harakas · 2024-02-12T20:40:19Z

Thinking about it, tensorrt and cuda are probably filling up the docker cache. One solution would be to clean up the docker cache. It is a bit of a waste, but the easiest approach would be to prune the system, something like this:

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f2cdc91a..34274030 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -47,6 +47,9 @@ jobs:
             tensorrt.tags=${{ steps.setup.outputs.image-name }}-tensorrt
             *.cache-from=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64
             *.cache-to=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64,mode=max
+      - name: Prune docker cache
+        run: |
+          docker system prune -a -f
       - name: AMD/ROCm general build
         env:
           AMDGPU: gfx

NickM-27 · 2024-02-12T20:47:34Z

I don't see how the AMD flows would cause the tensorrt build to fail. Also, the AMD builds don't write to the cache, so they shouldn't make a difference. Most likely some old images just need to be cleared out.

harakas · 2024-02-12T21:43:36Z

There are more logs now: https://github.com/blakeblackshear/frigate/actions/runs/7871668878/job/21479613695. The disk ran out in the general rocm build, after tensorrt had completed.

Dockerfile:50
--------------------
  48 |     ARG ROCM
  49 |     
  50 | >>> COPY --from=rocm /opt/rocm-$ROCM /opt/rocm-$ROCM
  51 |     RUN ln -s /opt/rocm-$ROCM /opt/rocm
  52 |     
--------------------
ERROR: failed to solve: failed to copy files: copy file range failed: no space left on device
Error: buildx bake failed with: ERROR: failed to solve: failed to copy files: copy file range failed: no space left on device

NickM-27 · 2024-02-12T22:11:30Z

Perhaps the single image is too large. Maybe we could try building only the specific models and not the general one.

harakas · 2024-02-13T11:07:49Z

ROCm is quite bloated. The installation itself (from which the libs are copied to reduce image size) is already 17GB.

$ du -sk /opt/rocm/
17593312	/opt/rocm/
$ docker image ls|grep rocm
frigate                           latest-rocm           c75fd9285b04   3 minutes ago    7.47GB
frigate                           latest-rocm-gfx1100   0e74277ba99e   3 minutes ago    3.64GB
frigate                           latest-rocm-gfx1030   66e424f74954   3 minutes ago    3.55GB
frigate                           latest-rocm-gfx900    ac68fe8d2510   4 minutes ago    3.48GB
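
To put a number on the listing above, the sizes can be summed with a short awk filter. This is only a rough upper bound: `docker image ls` reports each image's full size, and layers shared between the four tags are counted once per tag here, so the real disk footprint is likely lower. The helper name `total_gb` is made up for this sketch, and it assumes every size ends in `GB`.

```shell
#!/bin/sh
# Image listing copied from the output above (whitespace collapsed).
image_listing='frigate latest-rocm c75fd9285b04 3 minutes ago 7.47GB
frigate latest-rocm-gfx1100 0e74277ba99e 3 minutes ago 3.64GB
frigate latest-rocm-gfx1030 66e424f74954 3 minutes ago 3.55GB
frigate latest-rocm-gfx900 ac68fe8d2510 4 minutes ago 3.48GB'

# Sum the last column of each line, stripping the trailing "GB" unit.
total_gb() {
    awk '{ size = $NF; sub(/GB$/, "", size); total += size }
         END { printf "%.2f\n", total }'
}

printf '%s\n' "$image_listing" | total_gb   # prints 18.14
```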

Are you sure the local docker layer cache is cleaned between steps? There is nothing about it in the logs. It might be worth trying the docker system prune -af I suggested, but I don't know much about how GitHub Actions or the build environment work; I have no experience with them.
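
One way to answer that question from the logs would be to print the free disk space before and after a prune inside a CI step. The following is a sketch under assumptions, not something tested on the GitHub runners: the function names `free_kb` and `report_and_prune` are hypothetical, and the prune only runs when `--prune` is passed, so defining the functions has no side effects.

```shell
#!/bin/sh
# Print the free space, in kilobytes, of the filesystem that holds path $1.
# Uses POSIX-format df output, where column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Log free space, optionally prune all unused docker data, log again.
# Pass --prune to actually delete; without it this only reports.
report_and_prune() {
    echo "free before: $(free_kb /) KB"
    if [ "$1" = "--prune" ] && command -v docker >/dev/null 2>&1; then
        docker system df            # per-type usage: images, containers, cache
        docker system prune -a -f   # same command as in the proposed diff
    fi
    echo "free after: $(free_kb /) KB"
}
```

Calling `report_and_prune --prune` between the tensorrt and rocm steps would show in the job log whether the earlier builds are what is filling the disk.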

ci: rocm builds

d5ee084

NickM-27 approved these changes Feb 11, 2024

blakeblackshear merged commit 09153a1 into blakeblackshear:dev Feb 12, 2024
10 checks passed
