ci: AMD/ROCm builds for the rocm detector #9796
Conversation
✅ Deploy Preview for frigate-docs ready!
So just to clarify: the Dockerfile uses the env variables that are set to decide what to build, is that correct? It looks like the same target is being built, just with different env vars.
No, the env is converted into Dockerfile arguments in docker/rocm/rocm.hcl. The hcl reads the variables from the environment and passes them on as arguments (described by …).
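For readers not familiar with buildx bake, here is a minimal sketch of that flow. Only the AMDGPU variable name and the docker/rocm/rocm.hcl path come from this thread; the target name `rocm` and the tag are illustrative assumptions.

```bash
# Sketch only: `docker buildx bake` picks up environment variables whose
# names match `variable` blocks declared in the .hcl file (e.g. AMDGPU),
# and the HCL can forward them to the Dockerfile as build args.
# The target name "rocm" and the tag below are assumptions for illustration.
AMDGPU=gfx1030 docker buildx bake \
  --file docker/rocm/rocm.hcl \
  --set rocm.tags=frigate:latest-rocm-gfx1030 \
  rocm
```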
The CI failed: https://github.com/blakeblackshear/frigate/actions/runs/7871668878 -- "No space left on device"? But that happened before even reaching the rocm build, at the tensorrt build? The first link does not mention where the error happened, and the other does not mention the error, only where it stopped. Strange. Who can investigate and fix this? @NickM-27
Thinking about it: Tensorrt and CUDA are probably filling up the docker cache. Perhaps a solution would be to clean up the docker cache. A bit of a waste, but the easiest would be to prune the system. Something like this:

```diff
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f2cdc91a..34274030 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -47,6 +47,9 @@ jobs:
             tensorrt.tags=${{ steps.setup.outputs.image-name }}-tensorrt
             *.cache-from=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64
             *.cache-to=type=registry,ref=${{ steps.setup.outputs.cache-name }}-amd64,mode=max
+      - name: Prune docker cache
+        run: |
+          docker system prune -a -f
       - name: AMD/ROCm general build
         env:
           AMDGPU: gfx
```
I don't see how the AMD flows would cause the tensorrt build to fail, especially since the AMD builds don't write to the cache, so they shouldn't make a difference. Most likely some old images just need to be cleared out.
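To illustrate that caching point (with placeholder target and registry names, not taken from the workflow): a bake invocation that only sets `cache-from` reads layers from the registry cache but never pushes new ones back, so it cannot grow the shared cache.

```bash
# Sketch with placeholder names: cache-from only, no cache-to, so this
# build consumes the shared registry cache without ever writing to it.
docker buildx bake \
  --file docker/rocm/rocm.hcl \
  --set '*.cache-from=type=registry,ref=ghcr.io/example/frigate:cache-amd64' \
  rocm
```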
There are more logs now: https://github.com/blakeblackshear/frigate/actions/runs/7871668878/job/21479613695 -- the disk ran out in the general rocm build, after tensorrt had completed.
Perhaps the single image is too large. Maybe we could try only building the specific models and not the general one.
ROCm is quite bloated. The installation (from which libs are copied to reduce image size) is already 17 GB by itself:

```
$ du -sk /opt/rocm/
17593312    /opt/rocm/
$ docker image ls|grep rocm
frigate   latest-rocm           c75fd9285b04   3 minutes ago   7.47GB
frigate   latest-rocm-gfx1100   0e74277ba99e   3 minutes ago   3.64GB
frigate   latest-rocm-gfx1030   66e424f74954   3 minutes ago   3.55GB
frigate   latest-rocm-gfx900    ac68fe8d2510   4 minutes ago   3.48GB
```

Are you sure the local docker layer cache is cleaned between steps? There is nothing about it in the logs. Might be worth trying the …
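One way to act on that question would be a step that reports and reclaims disk between builds; the commands below are standard Docker CLI, but whether they fit this workflow is an assumption, not something confirmed in the thread.

```bash
# Inspect how much space images and build cache are using on the runner,
# then drop the buildx cache and unused images before the next build step.
docker system df
docker builder prune --all --force
docker image prune --all --force
```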
GitHub CI builds for the AMD/ROCm platform. This should create the following builds:
I have no experience with GitHub continuous integration and copy-pasted existing and recommended lines. I have not tested it, so please review carefully. @NickM-27