dev: docker qol changes by carmichaelong · Pull Request #237 · opencap-org/opencap-core

carmichaelong · 2025-03-04T01:01:01Z

Addresses #227, #233, #234 (and a related openpose docker issue).

Use docker's restart on-failure mechanism to retry 3 times for openpose and mmpose containers
Add versioning to "Pulling trials..." message
Retry test trial additionally with URLErrors
Add a similar try/catch to the openpose docker loop that mmpose has. This might have caught a recent issue with an openpose docker container going down.
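For readers unfamiliar with the restart policy in the first item, here is a minimal sketch of how it could be expressed when starting a container with the docker CLI. This is an assumption for illustration only: the thread does not show how the containers are actually launched, and the helper name and image name below are hypothetical. The `--restart on-failure:N` flag itself is standard docker behavior: restart only after a non-zero exit, and give up after N attempts.

```python
from typing import List


def docker_run_command(image: str, max_retries: int = 3) -> List[str]:
    """Build a `docker run` command that restarts the container on failure.

    `--restart on-failure:N` asks docker to restart the container only when
    it exits with a non-zero status, and to stop trying after N restarts.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    return ["docker", "run", "-d", "--restart", f"on-failure:{max_retries}", image]


# Hypothetical image name, used only for illustration.
print(" ".join(docker_run_command("example/mmpose:latest")))
```

The retry count is a parameter here because the value 3 is a judgment call rather than a requirement.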

Most testing is straightforward, but testing the on-failure mechanism is less so. In case it's helpful, here are the steps I took:

Use sudo kill -9 <PID> to send SIGKILL to the process running the docker container.
To find the PID, I used ps -ef to list the running processes. Two ways to find it:

1. Start by finding the related mmpose or openpose container ID using docker ps. Then, find the process in ps -ef that contains that container ID.
2. When running ps -ef, there will be processes related to /mmpose/loop_mmpose.py and /openpose/loop_openpose.py. You can see what processes they depend on as well, and find the PID that way (it will be the same as option 1 if traced correctly).

I tested restarts at any time by using sudo kill with no trials running.
I also tested by reprocessing a trial, using sudo kill to stop the mmpose container and cause a processing error. Then, I reprocessed the trial again and let it run to completion.
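The PID lookup described above can be sketched in Python as a small filter over `ps -ef` output. This is only an illustration: the function name is hypothetical, and the sample text is made up to match the `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD); it is not output from a real machine.

```python
from typing import List


def find_pids(ps_output: str, needle: str) -> List[int]:
    """Return PIDs of `ps -ef` rows whose command column contains `needle`."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # the 8th field (CMD) may contain spaces
        if len(fields) == 8 and needle in fields[7]:
            pids.append(int(fields[1]))  # PID is the second column
    return pids


# Made-up sample in the `ps -ef` layout, for illustration only.
SAMPLE = """UID PID PPID C STIME TTY TIME CMD
root 1201 1 0 10:00 ? 00:00:01 containerd-shim -id abc123
root 1222 1201 5 10:00 ? 00:02:10 python3 /mmpose/loop_mmpose.py
"""

print(find_pids(SAMPLE, "/mmpose/loop_mmpose.py"))  # search by script path
print(find_pids(SAMPLE, "abc123"))  # search by container ID
```

On a real machine, the first argument would be the text returned by `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`.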

AlbertoCasasOrtiz · 2025-03-04T21:37:31Z

Use docker's restart on-failure mechanism to retry 3 times for openpose and mmpose containers - Tested by killing each container 3 times and checking that it was restarted. After the fourth kill, it was not restarted. As expected, opencap-local is never restarted.
Add versioning to "Pulling trials..." message - Checked in the docker console logs.
Retry test trial additionally with URLErrors - Tested by raising URLError manually
Add a similar try/catch to the openpose docker loop that mmpose has. This might have caught a recent issue with an openpose docker container going down. - Tested by running a few trials with openpose.
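A minimal sketch of the URLError retry and of the "raise it manually" test, under stated assumptions: the actual retry code is not shown in this thread, so the function names, the attempt count, and the delay below are all hypothetical. Only `urllib.error.URLError` is a real standard-library name. Note that `urllib.error.HTTPError` is a subclass of `URLError`, so a handler like this one catches both.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_url_error(action: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call `action`, retrying when it raises urllib.error.URLError."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except urllib.error.URLError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_s)
    raise ValueError("attempts must be at least 1")


calls = []


def flaky_request() -> str:
    # Raise URLError manually on the first two calls, then succeed.
    calls.append(1)
    if len(calls) < 3:
        raise urllib.error.URLError("simulated network failure")
    return "ok"


print(retry_on_url_error(flaky_request), len(calls))
```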

Questions:

What happens with the opencap docker container if either openpose or mmpose is killed the third time? I suppose it will keep accepting trials.
What approach did you follow to test the try/catch in the openpose docker loop? I may have missed something.

AlbertoCasasOrtiz

LGTM! See questions in my previous comment.

carmichaelong · 2025-03-04T22:42:22Z

What happens with the opencap docker container if either openpose or mmpose is killed the third time? I suppose it will keep accepting trials.

This is true, and mimics what happened before. I wasn't sure if we wanted to restart the container indefinitely, so I went with 3 somewhat arbitrarily. There could be some better changes to stop a container if another container is stopped, but it doesn't seem super straightforward. In theory, the test trial should be able to pick up on that, but would have to figure out how that could work better (maybe in another PR).

What approach did you follow to test the try/catch in the openpose docker loop? I may have missed something.

Good question, it was more of a quick test where I just added an Exception in the loop code. Open to ideas for better testing, though.
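One possible way to make that quick test repeatable is to inject the failure through a stand-in processing function instead of editing the loop code. The sketch below assumes a loop shaped like "process each job inside try/except, log the error, continue". The real loop_openpose.py is not shown in this thread, so every name here is hypothetical.

```python
import logging
import traceback
from typing import Callable, Iterable, List


def run_loop(jobs: Iterable[str], process: Callable[[str], None]) -> List[str]:
    """Process each job; log failures and keep going instead of exiting."""
    failed = []
    for job in jobs:
        try:
            process(job)
        except Exception:
            # Log the full traceback, then move on to the next job.
            logging.error("processing %s failed:\n%s", job, traceback.format_exc())
            failed.append(job)
    return failed


def faulty_process(job: str) -> None:
    # Injected failure for the test: only the second job raises.
    if job == "trial-2":
        raise RuntimeError("injected failure")


print(run_loop(["trial-1", "trial-2", "trial-3"], faulty_process))
```

If the loop survives the injected exception and still reaches the third job, the try/except is doing its job.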

AlbertoCasasOrtiz · 2025-03-06T06:16:43Z

This is true, and mimics what happened before. I wasn't sure if we wanted to restart the container indefinitely, so I went with 3 somewhat arbitrarily. There could be some better changes to stop a container if another container is stopped, but it doesn't seem super straightforward. In theory, the test trial should be able to pick up on that, but would have to figure out how that could work better (maybe in another PR).

I agree, this is way better than what we had before. Let's worry about that in another PR.

Good question, it was more of a quick test where I just added an Exception in the loop code. Open to ideas for better testing, though.

That's what I did, plus processing a few trials with openpose to make sure it works properly. I think that should be enough given it was working on hrnet and seems to work on openpose now.

AlbertoCasasOrtiz and others added 5 commits February 19, 2025 10:59

Merge pull request #231 from stanfordnmbl/dev

c10809f

main <- dev

make the error log human readable

a673646

attempt to restart openpose and mmpose docker containers 3 times on f…

e50b923

…ailure

add commit has to pulling trials message

5f47cba

also retry test trial on URLError

218ce69

carmichaelong requested a review from AlbertoCasasOrtiz March 4, 2025 01:01

add try/catch to openpose loop

13d4f0c

AlbertoCasasOrtiz approved these changes Mar 4, 2025

carmichaelong merged commit a38e1a7 into dev Mar 11, 2025

carmichaelong deleted the docker-stability branch March 11, 2025 17:45

carmichaelong merged 6 commits into dev from docker-stability
