Skip to content

dev: docker qol changes#237

Merged
carmichaelong merged 6 commits intodevfrom
docker-stability
Mar 11, 2025
Merged

dev: docker qol changes#237
carmichaelong merged 6 commits intodevfrom
docker-stability

Conversation

@carmichaelong
Copy link
Contributor

Addresses #227, #233, #234 (and a related openpose docker issue).

  • Use docker's restart on-failure mechanism to retry 3 times for openpose and mmpose containers
  • Add versioning to "Pulling trials..." message
  • Retry test trial additionally with URLErrors
  • Add a similar try/catch to the openpose docker loop that mmpose has. This might have caught a recent issue with an openpose docker container going down.

Most testing is straightforward, but testing the on-failure mechanism is less straightforward. In case it's helpful, steps I took:

  • Use sudo kill -9 <PID> to send an exit signal to the process running the docker container.
  • To find the PID, I used ps -ef to list the running processes. Two ways to find it:
  1. Start with finding the related mmpose or openpose container ID using docker ps. Then, find the process in ps -ef that contains that container ID.
  2. When running ps -ef, there will be processes related to /mmpose/loop_mmpose.py and /openpose/loop_openpose.py. You can see what processes they depend on as well, and find the PID that way (it will be the same as option 1 if traced correctly).
  • I tested restarts at any time by using sudo kill with no trials running.
  • I also tested by reprocessing a trial, using sudo kill to stop the mmpose container and cause a processing error. Then, I reprocessed the trial again and let it run to completion.

@AlbertoCasasOrtiz
Copy link
Member

  • Use docker's restart on-failure mechanism to retry 3 times for openpose and mmpose containers - Tested by killing each 3 times and checking it was restarted. The fourth time it would not be restarted. The opencap-local is never restarted as expected.
  • Add versioning to "Pulling trials..." message - Checked in the docker console logs.
  • Retry test trial additionally with URLErrors - Tested by raising URLError manually
  • Add a similar try/catch to the openpose docker loop that mmpose has. This might have caught a recent issue with an openpose docker container going down. - Tested by running a few trials with openpose.

Questions:

  • What happens with opencap docker container, if either openpose or mmpose are killed the third time? I supose it will keep accepting trials.
  • What approach did you follow to test the try/catch in the openpose docker loop? I may have missed something.

Copy link
Member

@AlbertoCasasOrtiz AlbertoCasasOrtiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! See questions in my previous comment.

@carmichaelong
Copy link
Contributor Author

carmichaelong commented Mar 4, 2025

What happens with opencap docker container, if either openpose or mmpose are killed the third time? I supose it will keep accepting trials.

This is true, and mimics what happened before. I wasn't sure if we wanted to restart the container indefinitely, so I went with 3 somewhat arbitrarily. There could be some better changes to stop a container if another container is stopped, but it doesn't seem super straightforward. In theory, the test trial should be able to pick up on that, but would have to figure out how that could work better (maybe in another PR).

What approach did you follow to test the try/catch in the openpose docker loop? I may have missed something.

Good question, it was more of a quick test where I just added an Exception in the loop code. Open to ideas for better testing, though.

@AlbertoCasasOrtiz
Copy link
Member

This is true, and mimics what happened before. I wasn't sure if we wanted to restart the container indefinitely, so I went with 3 somewhat arbitrarily. There could be some better changes to stop a container if another container is stopped, but it doesn't seem super straightforward. In theory, the test trial should be able to pick up on that, but would have to figure out how that could work better (maybe in another PR).

I agree, this is way better that what we had before. Let's worry about that in another PR.

Good question, it was more of a quick test where I just added an Exception in the loop code. Open to ideas for better testing, though.

That's what I did, plus processing a few trials with openpose to make sure it works properly. I think that should be enough given it was working on hrnet and seems to work on openpose now.

@carmichaelong carmichaelong merged commit a38e1a7 into dev Mar 11, 2025
@carmichaelong carmichaelong deleted the docker-stability branch March 11, 2025 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments