Skip to content

Add support for container restart policy#1258

Open
JaewonHur wants to merge 11 commits intoapple:mainfrom
JaewonHur:container-restart
Open

Add support for container restart policy#1258
JaewonHur wants to merge 11 commits intoapple:mainfrom
JaewonHur:container-restart

Conversation

@JaewonHur
Copy link
Contributor

@JaewonHur JaewonHur commented Feb 24, 2026

This PR initiates the implementation for adding restart feature (#286), which can be extended to support auto-start later (#158, #1201). The first milestone is to implement restart policies no, on-failure, and always (which behaves similar to docker-restart-policy).

To support restart policy, following persistent and runtime state variables should be added:

  • In ContainerCreateOption (persistent)
    • RestartPolicy: remember restart policy across the container life cycle.
  • In ContainerState (run-time)
    • StoppedState(startError: Error?, exitStatus: ExitStatus?): determine restart logic when container exits
    • manualStopped: Bool: remember if the container is stopped by user
      • True on first container creation, False on loadAtBoot
    • backOff: Duration?: remember backoff timer.
      • Default to nil
      • Set 100ms on first non-manual exit (and then doubled until 10 seconds)
      • Set nil on startError or manualStopped

A new state should be added in container life cycle:

  • restarting: restarted containers first go to this state and waits for backOff to start.

Implementation steps are described below:

  • [ContainersService] Add state variables for restart policy
  • [ContainerResource, ContainerCommand] Add restart policy and expose command line.
  • [ContainersService] Add restarting state, where the restarting containers first visit, and proceeds to bootstrap (after backOff) then startProcess.
  • [ContainersService] Implement restart logic in handleContainerExit
    • [restart == no]
      • Idle (stay at stopped).
    • [restart == on-failure]
      • startError || manualStopped || exitStatus == nil
        • Idle
      • exitStatues != nil && exitStatus != 0
        • Restart
    • [restart == always]
      • startError || manualStopped
        • Idle
      • Else
        • Restart
  • Add stability call, which marks the container to be stable 10 seconds after running---i.e., resetting the backOff.
    • It might be something like stabilityMonitor (similar to ExitMonitor) that registers callback on successful startProcess.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

Users can setup auto-restarting containers using this feature.

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

Containers move to .bootstrapped state after assigning SandboxClient.
Following invariant holds:
 1. If .stopped, state.client = nil
 2. If .bootstrapped, state.client != nil.

It'd better to tightly couple the state and client.
Then, add `restarting` state to ContainerStatus
ContainersService spawn an unstructured Task for restart scheduler,
which receives stopped container ids through a async queue and restart
them if matching conditions.

While it doesn't run under lock, it doesn't introduce a new level of race, as
all the actual operations changing container state (e.g., `bootstrap`)
are lock protected.
This reuses ExitMonitor, which was not initially for this. It might be
better to come up with generic Monitor structure?
@saehejkang
Copy link
Contributor

@JaewonHur Busy week and I hope to get to review this here soon. I saw your comment and I am all for following the direction we take in this PR. Though it might be worthwhile to get started on the discussion on how we want to handle test cases and maybe work together on that piece?

doStop cannot guarantee the container is stopped in short term.
@JaewonHur
Copy link
Contributor Author

Hi @saehejkang, sorry for moving forward on my own.

I just did minimal implementation today, but I'll definitely put our name together here.
This PR now just contains prototype implementation, and anything can be changed! Please take a look :)

Further things to do might be:

  1. supporting --restart systemStart (or make always to start on system start?)
  2. Add restart counter for onFailure.
  3. Also, testing for restarting container across system can be added.

Copy link
Contributor

@saehejkang saehejkang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over things as best as I could for now. Lots of changes that I have not gotten a chance to dig deeper into (the scheduler). Don't worry about taking over and wrapping this up!

}

public static let `default` = ContainerCreateOptions(autoRemove: false)
public init(from decoder: Decoder) throws {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the decoder not being used before when it was just autoRemove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have no idea :)
Maybe, wanted to keep the code as simple as possible at that time?

case stopped
/// The object is waiting to be restarted.
case restarting
/// The object is currently bootstrapped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the bootstrapped status mean? The container or sandbox is stopped but ready to be started?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bootstrapped means (by itself) the container is bootstrapped (i.e., bootstrap function has been successfully executed on that container).

Before this update, the state was same stopped whether it is bootstrapped or not. The only difference was SandboxClient is allocated or not---I personally felt that somewhat weird, but left there.

But after adding restarting state, it becomes more weird, since container state should go restarting (waiting for backOff), then stopped (after bootstrap) then running`.

BTW, now we split the RuntimeStatus to ContainerStatus and SandboxStatus, so it only applies to containers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While writing this comment, I found that I missed setting the container restarting in scheduler lol.

return results
}

public func runRestartScheduler() throws {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never thought I would see a scheduler again after OS in college 😆.

@saehejkang
Copy link
Contributor

supporting --restart systemStart (or make always to start on system start?)

Solid foundation work was done with the restart policy and I am sure it could follow the same pattern?

Add restart counter for onFailure.

What would be the use case for this? Could it be used to automatically stop trying to restart a container after a certain count?

Also, testing for restarting container across system can be added.

Hopefully we can use something like #1176 to be a sort of final e2e test of all the policies.

@JaewonHur
Copy link
Contributor Author

Solid foundation work was done with the restart policy and I am sure it could follow the same pattern?

Thanks, if you make PR to my branch, let me merge it :)

What would be the use case for this? Could it be used to automatically stop trying to restart a container after a certain count?

Yes. Docker supports that functionality (https://docs.docker.com/engine/containers/start-containers-automatically/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants