Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale down event. #869

Merged
merged 3 commits into from
Nov 4, 2021

Conversation

alexellis
Copy link
Member

@alexellis alexellis commented Nov 4, 2021

Description

Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale down event.

Motivation and Context

Prior to this change, a function could only drain for 30 seconds, then Kubernetes would forcibly kill it. After this change, in--flight function invocations can drain according to the write_timeout environment variable.

For example, this function would have failed to complete in-flight requests that ran for 1m30s.

functions:
  go-long:
    lang: golang-middleware
    handler: ./go-long
    image: alexellis2/graceful:0.2.1-newest
    environment:
      write_timeout: 2m
      read_timeout: 2m
      exec_timeout: 2m
      handler_wait_duration: 1m30s
      healthcheck_interval: 5s
    annotations:
      topic: "pipeline.subscription"
    labels:
      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 1

Fixes #853 #637

How Has This Been Tested?

I deployed openfaas with global timeouts of 2m as per the "Extended timeout" tutorial in the docs.

Then I deployed the go-long function with a 1m wait and 2m max write_timeout/exec_timeout.

Then I used hey and an extended -t (timeout flag) to schedule 5 requests.

At this point I scale down the function with kubectl scale and monitored the logs.

image (9)

Normally, the default of 30 seconds would have caused the functions to exit, but instead they stayed running and the pod remained in a terminating status.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have signed-off my commits with git commit -s
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Also related to this change is a watchdog update to allow the Pod to exit early, if all in-flight requests have completed:

openfaas/of-watchdog@31e1d32

Configures the Pod's termination grace period in seconds to
the write_timeout environment-variable, when available.
Otherwise, it's set to the default for Kubernetes, which is
30 seconds.

This change is important for allowing function containers
to drain active connections, without losing work in
progress.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Allow container registry to be overridden easily for
local builds

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
@alexellis
Copy link
Member Author

To test this patch see: How I test OpenFaaS changes with Kubernetes

Copy link
Member

@LucasRoesler LucasRoesler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question about how we structure the code, but it otherwise looks good. If you don't want to split it into a helper, just let me know and I will approve the PR

@alexellis
Copy link
Member Author

@LucasRoesler I think you also suggested adding 500ms +/- to the default write timeout for jitter.

The additional 2s should prevent an issue where the grace
period is exactly the same as the timeout, and kills the Pod
before the remaining requests have completed.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
@alexellis
Copy link
Member Author

@kevin-lindsay-1 would you like to take this for a spin?

You may need to delete your functions to try this.

@alexellis alexellis self-assigned this Nov 4, 2021
@kevin-lindsay-1
Copy link
Contributor

kevin-lindsay-1 commented Nov 4, 2021

@alexellis works good for me.

My testing procedure used Tilt to avoid needing to push the image to a registry

For the sake of anyone interested, here's how I did it:

  1. copy your existing deployment's yaml from k8s, save it to a file named gateway-dep.yaml
  2. change the image for the faas-netes container to simply be faas-netes; tilt will figure it out
  3. put the file in the faas-netes project root
  4. create a Tiltfile with the following contents:
# copied a `gateway-dep.yaml` from an existing deployment
k8s_yaml('./gateway-dep.yaml')

# build the faas-netes image
docker_build('faas-netes', '.')

# hack to copy over the .git folder
# https://github.com/tilt-dev/tilt/issues/2169
local_resource(
    'hack-copy_git',
    cmd='cp -R .git .newgit'
)

# deploy the gateway in kubernetes, tilt detects the newly built image
k8s_resource(
    'gateway',
    resource_deps=[
        'hack-copy_git'
    ]
)
  1. update the docker image to RUN mv .newgit .git after the COPY . .
  2. tilt up
  3. delete function(s), redeploy them to see changes

For anyone who might want to use this as an example, please note that tilt recently added a kubectl_build command which uses buildkit inside of k8s, so you don't have to duplicate images on your machine. Haven't used it yet, but probably will.

@alexellis
Copy link
Member Author

@alexellis works good for me.

💪

Thank you Kevin

@alexellis alexellis merged commit f0a81e6 into master Nov 4, 2021
@alexellis alexellis deleted the openfaasltd/grace-period-from-write-timeout branch November 4, 2021 17:10
alexellis added a commit that referenced this pull request Nov 5, 2021
This was missed from #869 and fixes the controller so that
it updates the termination grace period for functions.

The problem would be that if a user updated the
write_timeout variable used to compute the grace
period, it would be ignored.

The operator uses common code for deploy/update, so does
not need a separate change.

Thanks to @kevin-lindsay-1 for doing exploratory testing
to find this.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
alexellis added a commit that referenced this pull request Nov 5, 2021
This was missed from #869 and fixes the controller so that
it updates the termination grace period for functions.

The problem would be that if a user updated the
write_timeout variable used to compute the grace
period, it would be ignored.

The operator uses common code for deploy/update, so does
not need a separate change.

Thanks to @kevin-lindsay-1 for doing exploratory testing
to find this.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Functions being created w/ default TerminationGracePeriodSeconds
3 participants