Skip to content

Conversation

@rtibbles
Copy link
Member

@rtibbles rtibbles commented Oct 30, 2025

Summary

  • Add worker_soft_shutdown_timeout (28s) to Celery config in settings.py
  • Add REMAP_SIGTERM=SIGQUIT to K8s shared env vars to trigger soft shutdown
  • Add CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT=28 env var for configurability
  • Add same env vars to docker-compose for dev environment consistency
  • Set terminationGracePeriodSeconds: 30 explicitly in worker deployment

When K8s sends SIGTERM during pod termination, workers will now:

  1. Stop accepting new tasks
  2. Continue processing current task for up to 28 seconds
  3. Exit cleanly if task completes, or timeout after 28s
  4. Allow K8s 2s buffer before 30s grace period expires

❗ 🤖 Generated with Claude Code

References

Fixes #5000

Reviewer guidance

Does this match the spec outlined by @bjester in the issue above? I think it does - noting that we have already upgraded to a compatible version of celery https://github.com/learningequality/studio/blob/hotfixes/requirements.txt#L31

Lastly - for manual testing - what happens when a long running publish is interrupted by this? Does it get properly rerun? I think it should because the "change event" should still be marked as unapplied, and not errored, but would be good to check. It will be a bit of a bummer that it will start from the beginning again, but it's better than not happening at all!

claude and others added 2 commits October 30, 2025 00:07
Implements soft shutdown feature from Celery 5.5.3 to prevent task
interruption during pod termination. Resolves #5000.

Changes:
- Add worker_soft_shutdown_timeout (28s) to Celery config in settings.py
- Add REMAP_SIGTERM=SIGQUIT to K8s shared env vars to trigger soft shutdown
- Add CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT=28 env var for configurability
- Add same env vars to docker-compose for dev environment consistency
- Set terminationGracePeriodSeconds: 30 explicitly in worker deployment

When K8s sends SIGTERM during pod termination, workers will now:
1. Stop accepting new tasks
2. Continue processing current task for up to 28 seconds
3. Exit cleanly if task completes, or timeout after 28s
4. Allow K8s 2s buffer before 30s grace period expires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
CELERY_RESULT_BACKEND_ENDPOINT: redis
CELERY_REDIS_PASSWORD: ""
REMAP_SIGTERM: "SIGQUIT"
CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT: "28"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're already defaulting to 28, this adds no value.

Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT rightfully needs coordination with the k8s graceful shutdown, like was done with the unused infra code here, but without a coordinating change with docker's timeout, setting it to the already default value in docker-compose.yml feels unnecessary. Google says Docker's default timeout should be 10 seconds. I suggest either we remove it from the docker compose file, or coordinate it with docker's timeout.

Lastly, as mentioned on Slack, we don't really use the infra code that was changed here, so we need to work with infra to implement those changes in the right place. The changes here could stay, or be removed 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants