Skip to content

SYSTEM PRESHUTDOWN command for graceful shutdown swarm node #852

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: antalya-25.3
Choose a base branch
from

Conversation

ianton-ru
Copy link

@ianton-ru ianton-ru commented Jun 10, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

SYSTEM PRESHUTDOWN command for graceful shutdown swarm node

Documentation entry for user-facing changes

Solved #759
New command SYSTEM PRESHUTDOWN.
Scenario:
We want to scale down swarm cluster. On node which we want to shutdown we call SYSTEM PRESHUTDOWN, after that node stops to accept new distributed commands. It can still processed objects which is started processed before SYSTEM PRESHUTDOWN. When all that objects successfully processed, we can kill that node without any errors or lost data in responses on initiator.

After SYSTEM PRESHUTDOWN on swarm node:

  • unregister node from autodiscovery clusters if exists
  • stop getting new tasks for objectStorageCluster-family functions (s3Cluster/icebergCluster/etc.)

On initiator node:

  • for all distributed requests with setting skip_unavailable_shards=true unexpected closing of socket is legal if no data packets were accepted before. This allow to shutdown non-autodiscovery node too.

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

{
if (getContext()->isPreShutdownCalled())
Copy link
Collaborator

@ilejn ilejn Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that operations prohibited in PreShutdown phase (e.g getting new tasks), are allowed in Shutdown phase?
If yes, is it correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand ClickHouse does not have specific shutdown phase. On SYSTEM SHUTDOWN just calls kill(0, SIGTERM). Without PRESHUTDOW this caused error on initiator as well as already taken but unfinished tasks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what is the purpose of shutdown_called flag?

From the first glance I would expect that all checks that are true for preshutdown_called should be true for shutdown_called as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, understood, flag set in destructor.
Yes, make sense to set preshutdown there too.

@ianton-ru ianton-ru force-pushed the feature/system_preshutdown branch from a842ce6 to 9642c42 Compare June 11, 2025 09:17
@arthurpassos
Copy link
Collaborator

arthurpassos commented Jun 11, 2025

As I understand, regular queries would still be processed. Literally the only thing that's stopped is the "swarm node" work. Wouldn't it make more sense to rename the command to something more meaningful? (e.g UNREGISTER FROM SWARM)

This is just a question, not a change request

@ianton-ru
Copy link
Author

ianton-ru commented Jun 12, 2025

As I understand, regular queries would still be processed. Literally the only thing that's stopped is the "swarm node" work. Wouldn't it make more sense to rename the command to something more meaningful? (e.g UNREGISTER FROM SWARM)

It's a topic to discussion.
For now I don't have other use cases when we need and can do graceful shutdown from clickhouse side (shutdown main node is not that case, because errors are on client side in code which are out our control. May be possible to add special error "New query can't be executed, node prepared to shutdown", but it is useless without support on client side). But if we have one, I think that better to have single command PRESUTDOWN to prepare shutdown instead of several commands UNREGISTER FROM SWARM, STOP DOING SOMETHING, CLOSE SOMETHING ELSE, because all this hypothetical command make sense only when called in single pack. And better to add new logic under PRESHUTDOWN command.

@Enmk Enmk added antalya-25.3.3 swarms Antalya Roadmap: Swarms labels Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants