
Enable swarm queries to continue processing queries when a node is shut down #759

@hodgesrm

Description

Is your feature request related to a problem? Please describe.
If you downsize a swarm cluster while queries are running, queries will fail due to nodes shutting down mid-query. The query routing currently cannot handle losing a node in the swarm while the query is running. You'll see an error like the following:

SELECT date, sum(output_count)
FROM ice.`aws-public-blockchain.btc`
WHERE date >= '2025-01-01' GROUP BY date ORDER BY date ASC
SETTINGS use_hive_partitioning = 1, object_storage_cluster = 'swarm',
input_format_parquet_use_metadata_cache = 1, enable_filesystem_cache = 1,
use_iceberg_metadata_files_cache = 1;
...
Received exception from server (version 25.2.2):
Code: 394. DB::Exception: Received from localhost:9000. DB::Exception: Received from chi-swarm-example-0-0-0.chi-swarm-example-0-0.antalya.svc.cluster.local:9000. DB::Exception: Query was cancelled. (QUERY_WAS_CANCELLED)

This type of problem can arise in at least 3 common cases.

  1. A swarm is being downsized and nodes are removed while the query is running. Currently the node remains registered in Keeper for a brief time after it disappears.
  2. A swarm node is running on a spot instance that is terminated by the cloud provider.
  3. A swarm node crashes during execution.

Describe the solution you'd like
Query distribution to swarm clusters should be fault-tolerant. There are several cases to consider. Here's the initiator behavior that would be desirable; a rough code sketch of this failover logic follows the list.

  1. A node is gone when the subquery is submitted, resulting in a network error. In this case the initiator node should retry the node to see if it is really gone. If so, it should remove the node from the list of nodes available for dispatch and resubmit to a new node.
  2. A node fails before it can return results, resulting in an error. Same action as above.
  3. A node fails after it has returned some results. In this case we cannot know whether the results are complete. The default should be to fail. We may also introduce a flag to ignore failures; with the flag set we would accept the results already received and remove the node from the list of available swarm nodes.
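
As a rough illustration, here is a minimal C++ sketch of that failover loop. Every name in it (SwarmNode, SubqueryOutcome, runSubqueryWithFailover, the dispatch callback, the ignore_failures flag) is invented for illustration and stands in for whatever the real query-distribution code uses; this is not the actual ClickHouse implementation.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

struct SwarmNode { std::string address; };

struct SubqueryOutcome
{
    std::vector<std::string> blocks;   // result blocks received from the node
    bool node_failed = false;          // node disappeared before finishing
};

// dispatch() sends the subquery to one node and reports what came back.
std::vector<std::string> runSubqueryWithFailover(
    std::vector<SwarmNode> candidates,  // nodes currently registered in Keeper
    const std::function<SubqueryOutcome(const SwarmNode &)> & dispatch,
    bool ignore_failures = false)       // the proposed "ignore failures" flag
{
    while (!candidates.empty())
    {
        const SwarmNode node = candidates.back();
        SubqueryOutcome outcome = dispatch(node);

        if (!outcome.node_failed)
            return outcome.blocks;                 // normal completion

        // A real implementation would retry the node once here to confirm it
        // is really gone before dropping it (case 1).
        candidates.pop_back();                     // remove from dispatch list

        if (outcome.blocks.empty())
        {
            // Cases 1 and 2: nothing received yet, so resubmit to another node.
            std::cerr << "Node " << node.address << " lost; retrying on another node\n";
            continue;
        }

        // Case 3: partial results were already streamed back.
        if (ignore_failures)
            return outcome.blocks;                 // accept what we have
        throw std::runtime_error(
            "Node " + node.address + " failed after returning partial results");
    }
    throw std::runtime_error("No healthy swarm nodes remain for this subquery");
}
```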

Swarm nodes can help by refusing new queries while they are shutting down for swarm downscaling or spot instance termination. This decreases the chances of hitting case 3.
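
A hypothetical sketch of that node-side guard, with invented names (draining, markDraining, handleSubquery): whatever shutdown hook observes the downscale request or spot-termination notice flips a flag, and the node then rejects new subqueries so the initiator fails over immediately instead of losing the node mid-stream.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Set by the shutdown hook that observes the downscale request or the cloud
// provider's spot-termination notice.
std::atomic<bool> draining{false};

void markDraining()
{
    draining.store(true);
}

std::string handleSubquery(const std::string & subquery)
{
    // Refuse new work once shutdown has started so the initiator can retry
    // elsewhere right away rather than hitting case 3 later.
    if (draining.load())
        throw std::runtime_error("Node is shutting down; rejecting new subqueries");

    // ... execute the subquery normally ...
    return "results for: " + subquery;
}
```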

Describe alternatives you've considered

"Just fail" is an alternative but a very bad one. This fault tolerance is required to permit use of spot instances as well as dynamic scaling of swarms.

We can also dynamically refresh the node list during a query if nodes disappear. This would handle the case where we are using a subset of nodes and all of them disappear due to downsizing the swarm.
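
A hypothetical sketch of that refresh step, where fetch() stands in for a lookup of the nodes currently registered in Keeper and all other names are invented:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct SwarmNode { std::string address; };

// Rebuild the candidate list mid-query, skipping nodes already seen to fail.
std::vector<SwarmNode> refreshCandidates(
    const std::function<std::vector<SwarmNode>()> & fetch,
    const std::vector<std::string> & failed_addresses)
{
    std::vector<SwarmNode> fresh;
    for (const SwarmNode & node : fetch())
    {
        bool already_failed = std::find(failed_addresses.begin(),
                                        failed_addresses.end(),
                                        node.address) != failed_addresses.end();
        if (!already_failed)
            fresh.push_back(node);   // usable replacement node
    }
    return fresh;
}
```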

Additional context
None at this time.
