
Enable swarm queries to continue processing queries when a node is shut down #759

@hodgesrm

Description

Is your feature request related to a problem? Please describe.
If you downsize a swarm cluster while queries are running, queries will fail due to nodes shutting down mid-query. The query routing currently cannot handle losing a node in the swarm while the query is running. You'll see an error like the following:

SELECT date, sum(output_count)
FROM ice.`aws-public-blockchain.btc`
WHERE date >= '2025-01-01' GROUP BY date ORDER BY date ASC
SETTINGS use_hive_partitioning = 1, object_storage_cluster = 'swarm',
input_format_parquet_use_metadata_cache = 1, enable_filesystem_cache = 1,
use_iceberg_metadata_files_cache = 1;
...
Received exception from server (version 25.2.2):
Code: 394. DB::Exception: Received from localhost:9000. DB::Exception: Received from chi-swarm-example-0-0-0.chi-swarm-example-0-0.antalya.svc.cluster.local:9000. DB::Exception: Query was cancelled. (QUERY_WAS_CANCELLED)

This type of problem can arise in at least 3 common cases.

  1. A swarm is being downsized and nodes are removed while the query is running. Currently the node remains registered in Keeper for a brief time after it disappears.
  2. A swarm node is running on a spot instance that is terminated by the cloud provider.
  3. A swarm node crashes during execution.

Describe the solution you'd like
Query distribution to swarm clusters should be fault-tolerant. There are several cases to consider. Here's the initiator behavior that would be desirable; a rough code sketch of this failover logic follows the list.

  1. A node is gone when the subquery is submitted, resulting in a network error. In this case the initiator node should retry the node to see if it is really gone. If so, it should remove the node from the list of nodes available for dispatch and resubmit to a new node.
  2. A node fails before it can return results, resulting in an error. Same action as above.
  3. A node fails after it has returned some results. In this case we cannot know whether the results are complete. The default should be to fail. We may also introduce a flag to ignore failures; with the flag set we would accept the results already received and remove the node from the list of available swarm nodes.
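
As a rough illustration, here is a minimal C++ sketch of that failover loop. Every name in it (SwarmNode, SubqueryOutcome, runSubqueryWithFailover, the dispatch callback, the ignore_failures flag) is invented for illustration and stands in for whatever the real query-distribution code uses; this is not the actual ClickHouse implementation.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

struct SwarmNode { std::string address; };

struct SubqueryOutcome
{
    std::vector<std::string> blocks;   // result blocks received from the node
    bool node_failed = false;          // node disappeared before finishing
};

// dispatch() sends the subquery to one node and reports what came back.
std::vector<std::string> runSubqueryWithFailover(
    std::vector<SwarmNode> candidates,  // nodes currently registered in Keeper
    const std::function<SubqueryOutcome(const SwarmNode &)> & dispatch,
    bool ignore_failures = false)       // the proposed "ignore failures" flag
{
    while (!candidates.empty())
    {
        const SwarmNode node = candidates.back();
        SubqueryOutcome outcome = dispatch(node);

        if (!outcome.node_failed)
            return outcome.blocks;                 // normal completion

        // A real implementation would retry the node once here to confirm it
        // is really gone before dropping it (case 1).
        candidates.pop_back();                     // remove from dispatch list

        if (outcome.blocks.empty())
        {
            // Cases 1 and 2: nothing received yet, so resubmit to another node.
            std::cerr << "Node " << node.address << " lost; retrying on another node\n";
            continue;
        }

        // Case 3: partial results were already streamed back.
        if (ignore_failures)
            return outcome.blocks;                 // accept what we have
        throw std::runtime_error(
            "Node " + node.address + " failed after returning partial results");
    }
    throw std::runtime_error("No healthy swarm nodes remain for this subquery");
}
```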

Swarm nodes can help by refusing new queries while they are shutting down for swarm downscaling or spot instance termination. This decreases the chances of hitting case 3.
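
A hypothetical sketch of that node-side guard, with invented names (draining, markDraining, handleSubquery): whatever shutdown hook observes the downscale request or spot-termination notice flips a flag, and the node then rejects new subqueries so the initiator fails over immediately instead of losing the node mid-stream.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Set by the shutdown hook that observes the downscale request or the cloud
// provider's spot-termination notice.
std::atomic<bool> draining{false};

void markDraining()
{
    draining.store(true);
}

std::string handleSubquery(const std::string & subquery)
{
    // Refuse new work once shutdown has started so the initiator can retry
    // elsewhere right away rather than hitting case 3 later.
    if (draining.load())
        throw std::runtime_error("Node is shutting down; rejecting new subqueries");

    // ... execute the subquery normally ...
    return "results for: " + subquery;
}
```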

Describe alternatives you've considered

"Just fail" is an alternative but a very bad one. This fault tolerance is required to permit use of spot instances as well as dynamic scaling of swarms.

We can also dynamically refresh the node list during a query if nodes disappear. This would handle the case where we are using a subset of nodes and all of them disappear due to downsizing the swarm.
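
A hypothetical sketch of that refresh step, where fetch() stands in for a lookup of the nodes currently registered in Keeper and all other names are invented:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct SwarmNode { std::string address; };

// Rebuild the candidate list mid-query, skipping nodes already seen to fail.
std::vector<SwarmNode> refreshCandidates(
    const std::function<std::vector<SwarmNode>()> & fetch,
    const std::vector<std::string> & failed_addresses)
{
    std::vector<SwarmNode> fresh;
    for (const SwarmNode & node : fetch())
    {
        bool already_failed = std::find(failed_addresses.begin(),
                                        failed_addresses.end(),
                                        node.address) != failed_addresses.end();
        if (!already_failed)
            fresh.push_back(node);   // usable replacement node
    }
    return fresh;
}
```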

Additional context
None at this time.
