
Restart cluster tasks on connection lost #780


Open
wants to merge 1 commit into base: antalya-25.3

Conversation


@ianton-ru ianton-ru commented May 14, 2025

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Restart loading of objects on other nodes when one node goes down during a cluster request.

Documentation entry for user-facing changes

When a swarm node takes some tasks to execute and then goes down, and no data from that node has been processed yet, its tasks can be returned to the common queue to be executed on another node, if any is still alive.
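The rescheduling idea described above can be sketched roughly as follows. This is a simplified illustration, not the actual ClickHouse code: `TaskDistributorSketch` and its members are hypothetical names standing in for the real `StorageObjectStorageStableTaskDistributor` state.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sketch: files handed to a replica are tracked per replica;
// if the replica is lost before the initiator received any of its data,
// the replica's files go back into the shared unprocessed set.
struct TaskDistributorSketch
{
    std::unordered_set<std::string> unprocessed_files;
    std::unordered_map<size_t, std::list<std::string>> files_in_flight; // replica -> files

    void assign(size_t replica, const std::string & file)
    {
        unprocessed_files.erase(file);
        files_in_flight[replica].push_back(file);
    }

    void onReplicaLost(size_t replica)
    {
        auto it = files_in_flight.find(replica);
        if (it == files_in_flight.end())
            return;
        // Return every in-flight file of the lost replica to the common queue.
        for (const auto & file : it->second)
            unprocessed_files.insert(file);
        files_in_flight.erase(it);
    }
};
```

Other replicas would then pick those files up through the normal unprocessed-file lookup, as in the existing `getAnyUnprocessedFile` path.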

Did not manage to write an automated test; tested manually by adding a sleep in the code so a replica could be killed within that time slot.

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

@ianton-ru ianton-ru force-pushed the feature/retries_in_cluster_functions branch 2 times, most recently from 69cce89 to fe4eee1 Compare May 19, 2025 13:23
@ianton-ru ianton-ru marked this pull request as ready for review May 21, 2025 18:01
Collaborator

@arthurpassos arthurpassos left a comment


First round

@@ -179,4 +199,28 @@ std::optional<String> StorageObjectStorageStableTaskDistributor::getAnyUnprocess
return std::nullopt;
}

void StorageObjectStorageStableTaskDistributor::rerunTasksForReplica(size_t number_of_current_replica)
Collaborator

If I understand correctly, this method re-schedules the tasks FROM a given replica to another one. Is that correct?

If that's so, I would name it differently:

rescheduleTasksFromReplica

Author

Yes, it adds the replica's tasks back to the unprocessed list so that other replicas can take them.

if (processed_file_list_ptr == processed_files.end())
throw Exception(
ErrorCodes::LOGICAL_ERROR,
"Replica number {} was marked as lost, can't set satk for it anymore",
Collaborator

typo

@@ -34,6 +39,7 @@ class StorageObjectStorageStableTaskDistributor
std::unordered_set<String> unprocessed_files;

std::vector<std::string> ids_of_nodes;
std::unordered_map<size_t, std::list<String>> processed_files;
Collaborator

replica_to_files_to_be_processed_map or something like this.

imo

  1. The name should mention that it is a mapping from replica to a list of files
  2. It can't be "processed" (past tense) because the files haven't been processed yet, as far as I understand

read_context.packet = read_context.executor.getConnections().receivePacketUnlocked(async_callback);
read_context.has_read_packet_part = PacketPart::Body;
}
catch (const Exception &)
Collaborator

Shouldn't you catch only network-related exceptions? Or maybe the question is: is there a non-connection-loss related exception that could be thrown?

Author

@ianton-ru ianton-ru May 22, 2025

Only network exceptions, maybe.
For others I'm not sure. If a replica sent an exception because its data is corrupted (which can be an unpredictable error), another replica with the same data would get the same error. It may be possible to reschedule a task for some specific exceptions, but we would need to know those specific cases, and I don't know them right now. If we find something, we can add it later.

Collaborator

@arthurpassos arthurpassos left a comment


A few questions that come to mind right now:

  1. Does a replica ever get reconnected after being marked as lost?
  2. As far as I understand, once a replica is assigned a file, it'll process that file and return the result before picking the next file. Therefore, StorageObjectStorageStableTaskDistributor::processed_files should be updated, shouldn't it?

@@ -8,6 +8,8 @@
#include <Interpreters/StorageID.h>
#include <sys/types.h>

#include <list>
Collaborator

Do you need this header here?

Author

Forgot to remove it, thanks.

@ianton-ru
Author

  1. Does a replica ever get reconnected after being marked as lost?
    Not within this query. The problem exists only when a replica has already taken some files to process and goes down before all the data is processed. We can reschedule a task only when the initiator has not received any part of the data, because received parts are merged into the response, and it is impossible to "unmerge" them to process the whole file again; we also have no information to process only the unprocessed part of a file on another node.
  2. As far as I understand, once a replica is assigned a file, it'll process that file and return the result before picking the next file. Therefore, StorageObjectStorageStableTaskDistributor::processed_files should be updated, shouldn't it?
    Each replica takes several files (4 by default) to process in parallel. And partial responses carry no information about which file they came from. So it is impossible to separate the data of completed files from the data of non-completed ones, and once the initiator has received something, it is impossible to reschedule the tasks.

@ianton-ru
Author

Can be rebased on the antalya-25.3 branch only after #797 is merged; the feature depends on rendezvous hashing (uses the same class).

@svb-alt svb-alt linked an issue May 29, 2025 that may be closed by this pull request
@ianton-ru ianton-ru force-pushed the feature/retries_in_cluster_functions branch from d381de9 to c3afad3 Compare June 5, 2025 09:44
@ianton-ru ianton-ru changed the base branch from antalya to antalya-25.3 June 5, 2025 09:44
@ianton-ru
Author

Fixed after review and rebased

@Enmk Enmk added antalya-25.3.3 swarms Antalya Roadmap: Swarms labels Jun 19, 2025

Successfully merging this pull request may close these issues.

Retries in cluster requests
4 participants