Rewrite FailureDetector interface and implementations to also work with the multi-stage engine by yashmayya · Pull Request #15005 · apache/pinot

yashmayya · 2025-02-06T14:29:10Z

The connection based failure detector was added in Add connection based FailureDetector #8491 and doesn't currently support the multi-stage engine.
This patch rewrites the FailureDetector interface and implementations so that it works with both the engines.
If a multi-stage query fails to be dispatched to a server due to connectivity issues, the broker should remove the server from its routing table to prevent further query failures.
A server will be removed from the broker routing table in case of failures in either QueryRouter (v1) or QueryDispatcher (v2). In the FailureDetector implementation that retries the connection with an exponential backoff, we'll only re-add a server to the routing table if both the connections succeed - the Netty channel used for v1 queries as well as the gRPC channel used for v2 queries.
The FailureDetector and listener based interface has been rewritten significantly in order to avoid multiple attempts to modify the broker routing table whenever a server is detected as healthy / unhealthy.
This patch also updated the single-stage engine's GrpcBrokerRequestHandler to make use of the FailureDetector.

codecov-commenter · 2025-02-06T15:09:33Z

Codecov Report

Attention: Patch coverage is 50.00000% with 51 lines in your changes missing coverage. Please review.

Project coverage is 63.31%. Comparing base (59551e4) to head (439cd37).
Report is 1709 commits behind head on master.

Files with missing lines	Patch %	Lines
...roker/requesthandler/GrpcBrokerRequestHandler.java	0.00%	28 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java	57.89%	8 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java	40.00%	6 Missing ⚠️
...thandler/SingleConnectionBrokerRequestHandler.java	28.57%	5 Missing ⚠️
...e/pinot/broker/broker/helix/BaseBrokerStarter.java	71.42%	2 Missing ⚠️
...che/pinot/broker/routing/BrokerRoutingManager.java	0.00%	1 Missing ⚠️
...pache/pinot/common/utils/grpc/GrpcQueryClient.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #15005      +/-   ##
============================================
+ Coverage     61.75%   63.31%   +1.56%     
- Complexity      207     1483    +1276     
============================================
  Files          2436     2731     +295     
  Lines        133233   153622   +20389     
  Branches      20636    23730    +3094     
============================================
+ Hits          82274    97267   +14993     
- Misses        44911    49002    +4091     
- Partials       6048     7353    +1305

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.29% <50.00%> (+1.58%)`	⬆️
java-21	`63.21% <50.00%> (+1.58%)`	⬆️
skip-bytebuffers-false	`63.31% <50.00%> (+1.56%)`	⬆️
skip-bytebuffers-true	`63.18% <50.00%> (+35.45%)`	⬆️
temurin	`63.31% <50.00%> (+1.56%)`	⬆️
unittests	`63.31% <50.00%> (+1.56%)`	⬆️
unittests1	`56.07% <81.63%> (+9.18%)`	⬆️
unittests2	`33.73% <17.64%> (+6.00%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

yashmayya · 2025-02-06T17:06:02Z

...-common/src/main/java/org/apache/pinot/common/failuredetector/ConnectionFailureDetector.java

+ * <p>
+ * This class doesn't currently implement any additional logic over BaseExponentialBackoffRetryFailureDetector and is
+ * retained for backward compatibility.


On second thought we could probably remove this class altogether since the class name is not exposed via the user configuration directly.

…th the multi-stage engine

…port

gortiz · 2025-02-13T08:48:42Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/GrpcBrokerRequestHandler.java

+  /**
+   * Check if a server that was previously detected as unhealthy is now healthy.
+   */
+  public boolean retryUnhealthyServer(String instanceId) {


why public?

Oops, my bad, left over from a previous iteration with a different rewrite implementation. I've update it to private.

gortiz · 2025-02-13T09:08:15Z

...java/org/apache/pinot/common/failuredetector/BaseExponentialBackoffRetryFailureDetector.java

+          for (Function<String, Boolean> unhealthyServerRetrier : _unhealthyServerRetriers) {
+            if (!unhealthyServerRetrier.apply(instanceId)) {
+              recovered = false;


I was just going to comment about this section when it was written the other way around, but it looks like you just changed it!

Yeah with the previous implementation, it was possible for multiple listeners to mark the same server as healthy on a call to retryUnhealthyServer. I've updated the implementation so that the failure detector itself marks the server as healthy if all the listeners / retriers report that the server is healthy again (to be safe).

...ker/src/main/java/org/apache/pinot/broker/requesthandler/MultiStageBrokerRequestHandler.java

…th the multi-stage engine (apache#15005)

yashmayya added enhancement multi-stage Related to the multi-stage query engine labels Feb 6, 2025

yashmayya requested a review from Jackie-Jiang February 6, 2025 14:29

yashmayya commented Feb 6, 2025

View reviewed changes

yashmayya force-pushed the failure-detector-msqe branch 2 times, most recently from 67f10b0 to e8c1cfa Compare February 7, 2025 02:50

yashmayya marked this pull request as ready for review February 7, 2025 03:55

gortiz self-requested a review February 13, 2025 07:44

Rewrite FailureDetector interface and implementations to also work wi…

fb941c5

…th the multi-stage engine

yashmayya force-pushed the failure-detector-msqe branch from e8c1cfa to 46616f0 Compare February 13, 2025 08:54

Fix issue where same instanceId comes back with different hostname / …

a8aabd3

…port

yashmayya force-pushed the failure-detector-msqe branch from 46616f0 to a8aabd3 Compare February 13, 2025 08:57

gortiz reviewed Feb 13, 2025

View reviewed changes

gortiz approved these changes Feb 13, 2025

View reviewed changes

Make GrpcBrokerRequestHandler::retryUnhealthyServer private

4bbe70c

vrajat approved these changes Feb 13, 2025

View reviewed changes

...ker/src/main/java/org/apache/pinot/broker/requesthandler/MultiStageBrokerRequestHandler.java Outdated Show resolved Hide resolved

Re-use brokerId from parent class

439cd37

yashmayya merged commit f65f845 into apache:master Feb 14, 2025
21 checks passed

yashmayya added a commit to yashmayya/pinot that referenced this pull request Feb 18, 2025

Rewrite FailureDetector interface and implementations to also work wi…

e53db02

…th the multi-stage engine (apache#15005)

This was referenced Feb 18, 2025

Improve FailureDetector logic in QueryDispatcher for MSQE #15084

Merged

Update FailureDetector recovery logic to not break if only one query engine is being used #15086

Merged

zeronerdzerogeekzerocool pushed a commit to zeronerdzerogeekzerocool/pinot that referenced this pull request Feb 20, 2025

Rewrite FailureDetector interface and implementations to also work wi…

f8a3955

…th the multi-stage engine (apache#15005)

yashmayya mentioned this pull request Feb 24, 2025

Fix stale host / port bug for failure detector #15116

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite FailureDetector interface and implementations to also work with the multi-stage engine#15005

Rewrite FailureDetector interface and implementations to also work with the multi-stage engine#15005
yashmayya merged 4 commits intoapache:masterfrom
yashmayya:failure-detector-msqe

yashmayya commented Feb 6, 2025

Uh oh!

codecov-commenter commented Feb 6, 2025 •

edited

Loading

Uh oh!

yashmayya Feb 6, 2025

Uh oh!

gortiz Feb 13, 2025

Uh oh!

yashmayya Feb 13, 2025

Uh oh!

gortiz Feb 13, 2025

Uh oh!

yashmayya Feb 13, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yashmayya commented Feb 6, 2025

Uh oh!

codecov-commenter commented Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

yashmayya Feb 6, 2025

Choose a reason for hiding this comment

Uh oh!

gortiz Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

yashmayya Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

gortiz Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

yashmayya Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Feb 6, 2025 •

edited

Loading