Skip to content

[multistage] Fix Bug in MailboxInfo Ordering#16224

Merged
itschrispeck merged 1 commit intoapache:masterfrom
ankitsultana:mse-physical-bug
Jun 30, 2025
Merged

[multistage] Fix Bug in MailboxInfo Ordering#16224
itschrispeck merged 1 commit intoapache:masterfrom
ankitsultana:mse-physical-bug

Conversation

@ankitsultana
Copy link
Contributor

@ankitsultana ankitsultana commented Jun 28, 2025

fixes #16223

I had made an assumption that the Runtime would sort the List<SendingMailbox> used for modulo arithmetic in HashExchange, so I was creating List<MailboxInfo> without regard to ordering.

But that isn't the case, as can be seen here:

List<MailboxInfo> mailboxInfos =
context.getWorkerMetadata().getMailboxInfosMap().get(receiverStageId).getMailboxInfos();
List<RoutingInfo> routingInfos =
MailboxIdUtils.toRoutingInfos(requestId, context.getStageId(), context.getWorkerId(), receiverStageId,
mailboxInfos);
List<SendingMailbox> sendingMailboxes = routingInfos.stream()
.map(v -> mailboxService.getSendingMailbox(v.getHostname(), v.getPort(), v.getMailboxId(), deadlineMs, statMap))
.collect(Collectors.toList());
statMap.merge(StatKey.FAN_OUT, sendingMailboxes.size());
return BlockExchange.getExchange(sendingMailboxes, distributionType, keys, splitter);

While this PR should fix the flaky test and also the bug with v2 optimizer, I want to revisit this when I am implementing parallelism.

Test Plan: I built a custom test to reproduce the error with very high probability, but I can't really check that code in. So skipping UTs. But this patch has been manually verified with that custom test and the test had passed on 10 consecutive runs.

Custom Patch Details

To reproduce, I took one of the failing queries from the flaky test. The symptom was that we were getting 0 records but expectation was non-zero records. I found that after the join operator we had zero records, which meant that records were not assigned correctly to the streams. Digging deeper, I found that the issue was easy to reproduce if the two workers on left and right were assigned in reverse order of each other.

in ResourceBasedQueriesTest.java
+
+  @Test
+  public void testQueryTestCasesWithNewOptimizerWithOutputCustom()
+      throws Exception {
+    String expect = null;
+    List<Object[]> expectedRows = new ArrayList<>();
+    expectedRows.add(new Object[]{"a", 2.5});
+    expectedRows.add(new Object[]{"b", -98460.5});
+    String sql = "WITH agg1 AS ( SELECT strCol, sum(intCol) AS sumVal FROM with_statement_tes
ts_tbl1 GROUP BY strCol), agg2 AS (SELECT strCol1, avg(intCol) AS avgVal FROM with_statement_t
ests_tbl2 GROUP BY strCol1) SELECT strCol, sumVal - avgVal FROM agg1, agg2 WHERE agg1.strCol =
 agg2.strCol1";
+    final String finalSql = String.format("SET usePhysicalOptimizer=true; %s", sql);
+    runQuery(finalSql, expect, false).ifPresent(queryResult -> {
+      compareRowEqualsAndReportErrors(queryResult, sql, expectedRows, false);
+    });
in LeafStageWorkerAssignmentRule.java
     if (tableName.contains("tbl2")) {
      Map<String, Map<String, List<String>>> ff = instanceIdToTableTypeToSegmentsMap;
      instanceIdToTableTypeToSegmentsMap = new TreeMap<>(ff);
    } else {
      Map<String, Map<String, List<String>>> ff = instanceIdToTableTypeToSegmentsMap;
      Map<String, Map<String, List<String>>> nn = new TreeMap<>(Collections.reverseOrder());
      nn.putAll(ff);
      instanceIdToTableTypeToSegmentsMap = nn;
    }

@songwdfu
Copy link
Contributor

should be issue #16223 ?

@ankitsultana
Copy link
Contributor Author

should be issue #16223 ?

ah thank you. my bad. fixed it

@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.19%. Comparing base (1a476de) to head (81f27a0).
Report is 361 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16224      +/-   ##
============================================
+ Coverage     62.90%   63.19%   +0.29%     
+ Complexity     1386     1365      -21     
============================================
  Files          2867     2959      +92     
  Lines        163354   170578    +7224     
  Branches      24952    26096    +1144     
============================================
+ Hits         102755   107803    +5048     
- Misses        52847    54613    +1766     
- Partials       7752     8162     +410     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.18% <100.00%> (+0.31%) ⬆️
java-21 63.16% <100.00%> (+0.34%) ⬆️
skip-bytebuffers-false ?
skip-bytebuffers-true ?
temurin 63.19% <100.00%> (+0.29%) ⬆️
unittests 63.19% <100.00%> (+0.29%) ⬆️
unittests1 64.72% <100.00%> (+8.90%) ⬆️
unittests2 33.40% <0.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@itschrispeck itschrispeck merged commit 7c3d24e into apache:master Jun 30, 2025
35 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[flaky-test] ResourceBasedQueriesTest.testQueryTestCasesWithNewOptimizerWithOutput

4 participants