[#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign #1612

dingshun3016 · 2024-04-01T14:33:53Z

What changes were proposed in this pull request?

Fix the partition ID inconsistency when reassigning a new shuffle server.

For example:
When writing data to node a1, the registered partition ID is 1003.
Node a1 fails, so node b1 is reassigned and registered as the new shuffle server, but with partitionNumPerRange set to 1.
When data is then written to node b1, a NO_REGISTER exception is thrown.
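
The failure mode in the example can be illustrated with a minimal sketch. This is a toy model, not Uniffle's real code: the class `PartitionRangeSketch` and its methods are hypothetical, and the range `[0, 0]` is only an assumed stand-in for whatever range b1 ends up registered with when partitionNumPerRange is 1. It shows only that a write for partition 1003 is rejected unless the replacement server is registered with a range covering that partition ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of one shuffle server's registered partition ranges. */
public class PartitionRangeSketch {
  // Each entry is an inclusive {start, end} pair of partition ids.
  private final List<int[]> registeredRanges = new ArrayList<>();

  /** Registers the inclusive range [start, end] on this simulated server. */
  public void register(int start, int end) {
    registeredRanges.add(new int[] {start, end});
  }

  /** Returns true if some registered range covers partitionId. */
  public boolean isRegistered(int partitionId) {
    for (int[] range : registeredRanges) {
      if (partitionId >= range[0] && partitionId <= range[1]) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    int failedPartitionId = 1003;

    // Assumed buggy case: b1 is registered with a range that does not
    // cover the failed partition, so a write would be rejected.
    PartitionRangeSketch b1Before = new PartitionRangeSketch();
    b1Before.register(0, 0);
    System.out.println(b1Before.isRegistered(failedPartitionId)); // false

    // Assumed fixed case: b1 is registered with the failed partition's own id.
    PartitionRangeSketch b1After = new PartitionRangeSketch();
    b1After.register(failedPartitionId, failedPartitionId);
    System.out.println(b1After.isRegistered(failedPartitionId)); // true
  }
}
```

A `false` result here corresponds to the situation in which the real server would answer with NO_REGISTER. How the actual patch carries the correct range to the replacement server is not modelled.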

Why are the changes needed?

Fix: (#1373)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

(Please test your changes, and provide instructions on how to test it:

If you add a feature or fix a bug, add a test to cover your changes.
If you fix a flaky test, repeat it for many times to prove it works.)

codecov-commenter · 2024-04-01T14:41:41Z

Codecov Report

Attention: Patch coverage is 0%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 54.83%. Comparing base (05ff6db) to head (0fc351c).

Files	Patch %	Lines
...request/RssReassignFaultyShuffleServerRequest.java	0.00%	8 Missing ⚠️
...fle/shuffle/manager/ShuffleManagerGrpcService.java	0.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #1612      +/-   ##
============================================
+ Coverage     53.92%   54.83%   +0.91%     
  Complexity     2864     2864              
============================================
  Files           438      418      -20     
  Lines         24912    22559    -2353     
  Branches       2123     2123              
============================================
- Hits          13433    12371    -1062     
+ Misses        10636     9416    -1220     
+ Partials        843      772      -71


github-actions · 2024-04-01T15:06:01Z

Test Results

2 363 files +14, 2 363 suites +14, 4h 30m 50s ⏱️ +9s
911 tests +2, 910 ✅ +3, 1 💤 ±0, 0 ❌ -1
10 578 runs +28, 10 564 ✅ +29, 14 💤 ±0, 0 ❌ -1

Results for commit e38d920. ± Comparison against base commit 1051d26.

♻️ This comment has been updated with latest results.

zuston · 2024-04-02T02:10:25Z

#1609 has been merged to solve similar problems. Please take a look at it first.

BTW, the detailed description should be filled in.

jerqi · 2024-04-02T02:17:16Z

proto/src/main/proto/Rss.proto

@@ -592,6 +592,8 @@ message RssReassignFaultyShuffleServerRequest{
  int32 shuffleId  = 1;
  repeated string partitionIds = 2;
  string faultyShuffleServerId = 3;
+  int32 stageId = 4;


@zuston I have some concerns about this. Will it affect exchange reuse, partial task execution of the shuffle, and so on?

From the description, I didn't see any requirements for this PR, but I guess it wants to support different reassignment servers for different stage attempts.

For me, there is currently no need to support per-stage-attempt reassignment.


zuston · 2024-04-08T05:50:27Z

I think partitionNumPerRange should always be 1, both before and after reassignment. @dingshun3016 The current partitionNumPerRange should probably be dropped. WDYT @jerqi

dingshun3016 · 2024-04-08T06:01:14Z

> I think partitionNumPerRange should always be 1, both before and after reassignment. @dingshun3016 The current partitionNumPerRange should probably be dropped. WDYT @jerqi

But the failed partitionId will not always be 1, so writing to the newly assigned server will fail.

zuston · 2024-04-08T09:16:48Z

Please fix the spotless errors, @dingshun3016

zuston · 2024-04-08T09:24:09Z

I have checked this in 761dedf, so I'll merge this. Thanks @dingshun3016

dingshun3016 · 2024-04-08T09:30:22Z

> I have checked this in 761dedf. So merge this. Thanks @dingshun3016

This implementation is great.

jerqi reviewed Apr 2, 2024


dingshun3016 changed the title ~~[#1373][FOLLOWUP] fix(spark): register shuffle server partition range error when reassign faulty shuffle server for tasks~~ [#1373][FOLLOWUP] fix(spark): fix partition id inconsistency when reassign faulty shuffle server for tasks Apr 2, 2024

shun01.ding added 2 commits April 8, 2024 11:12

[apache#1373][FOLLOWUP] fix(spark): register shuffle server partition…

24f4e36

… range error when reassign faulty shuffle server for tasks

rebase master and update code

09c63ef

dingshun3016 force-pushed the followup-1373-4 branch from 0fc351c to 09c63ef Compare April 8, 2024 03:58

fix code style

e38d920

change to another implementation

06e04ed

zuston changed the title ~~[#1373][FOLLOWUP] fix(spark): fix partition id inconsistency when reassign faulty shuffle server for tasks~~ [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign Apr 8, 2024

fix spotbugs

55a404b

zuston mentioned this pull request Apr 8, 2024

Support client partition data reassign #1608

Open

9 tasks

zuston approved these changes Apr 8, 2024


zuston merged commit 3ea3aaa into apache:master Apr 8, 2024
37 checks passed
