Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign #1612

Merged
merged 5 commits into from
Apr 8, 2024

Conversation

dingshun3016
Copy link
Contributor

@dingshun3016 dingshun3016 commented Apr 1, 2024

What changes were proposed in this pull request?

fix partition id inconsistency when reassign new shuffle server

For example:
when writing data on node a1, the registered partition id is 1003.
a1 node fails,and reassign node b1 and register shuffle server b1,but partitionNumPerRange is 1.
when writing data to node b1, NO_REGISTER exception will be thrown

Why are the changes needed?

Fix: (#1373)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

(Please test your changes, and provide instructions on how to test it:

  1. If you add a feature or fix a bug, add a test to cover your changes.
  2. If you fix a flaky test, repeat it for many times to prove it works.)

@codecov-commenter
Copy link

codecov-commenter commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 0% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 54.83%. Comparing base (05ff6db) to head (0fc351c).

Files Patch % Lines
...request/RssReassignFaultyShuffleServerRequest.java 0.00% 8 Missing ⚠️
...fle/shuffle/manager/ShuffleManagerGrpcService.java 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1612      +/-   ##
============================================
+ Coverage     53.92%   54.83%   +0.91%     
  Complexity     2864     2864              
============================================
  Files           438      418      -20     
  Lines         24912    22559    -2353     
  Branches       2123     2123              
============================================
- Hits          13433    12371    -1062     
+ Misses        10636     9416    -1220     
+ Partials        843      772      -71     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

github-actions bot commented Apr 1, 2024

Test Results

 2 363 files  +14   2 363 suites  +14   4h 30m 50s ⏱️ +9s
   911 tests + 2     910 ✅ + 3   1 💤 ±0  0 ❌  - 1 
10 578 runs  +28  10 564 ✅ +29  14 💤 ±0  0 ❌  - 1 

Results for commit e38d920. ± Comparison against base commit 1051d26.

♻️ This comment has been updated with latest results.

@zuston
Copy link
Member

zuston commented Apr 2, 2024

#1609 has been merged to solve the similar problems. please see it in advance.

BTW, the detailed description should be filled.

@@ -592,6 +592,8 @@ message RssReassignFaultyShuffleServerRequest{
int32 shuffleId = 1;
repeated string partitionIds = 2;
string faultyShuffleServerId = 3;
int32 stageId = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zuston I have some concern about this. Will it affect the reuse of exchange, partial tasks execution of the shuffle and so on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the description, I didn't see any requirements for this PR, but I guess it wants to support different reassginment servers for different stage attempt.

For me, currently there is no need to support above stage attempt

@dingshun3016 dingshun3016 changed the title [#1373][FOLLOWUP] fix(spark): register shuffle server partition range error when reassign faulty shuffle server for tasks [#1373][FOLLOWUP] fix(spark): fix partition id inconsistency when reassign faulty shuffle server for tasks Apr 2, 2024
shun01.ding added 2 commits April 8, 2024 11:12
@zuston
Copy link
Member

zuston commented Apr 8, 2024

I think the partitionNumPerRange should always be 1 whenever before or after reassignment. @dingshun3016 The current partitionNumPerRange may should be dropped. WDYT @jerqi

@dingshun3016
Copy link
Contributor Author

I think the partitionNumPerRange should always be 1 whenever before or after reassignment. @dingshun3016 The current partitionNumPerRange may should be dropped. WDYT @jerqi

but the failed partitionId will not always be 1, and writing to the new assigned server will fail.

@zuston
Copy link
Member

zuston commented Apr 8, 2024

Please fix the spotless @dingshun3016

@zuston zuston changed the title [#1373][FOLLOWUP] fix(spark): fix partition id inconsistency when reassign faulty shuffle server for tasks [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign Apr 8, 2024
@zuston
Copy link
Member

zuston commented Apr 8, 2024

I have checked this in 761dedf. So merge this. Thanks @dingshun3016

@zuston zuston merged commit 3ea3aaa into apache:master Apr 8, 2024
37 checks passed
@dingshun3016
Copy link
Contributor Author

I have checked this in 761dedf. So merge this. Thanks @dingshun3016

This implementation is great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants