
[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage #1609

Merged (3 commits) on Apr 2, 2024

Conversation

@zuston (Member) commented Mar 29, 2024

What changes were proposed in this pull request?

  1. Lock the shuffleHandle to ensure thread safety when reassigning partial servers for tasks.
  2. Only share the replacement servers for faulty servers within one stage, rather than across the whole app (a sketch follows this list).
  3. Simplify the reassignment logic: the single replacement server will be supported in the future, so remove it for now.
  4. Correct the partitionIds type from string to int in the proto definition.
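
As a minimal sketch of items 1 and 2, the idea is to cache the replacement servers per faulty server on the per-stage handle and look them up under a lock, so concurrent tasks of the same stage reuse one replacement set while other stages never see it. The class and method names below are assumptions for illustration, not the actual `ShuffleHandleInfo` code:

```java
// Hypothetical sketch, assuming a per-stage handle object; not the real ShuffleHandleInfo.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class StageReplacementTracker {
  // faulty server id -> replacement servers, scoped to one shuffle handle (one stage)
  private final Map<String, Set<String>> faultyToReplacements = new HashMap<>();

  // Synchronized so concurrent tasks of the same stage reuse one replacement set
  // instead of each requesting new servers for the same faulty server.
  synchronized Set<String> getOrRequestReplacements(
      String faultyServerId, Function<String, Set<String>> requestReplacements) {
    return faultyToReplacements.computeIfAbsent(faultyServerId, requestReplacements);
  }
}
```

Because the map lives on the per-stage handle rather than on an app-level singleton, different stages do not share each other's replacements.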

Why are the changes needed?

Fix: #1608

In the current implementation of partition reassignment, the same reassignment servers are shared across different stages, which crashes when the app has not registered with those servers.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UTs


github-actions bot commented Mar 29, 2024

Test Results

 2 349 files  +9   2 349 suites  +9   4h 29m 23s ⏱️ + 1m 0s
   909 tests +1     908 ✅ +1   1 💤 ±0  0 ❌ ±0 
10 550 runs  +9  10 536 ✅ +9  14 💤 ±0  0 ❌ ±0 

Results for commit 8d13ffb. ± Comparison against base commit cbf4f6f.


@zuston changed the title from "[#1608] fix(spark): re-assign only once for the same faulty server in one stage" to "[#1608] fix(spark): Only share the replacement servers for faulty servers in one stage" on Mar 29, 2024
@zuston requested a review from jerqi on Mar 29, 2024 06:05
@codecov-commenter commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 64.70588%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 54.91%. Comparing base (cbf4f6f) to head (8d13ffb).

Files Patch % Lines
...va/org/apache/spark/shuffle/ShuffleHandleInfo.java 68.75% 4 Missing and 1 partial ⚠️
...request/RssReassignFaultyShuffleServerRequest.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1609      +/-   ##
============================================
+ Coverage     53.92%   54.91%   +0.99%     
- Complexity     2864     2873       +9     
============================================
  Files           438      418      -20     
  Lines         24912    22566    -2346     
  Branches       2123     2126       +3     
============================================
- Hits          13433    12393    -1040     
+ Misses        10636     9400    -1236     
+ Partials        843      773      -70     


@@ -590,7 +590,7 @@ message ReassignServersReponse{

 message RssReassignFaultyShuffleServerRequest{
   int32 shuffleId = 1;
-  repeated string partitionIds = 2;
+  repeated int32 partitionIds = 2;
Contributor:

Is it enough for Spark, MR and Tez?

Member Author:

Spark is enough

Contributor:

cc @zhengchenyu Could you confirm this point?

@zuston (Member Author) commented Mar 29, 2024

PTAL @xumanbu

@xumanbu (Contributor) commented Mar 29, 2024

> PTAL @xumanbu

I have reviewed. LGTM.

@zuston changed the title from "[#1608] fix(spark): Only share the replacement servers for faulty servers in one stage" to "[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage" on Mar 29, 2024
@zuston requested a review from jerqi on Mar 29, 2024 10:06
@zuston (Member Author) commented Apr 1, 2024

If there are no further comments, could you help merge this, @jerqi? It blocks the #1610 development.

@jerqi (Contributor) commented Apr 1, 2024

Don't hurry. Let me take another look.

@zuston (Member Author) commented Apr 1, 2024

> Don't hurry. Let me take another look.

Gentle ping

@jerqi (Contributor) left a comment:

LGTM, thanks @zuston

@zuston merged commit 1051d26 into apache:master on Apr 2, 2024
41 checks passed
@zuston (Member Author) commented Apr 2, 2024

Thanks @xumanbu @jerqi for your review.

zuston added a commit that referenced this pull request May 9, 2024

### What changes were proposed in this pull request?

1. Make the write client always use the latest available assignment for subsequent writes when block reassignment happens.
2. Support multiple retries for partition reassignment (a sketch follows this list).
3. Limit the maximum number of reassigned servers for one partition.
4. Refactor the reassign RPC.
5. Rename faultyServer -> receivingFailureServer.
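
As an illustration of points 2 and 3, here is a minimal sketch of bounding reassignment attempts per partition. The class name, field names, and bookkeeping are assumptions for this example, not the actual Uniffle implementation:

```java
// Hypothetical sketch; the real limit and retry bookkeeping live elsewhere in Uniffle.
import java.util.HashMap;
import java.util.Map;

class PartitionReassignLimiter {
  // Assumed upper bound on how many replacement servers one partition may receive.
  private final int maxReassignServerNumPerPartition;
  // partitionId -> number of reassignments already performed
  private final Map<Integer, Integer> reassignCounts = new HashMap<>();

  PartitionReassignLimiter(int maxReassignServerNumPerPartition) {
    this.maxReassignServerNumPerPartition = maxReassignServerNumPerPartition;
  }

  // Returns true if this partition may be reassigned one more time (another retry).
  synchronized boolean tryAcquire(int partitionId) {
    int count = reassignCounts.getOrDefault(partitionId, 0);
    if (count >= maxReassignServerNumPerPartition) {
      return false;
    }
    reassignCounts.put(partitionId, count + 1);
    return true;
  }
}
```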

#### Whole reassignment process
![image](https://github.com/apache/incubator-uniffle/assets/8609142/8afa5386-be39-4ccb-9c10-95ffb3154939)

#### Always using the latest assignment

To achieve always using the latest assignment, I introduce `TaskAttemptAssignment` to get the latest assignment for the current task. The creation of an `AddBlockEvent` will also apply the latest assignment via `TaskAttemptAssignment`.

It is updated by the `reassignOnBlockSendFailure` RPC. That means the original reassign RPC response is refactored and replaced by the whole latest `shuffleHandleInfo`.
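
A minimal sketch of that idea follows, assuming a simple partition-to-servers map; the class, field, and method names are illustrative and not the actual `TaskAttemptAssignment` code:

```java
// Hypothetical sketch of the "always use the latest assignment" idea.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TaskAttemptAssignmentSketch {
  // partitionId -> currently assigned shuffle servers for this task attempt
  private volatile Map<Integer, List<String>> partitionToServers = new ConcurrentHashMap<>();

  // Called while building an AddBlockEvent: always reads the freshest assignment.
  List<String> retrieve(int partitionId) {
    return partitionToServers.getOrDefault(partitionId, List.of());
  }

  // Called after the reassignOnBlockSendFailure RPC returns the whole latest
  // handle info: swap the new assignment in atomically.
  void update(Map<Integer, List<String>> latestAssignment) {
    this.partitionToServers = new ConcurrentHashMap<>(latestAssignment);
  }
}
```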

### Why are the changes needed?

This PR is a subtask of #1608.

Leveraging #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters server failures or unhealthy servers. But this is not good enough, because the faulty server state is not shared with not-yet-started tasks and later `AddBlockEvent`s.

### Does this PR introduce _any_ user-facing change?

Yes. 

### How was this patch tested?

Unit and integration tests.

Integration tests are as follows:
1. `PartitionBlockDataReassignBasicTest` validates the basic reassignment mechanism.
2. `PartitionBlockDataReassignMultiTimesTest` tests partition reassignment with multiple retries.

---------

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Successfully merging this pull request may close these issues: Support client partition data reassign.