[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage #1609
Conversation
Codecov Report

```
@@             Coverage Diff             @@
##             master    #1609      +/-  ##
============================================
+ Coverage     53.92%   54.91%    +0.99%
- Complexity     2864     2873        +9
============================================
  Files           438      418       -20
  Lines         24912    22566     -2346
  Branches       2123     2126        +3
============================================
- Hits          13433    12393     -1040
+ Misses        10636     9400     -1236
+ Partials        843      773       -70
```
```diff
@@ -590,7 +590,7 @@ message ReassignServersReponse{

 message RssReassignFaultyShuffleServerRequest{
   int32 shuffleId = 1;
-  repeated string partitionIds = 2;
+  repeated int32 partitionIds = 2;
```
Is it enough for Spark, MR and Tez?
Spark is enough
cc @zhengchenyu Could you ensure this point?
PTAL @xumanbu

I have reviewed. LGTM.

Don't hurry. Let me take another look.
Gentle ping |
LGTM, thanks @zuston
### What changes were proposed in this pull request?

1. Make the write client always use the latest available assignment for subsequent writes when block reassignment happens.
2. Support multiple retries for partition reassignment.
3. Limit the max number of reassigned servers for one partition.
4. Refactor the reassign RPC.
5. Rename `faultyServer` to `receivingFailureServer`.

#### Reassign whole process

![image](https://github.com/apache/incubator-uniffle/assets/8609142/8afa5386-be39-4ccb-9c10-95ffb3154939)

#### Always using the latest assignment

To achieve always using the latest assignment, this introduces `TaskAttemptAssignment` to get the latest assignment for the current task. The creation of an `AddBlockEvent` also applies the latest assignment via `TaskAttemptAssignment`, which in turn is updated by the `reassignOnBlockSendFailure` RPC. That means the original reassign RPC response is refactored and replaced by the whole latest `shuffleHandleInfo`.

### Why are the changes needed?

This PR is a subtask of #1608. Leveraging #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters server failure or unhealthiness. But this is not good enough: it does not share the faulty server state with unstarted tasks and later `AddBlockEvent`s.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Unit and integration tests. Integration tests as follows:

1. `PartitionBlockDataReassignBasicTest` validates the reassign mechanism.
2. `PartitionBlockDataReassignMultiTimesTest` tests partition reassignment with multiple retries.

---------

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
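The "always use the latest assignment" idea above can be sketched as follows. `TaskAttemptAssignment` and `reassignOnBlockSendFailure` are names taken from the PR description; everything else (plain string server IDs, the field and method shapes) is a simplified assumption, not Uniffle's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch: a per-task view of the latest partition -> servers
// assignment. Real Uniffle types are replaced with plain strings here.
class TaskAttemptAssignment {
    // partitionId -> currently assigned shuffle servers
    private volatile Map<Integer, List<String>> assignment;

    TaskAttemptAssignment(Map<Integer, List<String>> initial) {
        this.assignment = new ConcurrentHashMap<>(initial);
    }

    // Each AddBlockEvent asks for the servers at creation time,
    // so it always observes the latest assignment.
    List<String> retrieve(int partitionId) {
        return assignment.get(partitionId);
    }

    // Called after the reassignOnBlockSendFailure RPC returns the
    // whole refreshed shuffleHandleInfo: swap in the new view atomically.
    void update(Map<Integer, List<String>> latestAssignment) {
        this.assignment = new ConcurrentHashMap<>(latestAssignment);
    }
}
```

The key design choice sketched here is that events never cache server lists; they resolve them through this holder, so a reassignment is visible to every subsequent `AddBlockEvent` without touching already-created ones.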
### What changes were proposed in this pull request?

1. `shuffleHandle`: ensure thread safety when reassigning partial servers for tasks.
2. Change the `partitionIds` type from `string` to `int32` in the proto.

### Why are the changes needed?
Fix: #1608
In the current implementation of partition reassignment, the same replacement servers are shared across different stages, which will crash for apps without registry.
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UTs.
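The fix named in the PR title — share replacement servers only within one stage — can be sketched as keying replacement lookups by shuffle (stage) as well as by faulty server, instead of by faulty server alone. The class and method names below (`ReplacementTracker`, `record`, `lookup`) are illustrative placeholders, not the PR's actual classes.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: replacements are tracked per (shuffleId, faultyServer)
// pair, so one stage never sees another stage's replacement servers.
class ReplacementTracker {
    private final Map<Integer, Map<String, List<String>>> byShuffle =
        new ConcurrentHashMap<>();

    // Record the replacement servers chosen for a faulty server in one shuffle.
    void record(int shuffleId, String faultyServer, List<String> replacements) {
        byShuffle
            .computeIfAbsent(shuffleId, k -> new ConcurrentHashMap<>())
            .put(faultyServer, replacements);
    }

    // Returns null when this shuffle has no replacement yet, even if
    // another stage already replaced the same faulty server.
    List<String> lookup(int shuffleId, String faultyServer) {
        Map<String, List<String>> m = byShuffle.get(shuffleId);
        return m == null ? null : m.get(faultyServer);
    }
}
```

Under this sketch, a stage that has not registered its shuffle simply gets no replacement entry, rather than inheriting one from a different stage — which is the crash scenario the description mentions.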