
[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage #1609

Merged (3 commits) on Apr 2, 2024

Conversation

@zuston (Member) commented Mar 29, 2024

What changes were proposed in this pull request?

  1. Lock the shuffleHandle to ensure thread safety when reassigning partial servers for tasks.
  2. Only share the replacement servers for faulty servers within one stage, rather than across the whole app (a sketch follows this list).
  3. Simplify the reassignment logic: the single replacement server will be supported in the future, so remove it for now.
  4. Correct the partitionIds type from string to int in the proto definition.
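
As a minimal sketch of items 1 and 2, the idea is to cache the replacement servers per faulty server on the per-stage handle and look them up under a lock, so concurrent tasks of the same stage reuse one replacement set while other stages never see it. The class and method names below are assumptions for illustration, not the actual `ShuffleHandleInfo` code:

```java
// Hypothetical sketch, assuming a per-stage handle object; not the real ShuffleHandleInfo.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class StageReplacementTracker {
  // faulty server id -> replacement servers, scoped to one shuffle handle (one stage)
  private final Map<String, Set<String>> faultyToReplacements = new HashMap<>();

  // Synchronized so concurrent tasks of the same stage reuse one replacement set
  // instead of each requesting new servers for the same faulty server.
  synchronized Set<String> getOrRequestReplacements(
      String faultyServerId, Function<String, Set<String>> requestReplacements) {
    return faultyToReplacements.computeIfAbsent(faultyServerId, requestReplacements);
  }
}
```

Because the map lives on the per-stage handle rather than on an app-level singleton, different stages do not share each other's replacements.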

Why are the changes needed?

Fix: #1608

In the current implementation of partition reassignment, the same reassignment servers are shared across different stages, which crashes when the app has not registered with those servers.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UTs


github-actions bot commented Mar 29, 2024

Test Results

 2 349 files  +9   2 349 suites  +9   4h 29m 23s ⏱️ + 1m 0s
   909 tests +1     908 ✅ +1   1 💤 ±0  0 ❌ ±0 
10 550 runs  +9  10 536 ✅ +9  14 💤 ±0  0 ❌ ±0 

Results for commit 8d13ffb. ± Comparison against base commit cbf4f6f.


@zuston changed the title from "[#1608] fix(spark): re-assign only once for the same faulty server in one stage" to "[#1608] fix(spark): Only share the replacement servers for faulty servers in one stage" on Mar 29, 2024
@zuston requested a review from jerqi on Mar 29, 2024 06:05
@codecov-commenter commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 64.70588%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 54.91%. Comparing base (cbf4f6f) to head (8d13ffb).

Files Patch % Lines
...va/org/apache/spark/shuffle/ShuffleHandleInfo.java 68.75% 4 Missing and 1 partial ⚠️
...request/RssReassignFaultyShuffleServerRequest.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1609      +/-   ##
============================================
+ Coverage     53.92%   54.91%   +0.99%     
- Complexity     2864     2873       +9     
============================================
  Files           438      418      -20     
  Lines         24912    22566    -2346     
  Branches       2123     2126       +3     
============================================
- Hits          13433    12393    -1040     
+ Misses        10636     9400    -1236     
+ Partials        843      773      -70     


@@ -590,7 +590,7 @@ message ReassignServersReponse{

 message RssReassignFaultyShuffleServerRequest{
   int32 shuffleId = 1;
-  repeated string partitionIds = 2;
+  repeated int32 partitionIds = 2;
Contributor:

Is it enough for Spark, MR and Tez?

Member Author:

Spark is enough

Contributor:

cc @zhengchenyu Could you confirm this point?

@zuston (Member Author) commented Mar 29, 2024

PTAL @xumanbu

@xumanbu (Contributor) commented Mar 29, 2024

> PTAL @xumanbu

I have reviewed. LGTM.

@zuston changed the title from "[#1608] fix(spark): Only share the replacement servers for faulty servers in one stage" to "[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage" on Mar 29, 2024
@zuston requested a review from jerqi on Mar 29, 2024 10:06
@zuston (Member Author) commented Apr 1, 2024

If there are no further comments, could you help merge this, @jerqi? It blocks the #1610 development.

@jerqi (Contributor) commented Apr 1, 2024

Don't hurry. Let me take another look.

@zuston (Member Author) commented Apr 1, 2024

> Don't hurry. Let me take another look.

Gentle ping

@jerqi (Contributor) left a comment:

LGTM, thanks @zuston

@zuston merged commit 1051d26 into apache:master on Apr 2, 2024
41 checks passed
@zuston (Member Author) commented Apr 2, 2024

Thanks @xumanbu @jerqi for your review.

zuston added a commit that referenced this pull request May 9, 2024

### What changes were proposed in this pull request?

1. Make the write client always use the latest available assignment for subsequent writes when block reassignment happens.
2. Support multiple retries for partition reassignment (a sketch follows this list).
3. Limit the maximum number of reassigned servers for one partition.
4. Refactor the reassign RPC.
5. Rename faultyServer -> receivingFailureServer.
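
As an illustration of points 2 and 3, here is a minimal sketch of bounding reassignment attempts per partition. The class name, field names, and bookkeeping are assumptions for this example, not the actual Uniffle implementation:

```java
// Hypothetical sketch; the real limit and retry bookkeeping live elsewhere in Uniffle.
import java.util.HashMap;
import java.util.Map;

class PartitionReassignLimiter {
  // Assumed upper bound on how many replacement servers one partition may receive.
  private final int maxReassignServerNumPerPartition;
  // partitionId -> number of reassignments already performed
  private final Map<Integer, Integer> reassignCounts = new HashMap<>();

  PartitionReassignLimiter(int maxReassignServerNumPerPartition) {
    this.maxReassignServerNumPerPartition = maxReassignServerNumPerPartition;
  }

  // Returns true if this partition may be reassigned one more time (another retry).
  synchronized boolean tryAcquire(int partitionId) {
    int count = reassignCounts.getOrDefault(partitionId, 0);
    if (count >= maxReassignServerNumPerPartition) {
      return false;
    }
    reassignCounts.put(partitionId, count + 1);
    return true;
  }
}
```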

#### Whole reassignment process
![image](https://github.com/apache/incubator-uniffle/assets/8609142/8afa5386-be39-4ccb-9c10-95ffb3154939)

#### Always using the latest assignment

To achieve always using the latest assignment, I introduce `TaskAttemptAssignment` to get the latest assignment for the current task. The creation of an `AddBlockEvent` will also apply the latest assignment via `TaskAttemptAssignment`.

It is updated by the `reassignOnBlockSendFailure` RPC. That means the original reassign RPC response is refactored and replaced by the whole latest `shuffleHandleInfo`.
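
A minimal sketch of that idea follows, assuming a simple partition-to-servers map; the class, field, and method names are illustrative and not the actual `TaskAttemptAssignment` code:

```java
// Hypothetical sketch of the "always use the latest assignment" idea.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TaskAttemptAssignmentSketch {
  // partitionId -> currently assigned shuffle servers for this task attempt
  private volatile Map<Integer, List<String>> partitionToServers = new ConcurrentHashMap<>();

  // Called while building an AddBlockEvent: always reads the freshest assignment.
  List<String> retrieve(int partitionId) {
    return partitionToServers.getOrDefault(partitionId, List.of());
  }

  // Called after the reassignOnBlockSendFailure RPC returns the whole latest
  // handle info: swap the new assignment in atomically.
  void update(Map<Integer, List<String>> latestAssignment) {
    this.partitionToServers = new ConcurrentHashMap<>(latestAssignment);
  }
}
```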

### Why are the changes needed?

This PR is a subtask of #1608.

Leveraging #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters server failures or unhealthy servers. But this is not good enough, because the faulty server state is not shared with not-yet-started tasks and later `AddBlockEvent`s.

### Does this PR introduce _any_ user-facing change?

Yes. 

### How was this patch tested?

Unit and integration tests.

Integration tests are as follows:
1. `PartitionBlockDataReassignBasicTest` validates the basic reassignment mechanism.
2. `PartitionBlockDataReassignMultiTimesTest` tests partition reassignment with multiple retries.

---------

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Successfully merging this pull request may close these issues: Support client partition data reassign.