Description
Motivation
After reviewing #1445 again (partition data reassign, which is disabled by default in the master branch), I found some bugs and design problems. I will use this issue to track further optimizations.
Subtasks tracking
- [#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage #1609
- [#1608][part-2] fix(spark): avoid releasing block in advance when enable block resend #1610
- [#1373][FOLLOWUP] fix(spark): shuffle manager rpc service invalid when partition data reassign is enabled #1583
- [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign #1612
- [#1608][part-3] feat(spark3): support reading partition data from multiple reassignment servers #1615
- [#1608][part-4] feat(server)(spark3): activate partition reassign when server is inactive #1617
- [#1608][part-5] feat(spark3): always use the available assignment #1652
- [#1608][part-6] improvement(spark): verify the sent blocks count #1690
- [#1608][part-7] improvement(doc): add doc and optimize reassign config options #1693
Design thoughts
Reassign RPC between the Spark driver and executors
This part is already covered in the #1445 design doc, so I will not describe it further here.
Reassign signal propagation
In the current codebase, the latest partition -> servers reassignment plan is not propagated to tasks that start later.
To solve this, the writer will always fetch the latest partition -> servers plan. Once a reassign signal fires,
the cached shuffleHandleInfo will be updated with the plan returned by the reassign RPC.
Any task that starts after the reassigned tasks finish (say task2) will then pick up the latest plan, composed of the replacement servers plus the remaining healthy servers, and so avoid writing to the faulty servers again.
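A minimal sketch of this idea follows. The class and method names here are illustrative assumptions, not the actual Uniffle APIs; it only shows the shape of "reassign RPC result overwrites the cached plan, later tasks read the cache":

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: names are assumptions, not Uniffle's real classes.
class ShuffleHandleInfoCache {
  // shuffleId -> (partitionId -> ordered list of assigned server ids)
  private final Map<Integer, Map<Integer, List<String>>> cache = new ConcurrentHashMap<>();

  // Called when the reassign RPC returns a fresh partition -> servers plan
  // after a faulty server is detected.
  void onReassign(int shuffleId, Map<Integer, List<String>> latestPlan) {
    cache.put(shuffleId, latestPlan);
  }

  // Every newly started task reads the latest plan instead of the plan
  // captured at registration time, so it never writes to known-faulty servers.
  Map<Integer, List<String>> latestPlan(int shuffleId) {
    return cache.get(shuffleId);
  }
}
```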
Reassign multiple servers for one partition
This topic is scoped to the single-replica case.
Different partition types will use different strategies for assigning multiple servers to one partition.
For a huge partition, my hope is that once it is recognized as a huge_partition, we will request multiple reassigned servers via RPC, and each task will obtain its own partition server through a hash of its taskAttemptId,
which keeps the load balanced across those servers.
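A sketch of the hash-based selection, assuming a hypothetical helper (the class name and the plain modulo scheme are my assumptions, not the implemented logic):

```java
import java.util.List;

// Illustrative sketch: pick a stable per-task server for a huge partition
// by hashing the taskAttemptId over the reassigned server list.
final class HugePartitionServerSelector {
  static String selectServer(List<String> reassignedServers, long taskAttemptId) {
    // floorMod keeps the index non-negative even for negative hash values.
    int idx = Math.floorMod(Long.hashCode(taskAttemptId), reassignedServers.size());
    return reassignedServers.get(idx);
  }
}
```

Because the hash is a pure function of taskAttemptId, every attempt of the same task deterministically lands on the same server, while different tasks spread across the whole reassigned list.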
For a normal partition, multiple servers only arise when the partition has been reassigned several times due to faulty servers.
In this case, the task will always write to the most recently assigned server.
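The corresponding selection is trivial; a sketch with an assumed ordering (oldest first, which is my assumption about how the history is kept):

```java
import java.util.List;

// Illustrative sketch: for a normal partition that was reassigned several
// times, only the newest replacement is considered healthy for new writes.
final class NormalPartitionServerSelector {
  static String selectServer(List<String> serverHistory) {
    // serverHistory is ordered oldest -> newest; write to the tail.
    return serverHistory.get(serverHistory.size() - 1);
  }
}
```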