[CELEBORN-1496] Differentiate map results with only different stageAttemptId #2609

Closed
jiang13021 wants to merge 13 commits into apache:main from jiang13021:spark_stage_attempt_id

Conversation

jiang13021
Contributor

What changes were proposed in this pull request?

Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.
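
For illustration, a minimal sketch of this packing scheme (the class and method names here are illustrative only; the PR's actual helper, `getEncodedAttemptNumber`, appears later in the diff):

```java
// Sketch of the attempt-number packing described above (illustrative only).
// The high 16 bits carry the stage attempt, the low 16 bits the task attempt,
// so both values must stay below 65536 for the encoding to be lossless.
public final class AttemptEncodingSketch {
  static int encode(int stageAttemptId, int taskAttemptNumber) {
    return (stageAttemptId << 16) | taskAttemptNumber;
  }

  static int stageAttemptOf(int encoded) {
    return encoded >>> 16; // recover the stage attempt
  }

  static int taskAttemptOf(int encoded) {
    return encoded & 0xFFFF; // recover the task attempt
  }

  public static void main(String[] args) {
    int encoded = encode(1, 3); // stage attempt 1, task attempt 3
    System.out.println(encoded + " -> stage " + stageAttemptOf(encoded)
        + ", task " + taskAttemptOf(encoded)); // prints: 65539 -> stage 1, task 3
  }
}
```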

Why are the changes needed?

If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit
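
A hedged sketch of the kind of validation this test presumably exercises (the default values, the `spark.task.maxFailures` check, and the error message are assumptions, not the PR's literal code); the actual constructor change appears in the diff below:

```java
import org.apache.spark.SparkConf;

// Sketch only: read the limits by their string keys (which works on Spark 2.x
// and older 3.x, where no config constant is exposed) and reject values that
// would overflow the 16-bit fields of (stageAttemptId << 16) | taskAttemptNumber.
final class AttemptLimitCheckSketch {
  static void validate(SparkConf conf) {
    int maxStageAttempts = conf.getInt("spark.stage.maxConsecutiveAttempts", 4);
    int maxTaskAttempts = conf.getInt("spark.task.maxFailures", 4);
    if (maxStageAttempts >= (1 << 16) || maxTaskAttempts >= (1 << 16)) {
      throw new IllegalArgumentException(
          "Stage and task attempt limits must each fit into 16 bits, "
              + "otherwise encoded attempt numbers become ambiguous.");
    }
  }
}
```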

@@ -66,6 +66,26 @@ public class SparkShuffleManager implements ShuffleManager {
private ExecutorShuffleIdTracker shuffleIdTracker = new ExecutorShuffleIdTracker();

public SparkShuffleManager(SparkConf conf, boolean isDriver) {
int maxStageAttempts =
conf.getInt(
"spark.stage.maxConsecutiveAttempts",
jiang13021 (Contributor, author) commented on Jul 9, 2024:

`spark.stage.maxConsecutiveAttempts` only became a constant in Spark's `config` package via apache/spark#42061, so no Spark 2.x release and only some Spark 3.x releases expose it there; the config therefore has to be read by its string key.

mridulm (Contributor) left a comment:

When throwsFetchFailure is enabled this should be handled - I would suggest setting celeborn.client.spark.fetch.throwsFetchFailure to true and trying.

That flag should fix this issue (as well as allow recomputation of lost shuffle data!).

If there are specific reasons why it can't be enabled (since it is still false by default!), I would suggest:
a) work with the Spark community to enforce this limit.
b) once (a) is done, scope the change to when throwsFetchFailure is false.
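
For reference, enabling that flag is a one-line Spark configuration change; a minimal sketch (the `spark.`-prefixed key matches the commit message quoted further down):

```java
import org.apache.spark.SparkConf;

// Minimal sketch: turn on fetch-failure propagation so Spark can recompute
// lost shuffle data instead of consuming output from a stale stage attempt.
public final class EnableFetchFailureSketch {
  public static SparkConf apply(SparkConf conf) {
    return conf.set("spark.celeborn.client.spark.fetch.throwsFetchFailure", "true");
  }
}
```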

waitinfuture (Contributor) left a comment:

LGTM except for a minor change, thanks!

@@ -297,4 +297,45 @@ class CelebornFetchFailureSuite extends AnyFunSuite
sparkSession.stop()
}
}

test("celeborn spark integration test - resubmit an unordered barrier stage") {
Contributor:

This specific test will pass when we support barrier stages (even without the changes in this PR).
@RexXiong's changes did reproduce the issue - perhaps adapt it here?

waitinfuture pushed a commit that referenced this pull request Aug 12, 2024
### What changes were proposed in this pull request?

Adds support for barrier stages.
This involves two aspects:
a) If there is a task failure while executing a barrier stage, all shuffle output for that stage attempt is discarded and ignored.
b) If a barrier stage is re-executed (for example, because a child stage hit a fetch failure), all shuffle output for the previous stage attempt is discarded and ignored.

This is similar to handling of indeterminate stages when `throwsFetchFailure` is `true`.

Note that this is supported only when `spark.celeborn.client.spark.fetch.throwsFetchFailure` is `true`.

### Why are the changes needed?

As detailed in CELEBORN-1518, Celeborn currently does not support barrier stages, which are essential Apache Spark functionality in wide use by Spark users.
Enhancing Celeborn will allow its use by a wider set of Spark users.

### Does this PR introduce _any_ user-facing change?

Adds ability for Celeborn to support Apache Spark Barrier stages.

### How was this patch tested?

Existing tests, and additional tests (thanks to jiang13021 in #2609 - [see here](https://github.com/apache/celeborn/pull/2609/files#diff-e17f15fcca26ddfc412f0af159c784d72417b0f22598e1b1ebfcacd6d4c3ad35))

Closes #2639 from mridulm/fix-barrier-stage-reexecution.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
waitinfuture pushed a commit that referenced this pull request Aug 12, 2024
Adds support for barrier stages.
This involves two aspects:
a) If there is a task failure while executing a barrier stage, all shuffle output for that stage attempt is discarded and ignored.
b) If a barrier stage is re-executed (for example, because a child stage hit a fetch failure), all shuffle output for the previous stage attempt is discarded and ignored.

This is similar to handling of indeterminate stages when `throwsFetchFailure` is `true`.

Note that this is supported only when `spark.celeborn.client.spark.fetch.throwsFetchFailure` is `true`.

As detailed in CELEBORN-1518, Celeborn currently does not support barrier stages, which are essential Apache Spark functionality in wide use by Spark users.
Enhancing Celeborn will allow its use by a wider set of Spark users.

Adds ability for Celeborn to support Apache Spark Barrier stages.

Existing tests, and additional tests (thanks to jiang13021 in #2609 - [see here](https://github.com/apache/celeborn/pull/2609/files#diff-e17f15fcca26ddfc412f0af159c784d72417b0f22598e1b1ebfcacd6d4c3ad35))

Closes #2639 from mridulm/fix-barrier-stage-reexecution.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 3234bef)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
jiang13021 force-pushed the spark_stage_attempt_id branch from 069af3b to c0243d0 on August 12, 2024 07:44

codecov bot commented Aug 12, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.32%. Comparing base (ea6617c) to head (b0ac8a7).
Report is 22 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| ...cala/org/apache/celeborn/common/CelebornConf.scala | 83.34% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2609      +/-   ##
==========================================
- Coverage   39.83%   33.32%   -6.50%     
==========================================
  Files         239      310      +71     
  Lines       15026    18219    +3193     
  Branches     1362     1673     +311     
==========================================
+ Hits         5984     6070      +86     
- Misses       8711    11809    +3098     
- Partials      331      340       +9     


mridulm (Contributor) left a comment:

Looks reasonable to me.

+CC @waitinfuture and @RexXiong as well.

My only hesitation is whether we can generalize how we are handling testRandomPushForStageRerun ... it is hyper-specific to that test.

}

public static int getEncodedAttemptNumber(TaskContext context) {
return (context.stageAttemptNumber() << 16) | context.attemptNumber();
Contributor:

As we discussed earlier (I can't seem to find the ref :-) ), please do submit a PR to Apache Spark as well for this, and ensure the communities can align on this assumption.
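
For context, the assumption in question is that both components fit in 16 bits; the arithmetic below (illustrative only, not the PR's code) shows why an out-of-range task attempt number would make two distinct attempts collide:

```java
// Illustrative only: if a task attempt number ever exceeded 0xFFFF, it would
// bleed into the stage-attempt bits and two distinct (stageAttempt, taskAttempt)
// pairs would encode to the same value.
public final class EncodingCollisionDemo {
  static int encode(int stageAttempt, int taskAttempt) {
    return (stageAttempt << 16) | taskAttempt;
  }

  public static void main(String[] args) {
    // (stage 0, task 65536) and (stage 1, task 0) both encode to 65536.
    System.out.println(encode(0, 65536) == encode(1, 0)); // prints: true
  }
}
```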

Comment on lines 396 to 401
if (testRandomPushForStageRerun && shuffleId == 0 && !alreadyReadChunk) {
  alreadyReadChunk = true;
} else if (testRandomPushForStageRerun && shuffleId == 0 && alreadyReadChunk) {
  alreadyReadChunk = false;
  throw new CelebornIOException("already read chunk");
}
Contributor:

super nit:

Suggested change
if (testRandomPushForStageRerun && shuffleId == 0 && !alreadyReadChunk) {
  alreadyReadChunk = true;
} else if (testRandomPushForStageRerun && shuffleId == 0 && alreadyReadChunk) {
  alreadyReadChunk = false;
  throw new CelebornIOException("already read chunk");
}
if (testRandomPushForStageRerun) {
  if (shuffleId == 0 && !alreadyReadChunk) {
    alreadyReadChunk = true;
  } else if (shuffleId == 0 && alreadyReadChunk) {
    alreadyReadChunk = false;
    throw new CelebornIOException("already read chunk");
  }
}

RexXiong (Contributor) commented:

> Looks reasonable to me.
>
> +CC @waitinfuture and @RexXiong as well.
>
> My only hesitation is whether we can generalize how we are handling testRandomPushForStageRerun ... it is hyper-specific to that test.

Agree. IMO we should eliminate this test logic from the standard read/write code. We only need to ensure that a different attempt ID is used for shuffle writes, since distinguishing output data by attempt ID aligns with our previous approach.
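
One hypothetical way to keep such fault injection out of the production read path (the interface and names below are invented for illustration; this is not Celeborn's API):

```java
import java.io.IOException;

// Hypothetical sketch: a test-only hook the reader calls before fetching a
// chunk, so the production code path carries no test-specific branches.
@FunctionalInterface
interface ChunkReadInterceptor {
  void beforeChunkRead(int shuffleId) throws IOException;

  // Production default: do nothing.
  ChunkReadInterceptor NOOP = shuffleId -> { };
}
```

A test could then install an interceptor that throws on the second read of shuffle 0, reproducing the stage-rerun scenario without a `testRandomPushForStageRerun` branch inside the reader itself.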

RexXiong (Contributor) left a comment:

LGTM, thanks! Merging to main (v0.6.0) and branch-0.5 (v0.5.2).

@RexXiong RexXiong closed this in 3853075 Aug 30, 2024
RexXiong added a commit that referenced this pull request Aug 30, 2024
…temptId

### What changes were proposed in this pull request?
Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

### Why are the changes needed?
If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes #2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
(cherry picked from commit 3853075)
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
cfmcgrady added a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 30, 2024
…temptId

Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

No

Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes apache#2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
zaynt4606 pushed a commit to zaynt4606/celeborn that referenced this pull request Sep 2, 2024
…temptId

### What changes were proposed in this pull request?
Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

### Why are the changes needed?
If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes apache#2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
cfmcgrady pushed a commit that referenced this pull request Sep 2, 2024
…ageAttemptId

backport #2609 to branch-0.4

### What changes were proposed in this pull request?

Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

### Why are the changes needed?

If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes #2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyzantgroup.com>

Closes #2717 from cfmcgrady/CELEBORN-1496-branch-0.4.

Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
kerwin-zk pushed a commit that referenced this pull request Sep 5, 2024
### What changes were proposed in this pull request?

Introduce `spark-3.5-columnar-shuffle` module to support columnar shuffle for Spark 3.5.

Follow-up to #2710 and #2609.

### Why are the changes needed?

Tests in `CelebornColumnarShuffleReaderSuite` fail because of the changes in #2609.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`CelebornColumnarShuffleReaderSuite`

Closes #2726 from SteNicholas/CELEBORN-912.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
s0nskar pushed a commit to s0nskar/celeborn that referenced this pull request Sep 16, 2024
…temptId

### What changes were proposed in this pull request?
Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

### Why are the changes needed?
If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes apache#2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
s0nskar pushed a commit to s0nskar/celeborn that referenced this pull request Sep 16, 2024
### What changes were proposed in this pull request?

Introduce `spark-3.5-columnar-shuffle` module to support columnar shuffle for Spark 3.5.

Follow-up to apache#2710 and apache#2609.

### Why are the changes needed?

Tests in `CelebornColumnarShuffleReaderSuite` fail because of the changes in apache#2609.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`CelebornColumnarShuffleReaderSuite`

Closes apache#2726 from SteNicholas/CELEBORN-912.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
### What changes were proposed in this pull request?

Adds support for barrier stages.
This involves two aspects:
a) If there is a task failure while executing a barrier stage, all shuffle output for that stage attempt is discarded and ignored.
b) If a barrier stage is re-executed (for example, because a child stage hit a fetch failure), all shuffle output for the previous stage attempt is discarded and ignored.

This is similar to handling of indeterminate stages when `throwsFetchFailure` is `true`.

Note that this is supported only when `spark.celeborn.client.spark.fetch.throwsFetchFailure` is `true`.

### Why are the changes needed?

As detailed in CELEBORN-1518, Celeborn currently does not support barrier stages, which are essential Apache Spark functionality in wide use by Spark users.
Enhancing Celeborn will allow its use by a wider set of Spark users.

### Does this PR introduce _any_ user-facing change?

Adds ability for Celeborn to support Apache Spark Barrier stages.

### How was this patch tested?

Existing tests, and additional tests (thanks to jiang13021 in apache#2609 - [see here](https://github.com/apache/celeborn/pull/2609/files#diff-e17f15fcca26ddfc412f0af159c784d72417b0f22598e1b1ebfcacd6d4c3ad35))

Closes apache#2639 from mridulm/fix-barrier-stage-reexecution.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
…temptId

### What changes were proposed in this pull request?
Let attemptNumber = (stageAttemptId << 16) | taskAttemptNumber to differentiate map results that differ only in stageAttemptId.

### Why are the changes needed?
If we can't differentiate map tasks that differ only in stageAttemptId, shuffle read may mix the shuffle write batches of two map task attempts, causing data correctness issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut: org.apache.spark.shuffle.celeborn.SparkShuffleManagerSuite#testWrongSparkConf_MaxAttemptLimit

Closes apache#2609 from jiang13021/spark_stage_attempt_id.

Lead-authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
### What changes were proposed in this pull request?

Introduce `spark-3.5-columnar-shuffle` module to support columnar shuffle for Spark 3.5.

Follow-up to apache#2710 and apache#2609.

### Why are the changes needed?

Tests in `CelebornColumnarShuffleReaderSuite` fail because of the changes in apache#2609.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`CelebornColumnarShuffleReaderSuite`

Closes apache#2726 from SteNicholas/CELEBORN-912.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>