forked from apache/celeborn
[CELEBORN-1518] Add support for Apache Spark barrier stages
### What changes were proposed in this pull request?

Adds support for barrier stages. This involves two aspects:

a) If a task fails while executing a barrier stage, all shuffle output for that stage attempt is discarded and ignored.
b) If a barrier stage is re-executed (for example, because a child stage hit a fetch failure), all shuffle output for the previous stage attempt is discarded and ignored.

This is similar to the handling of indeterminate stages when `throwsFetchFailure` is `true`.

Note that this is supported only when `spark.celeborn.client.spark.fetch.throwsFetchFailure` is `true`.

### Why are the changes needed?

As detailed in CELEBORN-1518, Celeborn currently does not support barrier stages, an essential Apache Spark feature that is widely used by Spark users. Adding this support allows a wider set of Spark users to use Celeborn.

### Does this PR introduce _any_ user-facing change?

Adds the ability for Celeborn to support Apache Spark barrier stages.

### How was this patch tested?

Existing tests, and additional tests (thanks to jiang13021 in apache#2609 - [see here](https://github.com/apache/celeborn/pull/2609/files#diff-e17f15fcca26ddfc412f0af159c784d72417b0f22598e1b1ebfcacd6d4c3ad35)).

Closes apache#2639 from mridulm/fix-barrier-stage-reexecution.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
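The two cases in the description (a task failure inside a barrier stage, and a re-execution of the stage) both come down to invalidating a whole stage attempt. The following is a minimal sketch of that idea only. `ShuffleOutput`, `BarrierStageOutputs`, and every method on them are hypothetical names made up for illustration; they are not Celeborn classes and do not reflect how this commit actually tracks attempts.

```scala
// Hypothetical model of "discard and ignore" for barrier stage attempts.
// None of these names exist in Celeborn; this is an illustration only.
final case class ShuffleOutput(stageAttempt: Int, mapId: Int, bytes: Long)

final class BarrierStageOutputs {
  // Attempts whose output must never be served to readers.
  private var invalidAttempts: Set[Int] = Set.empty
  // Output that is currently visible to child stages.
  private var outputs: Vector[ShuffleOutput] = Vector.empty

  // Accept output unless its stage attempt was already invalidated.
  def register(out: ShuffleOutput): Boolean =
    if (invalidAttempts.contains(out.stageAttempt)) {
      false
    } else {
      outputs = outputs :+ out
      true
    }

  // Covers both cases from the description: (a) a task of this attempt
  // failed, or (b) the stage is being re-executed. Everything the attempt
  // wrote is dropped, and late writes from it are ignored.
  def invalidateAttempt(attempt: Int): Unit = {
    invalidAttempts = invalidAttempts + attempt
    outputs = outputs.filterNot(_.stageAttempt == attempt)
  }

  def visibleOutputs: Vector[ShuffleOutput] = outputs
}
```

In a real job the behavior described here is only active when the flag named in the description is enabled, for example by passing `--conf spark.celeborn.client.spark.fetch.throwsFetchFailure=true` to `spark-submit`.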
1 parent a759efb, commit 3234bef
Showing 15 changed files with 447 additions and 50 deletions.
43 changes: 43 additions & 0 deletions
client-spark/spark-2/src/main/scala/org/apache/spark/celeborn/ExceptionMakerHelper.scala
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.celeborn | ||
|
||
import org.apache.spark.shuffle.FetchFailedException | ||
|
||
import org.apache.celeborn.common.util.ExceptionMaker | ||
|
||
object ExceptionMakerHelper { | ||
|
||
val FETCH_FAILURE_ERROR_MSG = "Celeborn FetchFailure with shuffle id " | ||
|
||
val SHUFFLE_FETCH_FAILURE_EXCEPTION_MAKER = new ExceptionMaker() { | ||
override def makeFetchFailureException( | ||
appShuffleId: Int, | ||
shuffleId: Int, | ||
partitionId: Int, | ||
e: Exception): Exception = { | ||
new FetchFailedException( | ||
null, | ||
appShuffleId, | ||
-1, | ||
partitionId, | ||
FETCH_FAILURE_ERROR_MSG + appShuffleId + "/" + shuffleId, | ||
e) | ||
} | ||
} | ||
} |
44 changes: 44 additions & 0 deletions
client-spark/spark-3/src/main/scala/org/apache/spark/celeborn/ExceptionMakerHelper.scala
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.celeborn | ||
|
||
import org.apache.spark.shuffle.FetchFailedException | ||
|
||
import org.apache.celeborn.common.util.ExceptionMaker | ||
|
||
object ExceptionMakerHelper { | ||
|
||
val FETCH_FAILURE_ERROR_MSG = "Celeborn FetchFailure with shuffle id " | ||
|
||
val SHUFFLE_FETCH_FAILURE_EXCEPTION_MAKER = new ExceptionMaker() { | ||
override def makeFetchFailureException( | ||
appShuffleId: Int, | ||
shuffleId: Int, | ||
partitionId: Int, | ||
e: Exception): Exception = { | ||
new FetchFailedException( | ||
null, | ||
appShuffleId, | ||
-1, | ||
-1, | ||
partitionId, | ||
FETCH_FAILURE_ERROR_MSG + appShuffleId + "/" + shuffleId, | ||
e) | ||
} | ||
} | ||
} |
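Both `ExceptionMakerHelper` files follow the same pattern: a factory object that turns a Celeborn-side failure into the exception type the engine expects, with the message built from the app shuffle id and the Celeborn shuffle id. The sketch below reproduces that pattern without a Spark dependency so it can be run on its own. The `ExceptionMaker` trait here is a stand-in whose shape is inferred from the `override` in the diff, and `DemoFetchFailure` and `DemoExceptionMaker` are hypothetical placeholders for Spark's `FetchFailedException` and the real helper. The two `-1` arguments mirror the spark-3 file; the spark-2 file passes a single `-1` in that position.

```scala
// Stand-in for org.apache.celeborn.common.util.ExceptionMaker; the shape is
// an assumption inferred from the override shown in the diff.
trait ExceptionMaker {
  def makeFetchFailureException(
      appShuffleId: Int,
      shuffleId: Int,
      partitionId: Int,
      e: Exception): Exception
}

// Hypothetical placeholder for Spark's FetchFailedException, so the sketch
// runs without Spark on the classpath.
final class DemoFetchFailure(
    val shuffleId: Int,
    val mapId: Long,
    val mapIndex: Int,
    val reduceId: Int,
    message: String,
    cause: Throwable)
  extends Exception(message, cause)

object DemoExceptionMaker extends ExceptionMaker {
  val FetchFailureErrorMsg = "Celeborn FetchFailure with shuffle id "

  override def makeFetchFailureException(
      appShuffleId: Int,
      shuffleId: Int,
      partitionId: Int,
      e: Exception): Exception =
    new DemoFetchFailure(
      appShuffleId,
      -1L, // map id is not known at this point
      -1,  // map index is not known at this point
      partitionId,
      FetchFailureErrorMsg + appShuffleId + "/" + shuffleId,
      e)
}
```

Calling `DemoExceptionMaker.makeFetchFailureException(3, 7, 5, cause)` yields an exception whose message ends in `3/7`, which makes it possible to tell the Spark-side shuffle id from the Celeborn-side one when reading logs.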