[SPARK-24589][core] Correctly identify tasks in output commit coordinator. #21577
Conversation
…ator.

When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts.

The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case.

This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it.
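A minimal, illustrative sketch of the idea (the names mirror those visible in the diffs below, such as TaskIdentifier and authorizedCommitters, but this is not the actual Spark implementation): the authorized committer for each partition is keyed by both the stage attempt and the task attempt number, so a task from a different attempt of the stage cannot also be granted the commit lock.

```scala
import scala.collection.mutable

// Illustrative sketch only, not Spark's OutputCommitCoordinator.
object CommitCoordinatorSketch {
  // A committer is identified by (stage attempt, task attempt), not just the task
  // attempt number, which can repeat across stage attempts.
  case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)

  class StageState(numPartitions: Int) {
    // null means "no committer authorized yet" for that partition.
    val authorizedCommitters = new Array[TaskIdentifier](numPartitions)
  }

  private val stageStates = mutable.Map[Int, StageState]()

  def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
    stageStates(stage) = new StageState(maxPartitionId + 1)
  }

  def stageEnd(stage: Int): Unit = synchronized {
    stageStates.remove(stage)
  }

  def canCommit(stage: Int, stageAttempt: Int, partition: Int, attemptNumber: Int): Boolean =
    synchronized {
      val state = stageStates(stage)  // assumes stageStart was called for this stage
      if (state.authorizedCommitters(partition) == null) {
        // First task to ask wins the commit lock for this partition.
        state.authorizedCommitters(partition) = TaskIdentifier(stageAttempt, attemptNumber)
        true
      } else {
        // Some task, possibly from another stage attempt, already holds the lock.
        false
      }
    }
}
```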
Test build #91949 has finished for PR 21577 at commit
Test build #91950 has finished for PR 21577 at commit
This was along the lines of what I was thinking as well. Will do a full review later. From the other PR:
Are there other docs that need to be updated for the v2 datasource API? @rdblue @cloud-fan
I think Ryan's change might still be good to introduce (i.e. a change that replaces the attempt id in that code with something a little more unique), regardless of any fix here. The unit tests I added artificially re-create the calls that would lead to the situation, but I haven't tried to create a test case that would run things through the scheduler.
@@ -81,7 +81,7 @@ object SparkHadoopMapRedUtil extends Logging {
logInfo(message)
// We need to abort the task so that the driver can reschedule new attempts, if necessary
committer.abortTask(mrTaskContext)
throw new CommitDeniedException(message, stageId, splitId, taskAttemptNumber)
throw new CommitDeniedException(message, ctx.stageId(), splitId, ctx.attemptNumber())
shall we also include stage attempt number in the exception?
Sure. Was trying to minimize changes in the first version, for testing.
Yes, we need that, but it can be done in a different PR
Thanks! The fix LGTM
I'll try to create a test to exercise this in a real job (aside from the exception changes @cloud-fan suggested), but wouldn't hold my breath.
Rename the field to match what it actually is; except for the JSON-serialized version, which for backwards compatibility still uses "job" instead of "stage".
state.authorizedCommitters(partition) = TaskIdentifier(stageAttempt, attemptNumber)
true
} else {
logDebug(s"Commit denied for stage=$stage/$attemptNumber, partition=$partition: "
would be nice to include the stage attempt in the log messages as well.
I tried to create a test based on actually running a job, but I'd have to do a lot of hacking to control what the result stage does, and it was starting to feel not much better than the unit test I added here already, so I gave up.
Test build #92039 has finished for PR 21577 at commit
logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " + | ||
s"partition=$partition) failed; clearing lock") | ||
stageState.authorizedCommitters(partition) = NO_AUTHORIZED_COMMITTER | ||
stageState.authorizedCommitters(partition) = null |
Nit: why not use Option[TaskIdentifier] and None here?
Less memory usage, at least. Not sure what advantage using Option would bring here.
s"attempt: $attemptNumber") | ||
case otherReason => | ||
case _: TaskCommitDenied => | ||
logInfo(s"Task was denied committing, stage: $stage / $stageAttempt, " + |
Nit: Should this be s"$stage.$stageAttempt"?
That looks better and saves some space, so will do.
@@ -399,7 +399,8 @@ private[spark] object JsonProtocol {
("Full Stack Trace" -> exceptionFailure.fullStackTrace) ~
("Accumulator Updates" -> accumUpdates)
case taskCommitDenied: TaskCommitDenied =>
("Job ID" -> taskCommitDenied.jobID) ~
("Job ID" -> taskCommitDenied.stageID) ~
("Job Attempt Number" -> taskCommitDenied.stageAttempt) ~
Why does this use "Job" and not "Stage"?
Also, will this affect the compatibility of the history server files?
For the new property, I'm just following what the old property says, even though it's wrong. I think having Job ID and Stage Attempt Number would just be even more confusing...
And it shouldn't affect compatibility. Given the code, even an old history server would be able to read these new log files.
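A small illustration of why that holds, using json4s (which JsonProtocol is built on); the reader helper below is hypothetical, not Spark's actual parser. An old history server simply never looks at the extra field, and a reader that defaults the missing field still handles old logs:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical compatibility check, not Spark code.
object JsonCompatSketch extends App {
  implicit val formats: Formats = DefaultFormats

  // Event as written after this change (field names taken from the diff above).
  val newEvent = parse("""{"Job ID": 2, "Job Attempt Number": 1}""")
  // Event as written before this change, with no stage attempt field.
  val oldEvent = parse("""{"Job ID": 2}""")

  // Default the stage attempt to 0 when the field is absent, so old logs still parse.
  def stageAttemptOf(event: JValue): Int =
    (event \ "Job Attempt Number").toOption.map(_.extract[Int]).getOrElse(0)

  assert(stageAttemptOf(newEvent) == 1)
  assert(stageAttemptOf(oldEvent) == 0)
}
```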
+1. This fixes the commit coordinator problem where two separate tasks can be authorized. That case could lead to duplicate data (if, for example, both tasks generated unique file names using a random UUID). However, this doesn't address the problem I hit in practice, where a file was created twice and deleted once because the same task attempt number was both allowed to commit by the coordinator and denied commit by the coordinator (after the stage had finished). We still need the solution proposed in #21558 for the v2 API. But that's more of a v2 API problem because that API makes the guarantee that implementations can rely on the attempt ID.
@@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
* @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
* the maximum possible value of `context.partitionId`).
*/
private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
stageStates(stage) = new StageState(maxPartitionId + 1)
My memory is a bit rusty here, but are we changing the semantics of which task can commit here?
Couple of queries:
- Are we allowing a task from a previous stage attempt to commit for the current stage attempt (after the previous stage attempt has failed/finished)? Based on TaskIdentifier above, I think the answer is yes?
- If not, should we check and reject commit requests from tasks from an 'older' stage attempt when the current stage attempt is different?
I don't think the semantics are changing. It's always been racy, in that either of the concurrent tasks from different stage attempts may succeed first. And I'm almost sure the assumption is that both task attempts are equivalent (i.e. the output is deterministic or at least should be), so it should be fine for either to be committed.
The problem is that without this change the coordinator would allow both attempts to commit, and that is kinda bad.
There are two cases here (both not handled in existing/earlier code).
Handled in this PR:
- Stage S1 attempt A1 launched.
- Task T1_1 launched for partition P1.
- A1 fails.
- Stage S1 attempt A2 launched.
- Task T1_2 launched for partition P1.
- T1_1 finishes, and is allowed to commit.
IMO not handled in this PR:
- Stage S1 attempt A1 launched.
- Task T1_1.1 launched for partition P1.
- Task T1_1.2 launched for partition P1 (speculative).
- Task T1_1.1 committed.
- A1 fails.
- Stage S1 attempt A2 launched for some other pending partitions.
- Task T1_1.2 wants to commit.
T1_1.2 will be allowed to commit.
Now we have two tasks for the same partition successfully committing.
Did I miss something here?
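A hedged walkthrough of that second scenario, expressed against the illustrative sketch from the top of the thread (hypothetical code, not the real coordinator API): the stageEnd/stageStart pair between attempts wipes the record of T1_1.1's commit, so the late speculative task is granted the lock as well.

```scala
// Hypothetical replay of the "IMO not handled in this PR" scenario, using the sketch above.
object ScenarioTwoSketch extends App {
  import CommitCoordinatorSketch._

  stageStart(stage = 1, maxPartitionId = 0)
  // T1_1.1 (stage attempt A1) asks for partition P1 (index 0) and is authorized.
  assert(canCommit(stage = 1, stageAttempt = 1, partition = 0, attemptNumber = 1))

  // A1 fails; the coordinator forgets the stage, then attempt A2 starts.
  stageEnd(stage = 1)
  stageStart(stage = 1, maxPartitionId = 0)

  // T1_1.2, the speculative task left over from A1, asks late and is also granted.
  assert(canCommit(stage = 1, stageAttempt = 1, partition = 0, attemptNumber = 2))
  // Result: two tasks for the same partition were allowed to commit.
}
```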
T1_1.2 will not be allowed, it has a different task attempt number.
If I read @vanzin's PR right, T1_1.2 will be allowed to commit - since there is a stageEnd + stageStart in between (which clear the existing stage state).
Yeah, I think this can happen. The problem is that with the current way it's used, the output committer forgets the commit status between stage retries. I think the right thing would be for the committer to keep the stage-related state until the scheduler is done with all its attempts.
whether it will generate corrupted data when the commit process of T1_1.1 didn't finish
I don't think that's the problem. The problem is that if both the initial task and the speculative task finish successfully, but across stage attempt barriers (so the output committer is "reset" in between), both will be allowed to commit, so you get duplicate data.
So in scenario 2, once the first task finishes and is committed, the taskset manager will kill the speculative task T1_1.2. But since it sends an async message to kill the task, the task could actually try to commit after another task fails and causes the stage to remove itself from the output commit coordinator, and after it starts another stage attempt. So it could actually end up committing the task output for T1_1.2. I'm not sure this case by itself is a problem, though, since if it actually committed T1_1.1 and T1_1.2 is allowed to commit, they should have the same output and commitJob would handle it in at least most cases. The caveat there, though, would be that since T1_1.1 was committed, the second stage attempt could finish and call commitJob while T1_1.2 is committing, since Spark thinks it doesn't need to wait for T1_1.2. Anyway, this seems very unlikely but we should protect against it.
There is another case here where T1_1.1 could have just asked to be committed, but not yet committed; then if it gets delayed committing, the new stage attempt starts and T1_1.2 asks if it could commit and is granted, so then both try to commit at the same time, causing corruption.
I think the right thing would be for the committer to keep the stage-related state until the scheduler is done with all its attempts.
We should change the DAGScheduler a little bit so that, if a stage is killed and going to be re-tried, it does not clear the stage states in the output coordinator.
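A hedged sketch of that direction, again in terms of the illustrative code above rather than the real DAGScheduler/OutputCommitCoordinator hooks: stageStart keeps an existing StageState instead of replacing it, and stageEnd is only called once the scheduler is completely done with the stage.

```scala
// Hypothetical revision of the sketch's stageStart: preserve state across stage attempts.
def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
  stageStates.get(stage) match {
    case Some(existing) =>
      // A retry of a stage we already track: keep the authorized committers so that
      // stray tasks from earlier attempts stay locked out.
      require(existing.authorizedCommitters.length == maxPartitionId + 1)
    case None =>
      stageStates(stage) = new StageState(maxPartitionId + 1)
  }
}
// ...and stageEnd would only be called once the last attempt of the stage has finished.
```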
Yep, I'm going down that path. I just want to add a proper test to make sure the behavior is correct and that's a little bit tricky.
I agree @vanzin, @cloud-fan. We should remove the stage info only after the stage is done.
Should we have a separate bug for these then? I just piggybacked on the bug you filed, but if they're separate issues, even if complementary, it might be better to separate them.
FYI I plan to fix the MiMa issue later today (still fighting with 2.1 builds). Haven't decided whether to revert the change or just add excludes... probably the latter since it's a developer API.
So I think the commit/delete thing is an issue for the existing v1 and hadoop committers as well, so this doesn't fully solve the problem. Spark uses a file format like (HadoopMapReduceWriteConfigUtil/HadoopMapRedWriteConfigUtil):
I believe the same fix as the v2 one would work, using the taskAttemptId instead of the attemptNumber. In the case where we have a stage failure and a second stage attempt, the task attempt number could be the same, and thus both tasks write to the same place. If one of them fails or is told not to commit, it could delete the output which is being used by both. Need to think through all the scenarios to make sure it's covered.
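A small illustration of the distinction being discussed; the TaskContext accessors are real, but how they would be wired into the output path naming is left as an assumption here.

```scala
import org.apache.spark.TaskContext

// Illustrative only: attemptNumber() restarts from 0 for every stage attempt, so two tasks
// for the same partition launched in different stage attempts can share it and end up
// writing to (and later deleting) the same temporary output location. taskAttemptId() is
// unique across the application, so using it in the attempt identifier keeps them apart.
def describeAttempt(ctx: TaskContext): String =
  s"partition=${ctx.partitionId()} " +
    s"attemptNumber=${ctx.attemptNumber()} " +  // may collide across stage attempts
    s"taskAttemptId=${ctx.taskAttemptId()}"     // unique per task attempt in the app
```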
This in general looks good. IMO we should focus on fixing the output commit coordinator issue in this PR, and discuss the data source issue in a separate thread.
I'm fine with separating them, but we need a jira (or need to update the v2 jira) to handle all cases.
This reverts commit 5437d4a.
Yeah, I noticed that too, but I think we should perhaps file a separate jira and only do that in 2.4 and maybe 2.3.2, just to limit changes for now.
Sounds good to me (although I'm trying the change locally and unit tests are so far happy).
I filed SPARK-24611 to track some enhancements to this part of the code that have been discussed here. Of those, I'd consider the "use task IDs instead of TaskIdentifier" as something we could potentially do here, but at the same time I don't really want to delay this patch too much.
I pushed the change for that in vanzin@e6a862e, in case anyone wants to take a look.
Test build #92138 has finished for PR 21577 at commit
The code here LGTM. I was trying to make one more pass through all the scenarios but got stuck in meetings; will try to do it later tonight or tomorrow morning, but we can always have another follow-up if we find another case.
LGTM too
+1. This is a bit of a side issue I filed while looking through the scenarios: https://issues.apache.org/jira/browse/SPARK-24622. It shouldn't be a problem here though with this fix.
So, does anyone want to do the actual merging?
I will
I like it, it's simpler to use the task id to replace the stage attempt id and task attempt id. For safety we should do it in master only after this PR is merged.
…ator.

When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts.

The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case. The stage state in the coordinator is also maintained across stage retries, so that a stray speculative task from a previous stage attempt is not allowed to commit.

This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21577 from vanzin/SPARK-24552.

(cherry picked from commit c8e909c)
Signed-off-by: Thomas Graves <tgraves@apache.org>
Merged to master, 2.3, and 2.2.
@vanzin @tgravescs, after merging this PR into branch-2.2 there is an error "stageAttemptNumber is not a member of org.apache.spark.TaskContext" in SparkHadoopMapRedUtil. I think PR-20082 needs to be merged first.
Yeah sorry about that, my fault. I merged the fix - SPARK-22897
…ator [branch-2.1].

When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts.

The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case. The stage state in the coordinator is also maintained across stage retries, so that a stray speculative task from a previous stage attempt is not allowed to commit.

This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#21577 from vanzin/SPARK-24552.

(cherry picked from commit c8e909c)
Signed-off-by: Thomas Graves <tgraves@apache.org>
(cherry picked from commit 751b008)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Thanks for fixing this, @vanzin!