[SPARK-24552][core] Use unique id instead of attempt number for writes [branch-2.2]. #21616


Closed
vanzin wants to merge 2 commits into apache:branch-2.2 from vanzin:SPARK-24552-2.2

Conversation

vanzin (Contributor) commented Jun 22, 2018

This passes a unique attempt id to the Hadoop APIs, because attempt
number is reused when stages are retried. When attempt numbers are
reused, sources that track data by partition id and attempt number
may incorrectly clean up data because the same attempt number can
be both committed and aborted.
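
The fix can be sketched in a few lines. The snippet below is illustrative rather than the actual patch, and the helper name uniqueWriteId is hypothetical: the idea is to derive the id handed to the Hadoop APIs from TaskContext.taskAttemptId, which stays unique across stage retries, instead of TaskContext.attemptNumber, which restarts at 0 on each stage attempt.

    import org.apache.spark.TaskContext

    // Illustrative sketch, not the actual patch. attemptNumber restarts at 0
    // for every stage attempt, so two different physical task attempts can
    // present the same (partitionId, attemptNumber) pair to an output
    // committer, while taskAttemptId is unique within a SparkContext.
    def uniqueWriteId(context: TaskContext): Int = {
      // Hadoop expects a 32-bit task attempt id, so wrap the 64-bit Spark id
      // around with a mod, as the PairRDDFunctions snippet quoted later in
      // this thread does.
      (context.taskAttemptId % Int.MaxValue).toInt
    }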

tgravescs (Contributor) commented:

+1 pending tests

SparkQA commented Jun 22, 2018

Test build #92227 has finished for PR 21616 at commit 88679a0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2018

Test build #92236 has finished for PR 21616 at commit ab2f701.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 25, 2018
[SPARK-24552][core] Use unique id instead of attempt number for writes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21616 from vanzin/SPARK-24552-2.2.
vanzin (Contributor, Author) commented Jun 26, 2018

I should have checked first, but this doesn't merge to 2.1, and it doesn't look like 2.1 is affected anyway. There seems to be just one code path in 2.1 that hits this issue, and it already uses a similar approach:

    val writer = new SparkHadoopWriter(hadoopConf)
    writer.preSetup()

    val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
      // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
      // around by taking a mod. We expect that no task will be attempted 2 billion times.
      val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

That's in PairRDDFunctions.scala. There might be other paths affected, but at this point I'll leave it alone.

vanzin (Contributor, Author) commented Jun 26, 2018

Merged to 2.2.

vanzin closed this Jun 26, 2018
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-24552][core] Use unique id instead of attempt number for writes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#21616 from vanzin/SPARK-24552-2.2.
vanzin deleted the SPARK-24552-2.2 branch August 24, 2018 19:56