[SPARK-24552][core] Use unique id instead of attempt number for writes [branch-2.2]. #21616


Closed
vanzin wants to merge 2 commits into apache:branch-2.2 from vanzin:SPARK-24552-2.2

Conversation

vanzin (Contributor) commented Jun 22, 2018

This passes a unique attempt id to the Hadoop APIs, because attempt
number is reused when stages are retried. When attempt numbers are
reused, sources that track data by partition id and attempt number
may incorrectly clean up data because the same attempt number can
be both committed and aborted.
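
The fix can be sketched in a few lines. The snippet below is illustrative rather than the actual patch, and the helper name uniqueWriteId is hypothetical: the idea is to derive the id handed to the Hadoop APIs from TaskContext.taskAttemptId, which stays unique across stage retries, instead of TaskContext.attemptNumber, which restarts at 0 on each stage attempt.

    import org.apache.spark.TaskContext

    // Illustrative sketch, not the actual patch. attemptNumber restarts at 0
    // for every stage attempt, so two different physical task attempts can
    // present the same (partitionId, attemptNumber) pair to an output
    // committer, while taskAttemptId is unique within a SparkContext.
    def uniqueWriteId(context: TaskContext): Int = {
      // Hadoop expects a 32-bit task attempt id, so wrap the 64-bit Spark id
      // around with a mod, as the PairRDDFunctions snippet quoted later in
      // this thread does.
      (context.taskAttemptId % Int.MaxValue).toInt
    }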

tgravescs (Contributor) commented:

+1 pending tests

SparkQA commented Jun 22, 2018

Test build #92227 has finished for PR 21616 at commit 88679a0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2018

Test build #92236 has finished for PR 21616 at commit ab2f701.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 25, 2018
[SPARK-24552][core] Use unique id instead of attempt number for writes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21616 from vanzin/SPARK-24552-2.2.
vanzin (Contributor, Author) commented Jun 26, 2018

I should have checked first, but this doesn't merge to 2.1, and it doesn't look like 2.1 is affected anyway. There seems to be just one code path in 2.1 that hits this issue, and it already uses a similar approach:

    val writer = new SparkHadoopWriter(hadoopConf)
    writer.preSetup()

    val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
      // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
      // around by taking a mod. We expect that no task will be attempted 2 billion times.
      val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

That's in PairRDDFunctions.scala. There might be other paths affected, but at this point I'll leave it alone.

vanzin (Contributor, Author) commented Jun 26, 2018

Merged to 2.2.

vanzin closed this Jun 26, 2018
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-24552][core] Use unique id instead of attempt number for writes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#21616 from vanzin/SPARK-24552-2.2.
vanzin deleted the SPARK-24552-2.2 branch August 24, 2018 19:56