
[SPARK-12087][Streaming] Create new JobConf for every batch in saveAsHadoopFiles #10088


Closed
tdas wants to merge 1 commit into apache:master from tdas:SPARK-12087

Conversation

@tdas (Contributor) commented Dec 2, 2015

The JobConf object created in DStream.saveAsHadoopFiles is used concurrently in multiple places:

  • The JobConf is updated by RDD.saveAsHadoopFile() before the job is launched.
  • The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (one thread updating it while another thread serializes it) can lead to a ConcurrentModificationException in the underlying Java HashMap used by the internal Hadoop Configuration object.

The solution is to create a new JobConf for every batch; that fresh copy is the one updated by RDD.saveAsHadoopFile(), while checkpointing serializes the original JobConf.

Tests to be added in #9988 fail reliably without this patch. The patch is kept deliberately small so that it can be backported to previous branches.
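The fix pattern can be sketched as follows (an illustrative simplification, not the actual Spark source; `baseConf` and `saveEachBatch` are hypothetical names). The closure captures the original configuration untouched and builds a fresh `JobConf` for each batch, so the object that `RDD.saveAsHadoopFile()` mutates is never the same object that DStream checkpointing serializes:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Hypothetical per-batch save function, sketching the pattern of the fix.
def saveEachBatch(baseConf: Configuration): Unit = {
  // Before the fix: one shared JobConf was closed over, mutated by
  // saveAsHadoopFile() and concurrently serialized by the checkpoint writer.
  // After the fix: each batch builds its own copy, isolating the mutation.
  val perBatchJobConf = new JobConf(baseConf) // fresh, mutable copy per batch
  // rdd.saveAsHadoopFile(s"$prefix-$time", keyClass, valueClass,
  //   outputFormatClass, perBatchJobConf)
}
```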

@tdas (Contributor, Author) commented Dec 2, 2015

@zsxwing Please take a look. This should be merged to older branches if possible, and it blocks #9988.

@SparkQA commented Dec 2, 2015

Test build #47032 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 2, 2015

Test build #2147 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Dec 2, 2015

LGTM

@zsxwing (Member) commented Dec 2, 2015

merging it

asfgit pushed a commit that referenced this pull request Dec 2, 2015
[SPARK-12087][Streaming] Create new JobConf for every batch in saveAsHadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched.
* The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (one thread updating it while another thread serializes it) can lead to a ConcurrentModificationException in the underlying Java HashMap used by the internal Hadoop Configuration object.

The solution is to create a new JobConf for every batch; that fresh copy is the one updated by `RDD.saveAsHadoopFile()`, while checkpointing serializes the original JobConf.

Tests to be added in #9988 fail reliably without this patch. The patch is kept deliberately small so that it can be backported to previous branches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit closed this in 8a75a30 Dec 2, 2015
@zsxwing (Member) commented Dec 2, 2015

Merged to master, 1.6, 1.5 and 1.4.
