
[SPARK-12087][Streaming] Create new JobConf for every batch in saveAsHadoopFiles #10088


Closed
tdas wants to merge 1 commit into apache:master from tdas:SPARK-12087

Conversation

@tdas (Contributor) commented Dec 2, 2015

The JobConf object created in DStream.saveAsHadoopFiles is used concurrently in multiple places:

  • The JobConf is updated by RDD.saveAsHadoopFile() before the job is launched.
  • The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (one thread updating it while another thread serializes it) can lead to a ConcurrentModificationException in the underlying Java HashMap used by the internal Hadoop Configuration object.

The solution is to create a new JobConf for every batch; that fresh copy is the one updated by RDD.saveAsHadoopFile(), while checkpointing serializes the original JobConf.

Tests to be added in #9988 fail reliably without this patch. The patch is kept deliberately small so that it can be backported to previous branches.
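The fix pattern can be sketched as follows (an illustrative simplification, not the actual Spark source; `baseConf` and `saveEachBatch` are hypothetical names). The closure captures the original configuration untouched and builds a fresh `JobConf` for each batch, so the object that `RDD.saveAsHadoopFile()` mutates is never the same object that DStream checkpointing serializes:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Hypothetical per-batch save function, sketching the pattern of the fix.
def saveEachBatch(baseConf: Configuration): Unit = {
  // Before the fix: one shared JobConf was closed over, mutated by
  // saveAsHadoopFile() and concurrently serialized by the checkpoint writer.
  // After the fix: each batch builds its own copy, isolating the mutation.
  val perBatchJobConf = new JobConf(baseConf) // fresh, mutable copy per batch
  // rdd.saveAsHadoopFile(s"$prefix-$time", keyClass, valueClass,
  //   outputFormatClass, perBatchJobConf)
}
```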

@tdas (Contributor, Author) commented Dec 2, 2015

@zsxwing Please take a look. This should be merged to older branches if possible, and it blocks #9988.

@SparkQA commented Dec 2, 2015

Test build #47032 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 2, 2015

Test build #2147 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Dec 2, 2015

LGTM

@zsxwing (Member) commented Dec 2, 2015

merging it

asfgit pushed a commit that referenced this pull request Dec 2, 2015
[SPARK-12087][Streaming] Create new JobConf for every batch in saveAsHadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched.
* The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (one thread updating it while another thread serializes it) can lead to a ConcurrentModificationException in the underlying Java HashMap used by the internal Hadoop Configuration object.

The solution is to create a new JobConf for every batch; that fresh copy is the one updated by `RDD.saveAsHadoopFile()`, while checkpointing serializes the original JobConf.

Tests to be added in #9988 fail reliably without this patch. The patch is kept deliberately small so that it can be backported to previous branches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit closed this in 8a75a30 Dec 2, 2015
@zsxwing (Member) commented Dec 2, 2015

Merged to master, 1.6, 1.5 and 1.4.
