
Commit f14f259

Add configuration option to control cloning of Hadoop JobConf.
1 parent b562451

File tree

2 files changed (+48, -5 lines)

core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala

Lines changed: 39 additions & 5 deletions
@@ -122,20 +122,54 @@ class HadoopRDD[K, V](
       minPartitions)
   }
 
+  protected val jobConfCacheKey = "rdd_%d_job_conf".format(id)
+
   protected val inputFormatCacheKey = "rdd_%d_input_format".format(id)
 
   // used to build JobTracker ID
   private val createTime = new Date()
 
+  private val shouldCloneJobConf = sc.conf.get("spark.hadoop.cloneConf", "false").toBoolean
+
   // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
   protected def getJobConf(): JobConf = {
     val conf: Configuration = broadcastedConf.value.value
-    HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
-      val newJobConf = new JobConf(conf)
-      if (!conf.isInstanceOf[JobConf]) {
-        initLocalJobConfFuncOpt.map(f => f(newJobConf))
+    if (shouldCloneJobConf) {
+      // Hadoop Configuration objects are not thread-safe, which may lead to various problems if
+      // one job modifies a configuration while another reads it (SPARK-2546). This problem occurs
+      // somewhat rarely because most jobs treat the configuration as though it's immutable. One
+      // solution, implemented here, is to clone the Configuration object. Unfortunately, this
+      // clone can be very expensive. To avoid unexpected performance regressions for workloads and
+      // Hadoop versions that do not suffer from these thread-safety issues, this cloning is
+      // disabled by default.
+      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
+        logDebug("Cloning Hadoop Configuration")
+        val newJobConf = new JobConf(conf)
+        if (!conf.isInstanceOf[JobConf]) {
+          initLocalJobConfFuncOpt.map(f => f(newJobConf))
+        }
+        newJobConf
+      }
+    } else {
+      if (conf.isInstanceOf[JobConf]) {
+        logDebug("Re-using user-broadcasted JobConf")
+        conf.asInstanceOf[JobConf]
+      } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
+        logDebug("Re-using cached JobConf")
+        HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
+      } else {
+        // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
+        // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
+        // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
+        // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
+        HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
+          logDebug("Creating new JobConf and caching it for later re-use")
+          val newJobConf = new JobConf(conf)
+          initLocalJobConfFuncOpt.map(f => f(newJobConf))
+          HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
+          newJobConf
+        }
       }
-      newJobConf
     }
   }
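
As a usage note (not part of the commit): a minimal sketch of a driver that opts in to the new cloning behavior. The "spark.hadoop.cloneConf" key and its "false" default come from the diff above; the app name, master, and input path are placeholder assumptions.

import org.apache.spark.{SparkConf, SparkContext}

// Opt in to defensive JobConf cloning; everything below besides the
// config key itself is a hypothetical example driver.
val conf = new SparkConf()
  .setAppName("clone-conf-demo")          // placeholder app name
  .setMaster("local[2]")                  // placeholder master
  .set("spark.hadoop.cloneConf", "true")
val sc = new SparkContext(conf)

// sc.textFile is backed by a HadoopRDD, so getJobConf() now clones the
// broadcast Configuration under CONFIGURATION_INSTANTIATION_LOCK instead
// of re-using a cached JobConf.
val lines = sc.textFile("hdfs:///tmp/input.txt") // placeholder path
println(lines.count())
sc.stop()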

docs/configuration.md

Lines changed: 9 additions & 0 deletions
@@ -590,6 +590,15 @@ Apart from these, the following properties are also available, and may be useful
     output directories. We recommend that users do not disable this except if trying to achieve compatibility with
     previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.</td>
 </tr>
+<tr>
+  <td><code>spark.hadoop.cloneConf</code></td>
+  <td>false</td>
+  <td>If set to true, clones a new Hadoop <code>Configuration</code> object for each task. This
+    option should be enabled to work around <code>Configuration</code> thread-safety issues (see
+    <a href="https://issues.apache.org/jira/browse/SPARK-2546">SPARK-2546</a> for more details).
+    This is disabled by default in order to avoid unexpected performance regressions for jobs that
+    are not affected by these issues.</td>
+</tr>
 <tr>
   <td><code>spark.executor.heartbeatInterval</code></td>
   <td>10000</td>
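
To make the documented hazard concrete, a self-contained sketch (not from the commit) of the unsynchronized access pattern SPARK-2546 and HADOOP-10456 describe. Whether the reader thread actually fails depends on the Hadoop version; hadoop-common on the classpath and Scala 2.12+ are assumed.

import org.apache.hadoop.conf.Configuration

object ConfRaceDemo { // hypothetical demo, not part of Spark
  def main(args: Array[String]): Unit = {
    val shared = new Configuration() // one Configuration shared across threads

    // The writer mutates properties while the reader iterates them; on
    // affected Hadoop versions the reader may throw
    // ConcurrentModificationException. Cloning per reader (new JobConf(conf),
    // as getJobConf does when spark.hadoop.cloneConf=true) sidesteps the race.
    val writer = new Thread(() => {
      for (i <- 1 to 100000) shared.set(s"demo.key.${i % 100}", i.toString)
    })
    val reader = new Thread(() => {
      for (_ <- 1 to 1000) {
        val it = shared.iterator()
        while (it.hasNext) it.next()
      }
    })
    writer.start(); reader.start()
    writer.join(); reader.join()
  }
}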
