@@ -8,7 +8,7 @@ title: Spark Configuration
Spark provides three locations to configure the system:

* [Spark properties](#spark-properties) control most application parameters and can be set by using
- a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object, or through Java
+ a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
  system properties.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
  the IP address, through the `conf/spark-env.sh` script on each node.
@@ -23,8 +23,8 @@ application. These properties can be set directly on a
(e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application with two threads as follows:

- Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
- which can help detect bugs that only exist when we run in a distributed context.
+ Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
+ which can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
@@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than 1 thread in local mode, and in cases like spark streaming, we may actually
- require one to prevent any sort of starvation issues.
+ require one to prevent any sort of starvation issues.
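
A complete initialization along these lines might look like the following sketch; the application name `MyApp` is purely illustrative and not taken from the change above.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "MyApp" is an illustrative application name.
val conf = new SparkConf()
  .setMaster("local[2]")  // two local threads, the "minimal" parallelism discussed above
  .setAppName("MyApp")
val sc = new SparkContext(conf)
{% endhighlight %}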

## Dynamically Loading Spark Properties
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
@@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
{% highlight bash %}
- ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
-   --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
+ ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
+   --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}
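
For comparison, the same pairs could also be hard-coded through the `set()` method described earlier. A rough sketch, equivalent in effect to the flags above (the app name is reused from that example and otherwise arbitrary):

{% highlight scala %}
import org.apache.spark.SparkConf

// Sketch: the same key-value pairs, set in code instead of on the spark-submit command line.
val conf = new SparkConf()
  .setAppName("My app")
  .set("spark.shuffle.spill", "false")
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
{% endhighlight %}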

The Spark shell and [`spark-submit`](submitting-applications.html)
@@ -123,7 +123,7 @@ of the most common options to set are:
<td>
Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
- is above this limit.
+ is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
out-of-memory errors.
@@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
Set a special library path to use when launching executor JVM's.
</td>
</tr>
+ <tr>
+ <td><code>spark.executor.logs.rolling.strategy</code></td>
+ <td>(none)</td>
+ <td>
+ Set the strategy of rolling of executor logs. By default it is disabled. It can
+ be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
+ use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
+ For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
+ the maximum file size for rolling.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.executor.logs.rolling.time.interval</code></td>
+ <td>daily</td>
+ <td>
+ Set the time interval by which the executor logs will be rolled over.
+ Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
+ any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
+ for automatic cleaning of old logs.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
+ <td>(none)</td>
+ <td>
+ Set the max size of the file by which the executor logs will be rolled over.
+ Rolling is disabled by default. Value is set in terms of bytes.
+ See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
+ for automatic cleaning of old logs.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
+ <td>(none)</td>
+ <td>
+ Sets the number of latest rolling log files that are going to be retained by the system.
+ Older log files will be deleted. Disabled by default.
+ </td>
+ </tr>
<tr>
<td><code>spark.files.userClassPathFirst</code></td>
<td>false</td>
@@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.python.profile.dump</code></td>
<td>(none)</td>
<td>
- The directory which is used to dump the profile result before driver exiting.
+ The directory which is used to dump the profile result before driver exiting.
The results will be dumped as separated file for each RDD. They can be loaded
by ptats.Stats(). If this is specified, the profile result will not be displayed
automatically.
+ </td>
</tr>
<tr>
<td><code>spark.python.worker.reuse</code></td>
@@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
<td>(none)</td>
<td>
- Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
- process. The user can specify multiple of these and to set multiple environment variables.
+ Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
+ process. The user can specify multiple of these and to set multiple environment variables.
</td>
</tr>
<tr>
@@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
<td>
The codec used to compress internal data such as RDD partitions, broadcast variables and
shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
- and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
- e.g.
- <code>org.apache.spark.io.LZ4CompressionCodec</code>,
+ and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
+ e.g.
+ <code>org.apache.spark.io.LZ4CompressionCodec</code>,
<code>org.apache.spark.io.LZFCompressionCodec</code>,
and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
</td>
@@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
(resources are executors in yarn mode, CPU cores in standalone mode)
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
- the maximum amount of time it will wait before scheduling begins is controlled by config
+ the maximum amount of time it will wait before scheduling begins is controlled by config
<code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
</td>
</tr>
@@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
<td>30000</td>
<td>
Maximum amount of time to wait for resources to register before scheduling begins
- (in milliseconds).
+ (in milliseconds).
</td>
</tr>
<tr>
@@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Whether Spark acls should are enabled. If enabled, this checks to see if the user has
- access permissions to view or modify the job. Note this requires the user to be known,
+ access permissions to view or modify the job. Note this requires the user to be known,
so if the user comes across as null no checks are done. Filters can be used with the UI
to authenticate and set the user.
</td>
@@ -1062,17 +1102,31 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.streaming.blockInterval</code></td>
<td>200</td>
<td>
- Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
- into blocks of data before storing them in Spark.
+ Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
+ into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the
+ <a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
+ tuning</a> section in the Spark Streaming programming guide for more details.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.maxRate</code></td>
<td>infinite</td>
<td>
- Maximum rate (per second) at which each receiver will push data into blocks. Effectively,
- each stream will consume at most this number of records per second.
+ Maximum number of records per second at which each receiver will receive data.
+ Effectively, each stream will consume at most this number of records per second.
Setting this configuration to 0 or a negative number will put no limit on the rate.
+ See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
+ in the Spark Streaming programming guide for more details.
</td>
</tr>
<tr>
+ <td><code>spark.streaming.receiver.writeAheadLogs.enable</code></td>
+ <td>false</td>
+ <td>
+ Enable write ahead logs for receivers. All the input data received through receivers
+ will be saved to write ahead logs that will allow it to be recovered after driver failures.
+ See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
+ in the Spark Streaming programming guide for more details.
</td>
</tr>
<tr>
@@ -1086,45 +1140,6 @@ Apart from these, the following properties are also available, and may be useful
higher memory usage in Spark.
</td>
</tr>
- <tr>
- <td><code>spark.executor.logs.rolling.strategy</code></td>
- <td>(none)</td>
- <td>
- Set the strategy of rolling of executor logs. By default it is disabled. It can
- be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
- use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
- For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
- the maximum file size for rolling.
- </td>
- </tr>
- <tr>
- <td><code>spark.executor.logs.rolling.time.interval</code></td>
- <td>daily</td>
- <td>
- Set the time interval by which the executor logs will be rolled over.
- Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
- any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
- for automatic cleaning of old logs.
- </td>
- </tr>
- <tr>
- <td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
- <td>(none)</td>
- <td>
- Set the max size of the file by which the executor logs will be rolled over.
- Rolling is disabled by default. Value is set in terms of bytes.
- See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
- for automatic cleaning of old logs.
- </td>
- </tr>
- <tr>
- <td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
- <td>(none)</td>
- <td>
- Sets the number of latest rolling log files that are going to be retained by the system.
- Older log files will be deleted. Disabled by default.
- </td>
- </tr>
</table>
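
As a hedged illustration of a few of the properties documented above, the sketch below sets size-based executor log rolling and enables the streaming receiver write ahead log through `SparkConf`. The specific values (a 128 MB rollover size, 5 retained files) are arbitrary examples rather than recommendations, and each pair could equally be passed to `spark-submit` as a `--conf` flag.

{% highlight scala %}
import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", (128 * 1024 * 1024).toString)
  .set("spark.executor.logs.rolling.maxRetainedFiles", "5")
  .set("spark.streaming.receiver.writeAheadLogs.enable", "true")
{% endhighlight %}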

#### Cluster Managers