Commit b004150

[SPARK-4806] Streaming doc update for 1.2
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the spark streaming app with information about Driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Pythn API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
1 parent 2a5b5fd commit b004150

7 files changed, 819 additions and 551 deletions

docs/_layouts/global.html

Lines changed: 7 additions & 6 deletions
@@ -33,7 +33,7 @@
 <!-- Google analytics script -->
 <script type="text/javascript">
 var _gaq = _gaq || [];
-_gaq.push(['_setAccount', 'UA-32518208-1']);
+_gaq.push(['_setAccount', 'UA-32518208-2']);
 _gaq.push(['_trackPageview']);
 
 (function() {
@@ -79,9 +79,9 @@
 <li class="dropdown">
 <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
 <ul class="dropdown-menu">
-<li><a href="api/scala/index.html#org.apache.spark.package">Scaladoc</a></li>
-<li><a href="api/java/index.html">Javadoc</a></li>
-<li><a href="api/python/index.html">Python API</a></li>
+<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
+<li><a href="api/java/index.html">Java</a></li>
+<li><a href="api/python/index.html">Python</a></li>
 </ul>
 </li>
 
@@ -91,10 +91,11 @@
 <li><a href="cluster-overview.html">Overview</a></li>
 <li><a href="submitting-applications.html">Submitting Applications</a></li>
 <li class="divider"></li>
-<li><a href="ec2-scripts.html">Amazon EC2</a></li>
-<li><a href="spark-standalone.html">Standalone Mode</a></li>
+<li><a href="spark-standalone.html">Spark Standalone</a></li>
 <li><a href="running-on-mesos.html">Mesos</a></li>
 <li><a href="running-on-yarn.html">YARN</a></li>
+<li class="divider"></li>
+<li><a href="ec2-scripts.html">Amazon EC2</a></li>
 </ul>
 </li>
 
docs/configuration.md

Lines changed: 74 additions & 59 deletions
@@ -8,7 +8,7 @@ title: Spark Configuration
 Spark provides three locations to configure the system:
 
 * [Spark properties](#spark-properties) control most application parameters and can be set by using
-a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object, or through Java
+a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
 system properties.
 * [Environment variables](#environment-variables) can be used to set per-machine settings, such as
 the IP address, through the `conf/spark-env.sh` script on each node.
@@ -23,8 +23,8 @@ application. These properties can be set directly on a
 (e.g. master URL and application name), as well as arbitrary key-value pairs through the
 `set()` method. For example, we could initialize an application with two threads as follows:
 
-Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
-which can help detect bugs that only exist when we run in a distributed context.
+Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
+which can help detect bugs that only exist when we run in a distributed context.
 
 {% highlight scala %}
 val conf = new SparkConf()
@@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
 {% endhighlight %}
 
 Note that we can have more than 1 thread in local mode, and in cases like spark streaming, we may actually
-require one to prevent any sort of starvation issues.
+require one to prevent any sort of starvation issues.
 
 ## Dynamically Loading Spark Properties
 In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
@@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())
 
 Then, you can supply configuration values at runtime:
 {% highlight bash %}
-./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
-  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
+./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
 {% endhighlight %}
 
 The Spark shell and [`spark-submit`](submitting-applications.html)
@@ -123,7 +123,7 @@ of the most common options to set are:
 <td>
 Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
 Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
-is above this limit.
+is above this limit.
 Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
 and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
 out-of-memory errors.
@@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
 Set a special library path to use when launching executor JVM's.
 </td>
 </tr>
+<tr>
+<td><code>spark.executor.logs.rolling.strategy</code></td>
+<td>(none)</td>
+<td>
+Set the strategy of rolling of executor logs. By default it is disabled. It can
+be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
+use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
+For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
+the maximum file size for rolling.
+</td>
+</tr>
+<tr>
+<td><code>spark.executor.logs.rolling.time.interval</code></td>
+<td>daily</td>
+<td>
+Set the time interval by which the executor logs will be rolled over.
+Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
+any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
+for automatic cleaning of old logs.
+</td>
+</tr>
+<tr>
+<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
+<td>(none)</td>
+<td>
+Set the max size of the file by which the executor logs will be rolled over.
+Rolling is disabled by default. Value is set in terms of bytes.
+See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
+for automatic cleaning of old logs.
+</td>
+</tr>
+<tr>
+<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
+<td>(none)</td>
+<td>
+Sets the number of latest rolling log files that are going to be retained by the system.
+Older log files will be deleted. Disabled by default.
+</td>
+</tr>
 <tr>
 <td><code>spark.files.userClassPathFirst</code></td>
 <td>false</td>
@@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
 <td><code>spark.python.profile.dump</code></td>
 <td>(none)</td>
 <td>
-The directory which is used to dump the profile result before driver exiting.
+The directory which is used to dump the profile result before driver exiting.
 The results will be dumped as separated file for each RDD. They can be loaded
 by ptats.Stats(). If this is specified, the profile result will not be displayed
 automatically.
+</td>
 </tr>
 <tr>
 <td><code>spark.python.worker.reuse</code></td>
@@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
 <td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
 <td>(none)</td>
 <td>
-Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
-process. The user can specify multiple of these and to set multiple environment variables.
+Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
+process. The user can specify multiple of these and to set multiple environment variables.
 </td>
 </tr>
 <tr>
@@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
 <td>
 The codec used to compress internal data such as RDD partitions, broadcast variables and
 shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
-and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
-e.g.
-<code>org.apache.spark.io.LZ4CompressionCodec</code>,
+and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
+e.g.
+<code>org.apache.spark.io.LZ4CompressionCodec</code>,
 <code>org.apache.spark.io.LZFCompressionCodec</code>,
 and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
 </td>
@@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
 (resources are executors in yarn mode, CPU cores in standalone mode)
 to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
 Regardless of whether the minimum ratio of resources has been reached,
-the maximum amount of time it will wait before scheduling begins is controlled by config
+the maximum amount of time it will wait before scheduling begins is controlled by config
 <code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
 </td>
 </tr>
@@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
 <td>30000</td>
 <td>
 Maximum amount of time to wait for resources to register before scheduling begins
-(in milliseconds).
+(in milliseconds).
 </td>
 </tr>
 <tr>
@@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
 <td>false</td>
 <td>
 Whether Spark acls should are enabled. If enabled, this checks to see if the user has
-access permissions to view or modify the job. Note this requires the user to be known,
+access permissions to view or modify the job. Note this requires the user to be known,
 so if the user comes across as null no checks are done. Filters can be used with the UI
 to authenticate and set the user.
 </td>
@@ -1062,17 +1102,31 @@ Apart from these, the following properties are also available, and may be useful
 <td><code>spark.streaming.blockInterval</code></td>
 <td>200</td>
 <td>
-Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
-into blocks of data before storing them in Spark.
+Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
+into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the
+<a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
+tuning</a> section in the Spark Streaming programing guide for more details.
 </td>
 </tr>
 <tr>
 <td><code>spark.streaming.receiver.maxRate</code></td>
 <td>infinite</td>
 <td>
-Maximum rate (per second) at which each receiver will push data into blocks. Effectively,
-each stream will consume at most this number of records per second.
+Maximum number records per second at which each receiver will receive data.
+Effectively, each stream will consume at most this number of records per second.
 Setting this configuration to 0 or a negative number will put no limit on the rate.
+See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
+in the Spark Streaming programing guide for mode details.
+</td>
+</tr>
+<tr>
+<td><code>spark.streaming.receiver.writeAheadLogs.enable</code></td>
+<td>false</td>
+<td>
+Enable write ahead logs for receivers. All the input data received through receivers
+will be saved to write ahead logs that will allow it to be recovered after driver failures.
+See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
+in the Spark Streaming programing guide for more details.
 </td>
 </tr>
 <tr>
@@ -1086,45 +1140,6 @@ Apart from these, the following properties are also available, and may be useful
 higher memory usage in Spark.
 </td>
 </tr>
-<tr>
-<td><code>spark.executor.logs.rolling.strategy</code></td>
-<td>(none)</td>
-<td>
-Set the strategy of rolling of executor logs. By default it is disabled. It can
-be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
-use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
-For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
-the maximum file size for rolling.
-</td>
-</tr>
-<tr>
-<td><code>spark.executor.logs.rolling.time.interval</code></td>
-<td>daily</td>
-<td>
-Set the time interval by which the executor logs will be rolled over.
-Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
-any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
-for automatic cleaning of old logs.
-</td>
-</tr>
-<tr>
-<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
-<td>(none)</td>
-<td>
-Set the max size of the file by which the executor logs will be rolled over.
-Rolling is disabled by default. Value is set in terms of bytes.
-See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
-for automatic cleaning of old logs.
-</td>
-</tr>
-<tr>
-<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
-<td>(none)</td>
-<td>
-Sets the number of latest rolling log files that are going to be retained by the system.
-Older log files will be deleted. Disabled by default.
-</td>
-</tr>
 </table>
 
 #### Cluster Managers

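Reviewer note: the streaming and log-rolling properties documented in this diff can be set through the same `SparkConf.set()` mechanism the configuration page already describes. The following is a minimal sketch, not part of the commit. The object name, application name, and every value are illustrative assumptions, and the write ahead log key is spelled exactly as it appears in this diff; later Spark documentation spells it `writeAheadLog`, so check the key against the Spark version you run.

```scala
import org.apache.spark.SparkConf

// Hypothetical object name; requires spark-core on the classpath.
object StreamingConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // local[2]: a receiver occupies one thread, so streaming needs at least two.
      .setMaster("local[2]")
      .setAppName("StreamingConfSketch")
      // Chunk received data into blocks every 200 ms (the documented default).
      .set("spark.streaming.blockInterval", "200")
      // Cap each receiver at an assumed 10000 records per second.
      .set("spark.streaming.receiver.maxRate", "10000")
      // Key as spelled in this diff; verify against your Spark release.
      .set("spark.streaming.receiver.writeAheadLogs.enable", "true")
      // Roll executor logs hourly and keep an assumed 24 most recent files.
      .set("spark.executor.logs.rolling.strategy", "time")
      .set("spark.executor.logs.rolling.time.interval", "hourly")
      .set("spark.executor.logs.rolling.maxRetainedFiles", "24")

    // Print the effective settings, one key=value pair per line.
    println(conf.toDebugString)
  }
}
```

The same keys can instead be passed at launch with `--conf key=value` on `spark-submit`, as in the bash example in this diff, which avoids hard-coding them in the application.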