
[SPARK-4806] Streaming doc update for 1.2 #3653


Closed
wants to merge 22 commits
Commits (22)
192c7a7
Added more info about Python API, and rewrote the checkpointing section.
tdas Dec 8, 2014
e45453b
Update streaming guide, added deploying section.
tdas Dec 9, 2014
67fcffc
Added cluster mode + supervise example to submitting application guide.
tdas Dec 9, 2014
a0217c0
Changed Deploying menu layout
tdas Dec 9, 2014
17b99fb
Merge remote-tracking branch 'apache-github/master' into streaming-do…
tdas Dec 10, 2014
195852c
Updated based on Josh's comments, updated receiver reliability and de…
tdas Dec 10, 2014
aa8bb87
Small update.
tdas Dec 10, 2014
3019f3a
Fix minor Markdown formatting issues
JoshRosen Dec 10, 2014
f015397
Minor grammar / pluralization fixes.
JoshRosen Dec 10, 2014
65f66cd
Fix broken link to fault-tolerance semantics section.
JoshRosen Dec 10, 2014
a4ef126
Restructure parts of the fault-tolerance section to read a bit nicer …
JoshRosen Dec 10, 2014
b8c8382
minor fixes
JoshRosen Dec 10, 2014
b9c8c24
Merge pull request #26 from JoshRosen/streaming-programming-guide
tdas Dec 10, 2014
5707581
Added Python API badge
tdas Dec 11, 2014
91aa5aa
Improved API Docs menu
tdas Dec 11, 2014
2f3178c
Added more information about writing reliable receivers in the custom…
tdas Dec 11, 2014
2184729
Updated Kafka and Flume guides with reliability information.
tdas Dec 11, 2014
7787209
Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into…
tdas Dec 11, 2014
f746951
Mentioned performance problem with WAL
tdas Dec 11, 2014
ca19078
Minor change
tdas Dec 11, 2014
ce299e4
Minor update.
tdas Dec 11, 2014
f53154a
Addressed Josh's comments.
tdas Dec 11, 2014
13 changes: 7 additions & 6 deletions docs/_layouts/global.html
@@ -33,7 +33,7 @@
<!-- Google analytics script -->
<script type="text/javascript">
var _gaq = _gaq || [];
- _gaq.push(['_setAccount', 'UA-32518208-1']);
+ _gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);

(function() {
@@ -79,9 +79,9 @@
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scaladoc</a></li>
<li><a href="api/java/index.html">Javadoc</a></li>
<li><a href="api/python/index.html">Python API</a></li>
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
</ul>
</li>

@@ -91,10 +91,11 @@
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="spark-standalone.html">Standalone Mode</a></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
</ul>
</li>

133 changes: 74 additions & 59 deletions docs/configuration.md
@@ -8,7 +8,7 @@ title: Spark Configuration
Spark provides three locations to configure the system:

* [Spark properties](#spark-properties) control most application parameters and can be set by using
- a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object, or through Java
+ a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
system properties.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
@@ -23,8 +23,8 @@ application. These properties can be set directly on a
(e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application with two threads as follows:

Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
which can help detect bugs that only exist when we run in a distributed context.
Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
which can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
@@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than 1 thread in local mode, and in cases like spark streaming, we may actually
require one to prevent any sort of starvation issues.
require one to prevent any sort of starvation issues.
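For context, the Scala snippet collapsed in this hunk is the guide's two-thread example. A minimal sketch of what it sets up, assuming the standard `SparkConf` builder calls and an illustrative application name and memory setting, looks like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[2] runs Spark with two local threads: enough parallelism to surface
// bugs that only appear in a distributed context, and enough to keep a
// streaming receiver from starving the processing of received data.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")          // illustrative application name
  .set("spark.executor.memory", "1g")   // arbitrary key-value pair via set()
val sc = new SparkContext(conf)
```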

## Dynamically Loading Spark Properties
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
@@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}

The Spark shell and [`spark-submit`](submitting-applications.html)
@@ -123,7 +123,7 @@ of the most common options to set are:
<td>
Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
is above this limit.
is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
out-of-memory errors.
@@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
Set a special library path to use when launching executor JVM's.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
Set the strategy of rolling of executor logs. By default it is disabled. It can
be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
the maximum file size for rolling.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.time.interval</code></td>
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
<td>(none)</td>
<td>
Set the max size of the file by which the executor logs will be rolled over.
Rolling is disabled by default. Value is set in terms of bytes.
See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
<td>(none)</td>
<td>
Sets the number of latest rolling log files that are going to be retained by the system.
Older log files will be deleted. Disabled by default.
</td>
</tr>
<tr>
<td><code>spark.files.userClassPathFirst</code></td>
<td>false</td>
@@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.python.profile.dump</code></td>
<td>(none)</td>
<td>
The directory which is used to dump the profile result before driver exiting.
The directory which is used to dump the profile result before driver exiting.
The results will be dumped as separated file for each RDD. They can be loaded
by ptats.Stats(). If this is specified, the profile result will not be displayed
automatically.
</td>
</tr>
<tr>
<td><code>spark.python.worker.reuse</code></td>
@@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
<td>(none)</td>
<td>
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
process. The user can specify multiple of these and to set multiple environment variables.
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
process. The user can specify multiple of these and to set multiple environment variables.
</td>
</tr>
<tr>
@@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
<td>
The codec used to compress internal data such as RDD partitions, broadcast variables and
shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
e.g.
<code>org.apache.spark.io.LZ4CompressionCodec</code>,
and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
e.g.
<code>org.apache.spark.io.LZ4CompressionCodec</code>,
<code>org.apache.spark.io.LZFCompressionCodec</code>,
and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
</td>
@@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
(resources are executors in yarn mode, CPU cores in standalone mode)
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
the maximum amount of time it will wait before scheduling begins is controlled by config
the maximum amount of time it will wait before scheduling begins is controlled by config
<code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
</td>
</tr>
@@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
<td>30000</td>
<td>
Maximum amount of time to wait for resources to register before scheduling begins
(in milliseconds).
(in milliseconds).
</td>
</tr>
<tr>
@@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Whether Spark acls should are enabled. If enabled, this checks to see if the user has
access permissions to view or modify the job. Note this requires the user to be known,
access permissions to view or modify the job. Note this requires the user to be known,
so if the user comes across as null no checks are done. Filters can be used with the UI
to authenticate and set the user.
</td>
@@ -1062,17 +1102,31 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.streaming.blockInterval</code></td>
<td>200</td>
<td>
- Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
- into blocks of data before storing them in Spark.
+ Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
+ into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the
+ <a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
+ tuning</a> section in the Spark Streaming programing guide for more details.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.maxRate</code></td>
<td>infinite</td>
<td>
- Maximum rate (per second) at which each receiver will push data into blocks. Effectively,
- each stream will consume at most this number of records per second.
+ Maximum number records per second at which each receiver will receive data.
+ Effectively, each stream will consume at most this number of records per second.
+ Setting this configuration to 0 or a negative number will put no limit on the rate.
+ See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
+ in the Spark Streaming programing guide for mode details.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.writeAheadLogs.enable</code></td>
<td>false</td>
<td>
Enable write ahead logs for receivers. All the input data received through receivers
will be saved to write ahead logs that will allow it to be recovered after driver failures.
See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
in the Spark Streaming programing guide for more details.
</td>
</tr>
<tr>
Expand All @@ -1086,45 +1140,6 @@ Apart from these, the following properties are also available, and may be useful
higher memory usage in Spark.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
Set the strategy of rolling of executor logs. By default it is disabled. It can
be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
the maximum file size for rolling.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.time.interval</code></td>
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
<td>(none)</td>
<td>
Set the max size of the file by which the executor logs will be rolled over.
Rolling is disabled by default. Value is set in terms of bytes.
See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
<td>(none)</td>
<td>
Sets the number of latest rolling log files that are going to be retained by the system.
Older log files will be deleted. Disabled by default.
</td>
</tr>
</table>
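The rows added above document the executor log rolling properties together with the new Spark Streaming settings (block interval, receiver rate limit, and the receiver write ahead log switch). As a rough sketch of how they might be combined, with arbitrary values and the property names spelled exactly as in this version of the table:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; see the table above for defaults and semantics.
val conf = new SparkConf()
  // Roll executor logs hourly and keep only the three newest files.
  .set("spark.executor.logs.rolling.strategy", "time")
  .set("spark.executor.logs.rolling.time.interval", "hourly")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
  // Streaming: 100 ms blocks for more receive-side parallelism, a cap of
  // 10000 records/sec per receiver, and write ahead logs for receiver data.
  .set("spark.streaming.blockInterval", "100")
  .set("spark.streaming.receiver.maxRate", "10000")
  .set("spark.streaming.receiver.writeAheadLogs.enable", "true")
```

The same keys could equally be passed as `--conf key=value` arguments to `spark-submit` or placed in `conf/spark-defaults.conf`, as described earlier in this guide.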

#### Cluster Managers