
Commit 7787209

Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
Conflicts: docs/streaming-programming-guide.md

2 parents: 2184729 + b9c8c24

File tree: 1 file changed, +35 −33 lines


docs/streaming-programming-guide.md

Lines changed: 35 additions & 33 deletions
@@ -51,8 +51,9 @@ different languages.
 **Note:** Python API for Spark Streaming has been introduced in Spark 1.2. It has all the DStream
 transformations and almost all the output operations available in Scala and Java interfaces.
 However, it has only support for basic sources like text files and text data over sockets.
-API for creating more sources like Kafka, and Flume will be available in future.
-Further information about available features in Python API are mentioned throughout this
+<<<<<<< HEAD
+APIs for additional sources, like Kafka and Flume, will be available in the future.
+Further information about available features in the Python API are mentioned throughout this
 document; look out for the tag
 <span class="badge" style="background-color: grey">Python API</span>.
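
For reference, the "basic sources" this note describes can be driven from Python with a minimal sketch along the lines of the guide's own network word count example; the host, port, and application name below are placeholders, not part of the commit:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local master with two threads: one for the socket receiver, one for processing.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

# Text data over a TCP socket -- one of the basic sources the Python API supports.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```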

@@ -622,7 +623,7 @@ as well as, to run the receiver(s).
 a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will
 be used to run the receiver, leaving no thread for processing the received data. Hence, when
 running locally, always use "local[*n*]" as the master URL where *n* > number of receivers to run
-(see [Spark Properties] (configuration.html#spark-properties.html for information on how to set
+(see [Spark Properties](configuration.html#spark-properties.html) for information on how to set
 the master).

 - Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming
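
To make the "local[*n*]" requirement above concrete, here is a hedged sketch: an application with two receiver-based streams needs a local master with more than two threads (the hostnames are placeholders):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Two socket receivers occupy two threads permanently, so request more than
# two (e.g. "local[4]") to leave threads free for processing the received data.
conf = SparkConf().setMaster("local[4]").setAppName("MultiReceiverApp")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

stream1 = ssc.socketTextStream("host1.example.com", 9999)  # receiver 1
stream2 = ssc.socketTextStream("host2.example.com", 9999)  # receiver 2
```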
@@ -676,7 +677,7 @@ For more details on streams from sockets, files, and actors,
 see the API documentations of the relevant functions in
 [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
 Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
-for Java, and [StreamingContext].
+for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) for Python.

 ### Advanced Sources
 {:.no_toc}
@@ -1517,7 +1518,7 @@ sliding interval of a DStream is good setting to try.
 ***

 ## Deploying Applications
-This section discussed the steps to deploy a Spark Streaming applications.
+This section discusses the steps to deploy a Spark Streaming application.

 ### Requirements
 {:.no_toc}
@@ -1571,7 +1572,7 @@ To run a Spark Streaming applications, you need to have the following.
 feature of write ahead logs. If enabled, all the data received from a receiver gets written into
 a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
 recovery, thus allowing zero data loss guarantees which is discussed in detail in the
-[Fault-tolerant Semantics](#fault-tolerant-semantics) section. Enable this by setting the
+[Fault-tolerance Semantics](#fault-tolerance-semantics) section. Enable this by setting the
 [configuration parameter](configuration.html#spark-streaming)
 `spark.streaming.receiver.writeAheadLogs.enable` to `true`.
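
Putting the pieces together, a minimal sketch of enabling the write ahead log from Python might look as follows; the property name is taken verbatim from the guide text above, and the application name and checkpoint path are placeholders:

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enable the experimental (Spark 1.2) receiver write ahead log.
conf = (SparkConf()
        .setAppName("DurableStreamingApp")
        .set("spark.streaming.receiver.writeAheadLogs.enable", "true"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

# Received data is logged under the checkpoint directory, so it must point
# at a fault-tolerant file system such as HDFS.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/streaming-app")
```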

@@ -1617,7 +1618,7 @@ receivers are active, number of records received, receiver error, etc.)
 and completed batches (batch processing times, queueing delays, etc.). This can be used to
 monitor the progress of the streaming application.

-The following two metrics in web UI are particularly important -
+The following two metrics in web UI are particularly important:

 - *Processing Time* - The time to process each batch of data.
 - *Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
@@ -1710,12 +1711,12 @@ before further processing.
 {:.no_toc}
 Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the
-[config property](configuration.html#spark-properties) `spark.default.parallelism`.
-You can pass the level of parallelism as an argument (see
+and `reduceByKeyAndWindow`, the default number of parallel tasks is controlled by
+the `spark.default.parallelism` [configuration property](configuration.html#spark-properties). You
+can pass the level of parallelism as an argument (see
 [`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
-documentation), or set the [config property](configuration.html#spark-properties)
-`spark.default.parallelism` to change the default.
+documentation), or set the `spark.default.parallelism`
+[configuration property](configuration.html#spark-properties) to change the default.

 ### Data Serialization
 {:.no_toc}
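
Both ways of setting the parallelism described in the hunk above can be sketched in Python; the thread count, host, port, and the value 16 are illustrative assumptions only:

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Option 1: raise the default used by distributed reduces such as reduceByKey.
conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("ParallelismDemo")
        .set("spark.default.parallelism", "16"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

# Option 2: pass the level of parallelism per operation.
pairs = (ssc.socketTextStream("localhost", 9999)
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1)))
counts = pairs.reduceByKey(lambda a, b: a + b, 16)  # 16 parallel reduce tasks
```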
@@ -1811,72 +1812,73 @@ consistent batch processing times.
 ***************************************************************************************************

 # Fault-tolerance Semantics
-In this section, we will discuss the behavior of Spark Streaming application in the event
-of a node failure. To understand this, let us remember the basic fault-tolerance semantics of
+In this section, we will discuss the behavior of Spark Streaming applications in the event
+of node failures. To understand this, let us remember the basic fault-tolerance semantics of
 Spark's RDDs.

 1. An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
 remembers the lineage of deterministic operations that were used on a fault-tolerant input
 dataset to create it.
 1. If any partition of an RDD is lost due to a worker node failure, then that partition can be
 re-computed from the original fault-tolerant dataset using the lineage of operations.
-1. Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
-will always be the same irrespective of failures in Spark cluster.
+1. Assuming that all of the RDD transformations are deterministic, the data in the final transformed
+RDD will always be the same irrespective of failures in the Spark cluster.

 Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
-all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
+all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
 the case for Spark Streaming as the data in most cases is received over the network (except when
-`fileStream` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
+`fileStream` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
 the received data is replicated among multiple Spark executors in worker nodes in the cluster
 (default replication factor is 2). This leads to two kinds of data in the
-system that needs to recovered in the event of a failure.
+system that needs to be recovered in the event of failures:

 1. *Data received and replicated* - This data survives failure of a single worker node as a copy
 of it exists on one of the nodes.
 1. *Data received but buffered for replication* - Since this is not replicated,
 the only way to recover that data is to get it again from the source.

-Furthermore, there are two kinds of failures that we should be concerned about.
+Furthermore, there are two kinds of failures that we should be concerned about:

-1. *Failure of a Worker Node* - Any of the workers in the cluster can fail,
-and all in-memory data on that node will be lost. If there are any receiver running on that
-node, all buffered data will be lost.
+1. *Failure of a Worker Node* - Any of the worker nodes running executors can fail,
+and all in-memory data on those nodes will be lost. If any receivers were running on failed
+nodes, then their buffered data will be lost.
 1. *Failure of the Driver Node* - If the driver node running the Spark Streaming application
-fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
+fails, then obviously the SparkContext is lost, and all executors with their in-memory
 data are lost.

 With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.

 ## Semantics with files as input source
 {:.no_toc}
-In this case, since all the input data is already present in a fault-tolerant files system like
+If all of the input data is already present in a fault-tolerant file system like
 HDFS, Spark Streaming can always recover from any failure and process all the data. This gives
 *exactly-once* semantics, that all the data will be processed exactly once no matter what fails.

 ## Semantics with input sources based on receivers
 {:.no_toc}
-Here we will first discuss the semantics in the context of different types of failures. As we
-discussed [earlier](#receiver-reliability), there are two kinds of receivers.
+For input sources based on receivers, the fault-tolerance semantics depend on both the failure
+scenario and the type of receiver.
+As we discussed [earlier](#receiver-reliability), there are two types of receivers:

 1. *Reliable Receiver* - These receivers acknowledge reliable sources only after ensuring that
 the received data has been replicated. If such a receiver fails,
 the buffered (unreplicated) data does not get acknowledged to the source. If the receiver is
-restarted, the source would resend the data, and so no data will be lost due to the failure.
+restarted, the source will resend the data, and therefore no data will be lost due to the failure.
 1. *Unreliable Receiver* - Such receivers can lose data when they fail due to worker
 or driver failures.

 Depending on what type of receivers are used we achieve the following semantics.
 If a worker node fails, then there is no data loss with reliable receivers. With unreliable
 receivers, data received but not replicated can get lost. If the driver node fails,
-then besides these losses, all the past data that were received and replicated in memory will be
+then besides these losses, all the past data that was received and replicated in memory will be
 lost. This will affect the results of the stateful transformations.

-To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of write
-ahead logs, that saves the received data to a fault-tolerant storage. With the [write ahead logs
+To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of _write
+ahead logs_ which saves the received data to fault-tolerant storage. With the [write ahead logs
 enabled](#deploying-applications) and reliable receivers, there is zero data loss and
 exactly-once semantics.

-The following table summarizes the semantics under failures.
+The following table summarizes the semantics under failures:

 <table class="table">
 <tr>
@@ -2006,5 +2008,5 @@ package and renamed for better clarity.

 * More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming)
 and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming)
-and [Python] ({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
+and [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
 * [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) and [video](http://youtu.be/g171ndOHgJ0) describing Spark Streaming.
