
Commit b9c8c24

Merge pull request #26 from JoshRosen/streaming-programming-guide
Minor edits in Streaming Programming Guide
2 parents aa8bb87 + b8c8382

File tree

1 file changed: +36 -35 lines changed


docs/streaming-programming-guide.md

Lines changed: 36 additions & 35 deletions
@@ -51,8 +51,8 @@ different languages.
 **Note:** *Python API has been introduced in Spark 1.2. It has all the DStream transformations
 and almost all the output operations available in Scala and Java interfaces.
 However, it has only support for basic sources like text files and text data over sockets.
-API for creating more sources like Kafka, and Flume will be available in future.
-Further information about available features in Python API are mentioned throughout this
+APIs for additional sources, like Kafka and Flume, will be available in the future.
+Further information about available features in the Python API is mentioned throughout this
 document; look out for the tag* "**Note on Python API**".

 ***************************************************************************************************
@@ -622,7 +622,7 @@ as well as, to run the receiver(s).
 a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will
 be used to run the receiver, leaving no thread for processing the received data. Hence, when
 running locally, always use "local[*n*]" as the master URL where *n* > number of receivers to run
-(see [Spark Properties] (configuration.html#spark-properties.html for information on how to set
+(see [Spark Properties](configuration.html#spark-properties.html) for information on how to set
 the master).

 - Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming
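
As a rough Scala sketch of the advice in the hunk above (app name, host, and port are illustrative, not taken from the guide): one receiver plus at least one processing thread means at least `local[2]`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One receiver (the socket stream below) occupies one thread, so at least
// "local[2]" is needed to leave a thread free for processing the batches.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)  // a single receiver
lines.flatMap(_.split(" ")).count().print()

ssc.start()
ssc.awaitTermination()
```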
@@ -667,7 +667,7 @@ methods for creating DStreams from files and Akka actors as input sources.
 Guide](streaming-custom-receivers.html#implementing-and-using-a-custom-actor-based-receiver) for
 more details.

-*Note on Python API:** Since actors are available only in the Java and Scala
+**Note on Python API:** Since actors are available only in the Java and Scala
 libraries, `actorStream` is not available in the Python API.

 - **Queue of RDDs as a Stream:** For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
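
A small sketch of the `queueStream` testing approach described in the bullet above; it assumes a `StreamingContext` named `ssc` already exists (e.g. from the previous sketch), and the numbers are arbitrary test data.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

// Each RDD pushed into this queue becomes one batch of the DStream.
val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

ssc.start()
for (_ <- 1 to 3) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
  }
  Thread.sleep(1000)
}
ssc.stop()
```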
@@ -676,7 +676,7 @@ For more details on streams from sockets, files, and actors,
 see the API documentations of the relevant functions in
 [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
 Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
-for Java, and [StreamingContext].
+for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) for Python.

 ### Advanced Sources
 {:.no_toc}
@@ -1506,7 +1506,7 @@ sliding interval of a DStream is good setting to try.
 ***

 ## Deploying Applications
-This section discussed the steps to deploy a Spark Streaming applications.
+This section discusses the steps to deploy a Spark Streaming application.

 ### Requirements
 {:.no_toc}
@@ -1559,7 +1559,7 @@ To run a Spark Streaming applications, you need to have the following.
 feature of write ahead logs. If enabled, all the data received from a receiver gets written into
 a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
 recovery, thus allowing zero data loss guarantees which is discussed in detail in the
-[Fault-tolerant Semantics](#fault-tolerant-semantics) section. Enable this by setting the
+[Fault-tolerance Semantics](#fault-tolerance-semantics) section. Enable this by setting the
 [configuration parameter](configuration.html#spark-streaming)
 `spark.streaming.receiver.writeAheadLogs.enable` to `true`.

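A minimal sketch of turning the flag on, using the property name exactly as this hunk spells it; the checkpoint directory (under which the write ahead logs live) is a placeholder path.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WriteAheadLogExample")
  // Property name as given in this guide; received data is then also
  // written to write ahead logs in the checkpoint directory.
  .set("spark.streaming.receiver.writeAheadLogs.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// Write ahead logs are stored under the (fault-tolerant) checkpoint directory.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/my-streaming-app")
```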
@@ -1605,7 +1605,7 @@ receivers are active, number of records received, receiver error, etc.)
 and completed batches (batch processing times, queueing delays, etc.). This can be used to
 monitor the progress of the streaming application.

-The following two metrics in web UI are particularly important -
+The following two metrics in web UI are particularly important:

 - *Processing Time* - The time to process each batch of data.
 - *Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
@@ -1698,12 +1698,12 @@ before further processing.
 {:.no_toc}
 Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the [config property]
-(configuration.html#spark-properties) `spark.default.parallelism`. You can pass the level of
-parallelism as an argument (see [`PairDStreamFunctions`]
-(api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
-documentation), or set the [config property](configuration.html#spark-properties)
-`spark.default.parallelism` to change the default.
+and `reduceByKeyAndWindow`, the default number of parallel tasks is controlled by
+the `spark.default.parallelism` [configuration property](configuration.html#spark-properties). You
+can pass the level of parallelism as an argument (see
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+documentation), or set the `spark.default.parallelism`
+[configuration property](configuration.html#spark-properties) to change the default.

 ### Data Serialization
 {:.no_toc}
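
A brief sketch of the two options the rewritten paragraph above describes: raising `spark.default.parallelism`, or passing the number of partitions directly to the operation. Host, port, and the durations are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

// Option 1: change the default used when no partition count is passed.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "16")
val ssc = new StreamingContext(conf, Seconds(10))

// Option 2: pass the level of parallelism as an argument to the operation.
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce function
  Seconds(30),               // window duration
  Seconds(10),               // slide duration
  16                         // number of parallel reduce tasks
)
counts.print()
```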
@@ -1799,72 +1799,73 @@ consistent batch processing times.
 ***************************************************************************************************

 # Fault-tolerance Semantics
-In this section, we will discuss the behavior of Spark Streaming application in the event
-of a node failure. To understand this, let us remember the basic fault-tolerance semantics of
+In this section, we will discuss the behavior of Spark Streaming applications in the event
+of node failures. To understand this, let us remember the basic fault-tolerance semantics of
 Spark's RDDs.

 1. An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
 remembers the lineage of deterministic operations that were used on a fault-tolerant input
 dataset to create it.
 1. If any partition of an RDD is lost due to a worker node failure, then that partition can be
 re-computed from the original fault-tolerant dataset using the lineage of operations.
-1. Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
-will always be the same irrespective of failures in Spark cluster.
+1. Assuming that all of the RDD transformations are deterministic, the data in the final transformed
+RDD will always be the same irrespective of failures in the Spark cluster.

 Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
-all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
+all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
 the case for Spark Streaming as the data in most cases is received over the network (except when
-`fileStream` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
+`fileStream` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
 the received data is replicated among multiple Spark executors in worker nodes in the cluster
 (default replication factor is 2). This leads to two kinds of data in the
-system that needs to recovered in the event of a failure.
+system that needs to recovered in the event of failures:

 1. *Data received and replicated* - This data survives failure of a single worker node as a copy
 of it exists on one of the nodes.
 1. *Data received but buffered for replication* - Since this is not replicated,
 the only way to recover that data is to get it again from the source.

-Furthermore, there are two kinds of failures that we should be concerned about.
+Furthermore, there are two kinds of failures that we should be concerned about:

-1. *Failure of a Worker Node* - Any of the workers in the cluster can fail,
-and all in-memory data on that node will be lost. If there are any receiver running on that
-node, all buffered data will be lost.
+1. *Failure of a Worker Node* - Any of the worker nodes running executors can fail,
+and all in-memory data on those nodes will be lost. If any receivers were running on failed
+nodes, then their buffered data will be lost.
 1. *Failure of the Driver Node* - If the driver node running the Spark Streaming application
-fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
+fails, then obviously the SparkContext is lost, and all executors with their in-memory
 data are lost.

 With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.

 ## Semantics with files as input source
 {:.no_toc}
-In this case, since all the input data is already present in a fault-tolerant files system like
+If all of the input data is already present in a fault-tolerant files system like
 HDFS, Spark Streaming can always recover from any failure and process all the data. This gives
 *exactly-once* semantics, that all the data will be processed exactly once no matter what fails.

 ## Semantics with input sources based on receivers
 {:.no_toc}
-Here we will first discuss the semantics in the context of different types of failures. As we
-discussed [earlier](#receiver-reliability), there are two kinds of receivers.
+For input sources based on receivers, the fault-tolerance semantics depend on both the failure
+scenario and the type of receiver.
+As we discussed [earlier](#receiver-reliability), there are two types of receivers:

 1. *Reliable Receiver* - These receivers acknowledge reliable sources only after ensuring that
 the received data has been replicated. If such a receiver fails,
 the buffered (unreplicated) data does not get acknowledged to the source. If the receiver is
-restarted, the source would resend the data, and so no data will be lost due to the failure.
+restarted, the source will resend the data, and therefore no data will be lost due to the failure.
 1. *Unreliable Receiver* - Such receivers can lose data when they fail due to worker
 or driver failures.

 Depending on what type of receivers are used we achieve the following semantics.
 If a worker node fails, then there is no data loss with reliable receivers. With unreliable
 receivers, data received but not replicated can get lost. If the driver node fails,
-then besides these losses, all the past data that were received and replicated in memory will be
+then besides these losses, all the past data that was received and replicated in memory will be
 lost. This will affect the results of the stateful transformations.

-To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of write
-ahead logs, that saves the received data to a fault-tolerant storage. With the [write ahead logs
+To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of _write
+ahead logs_ which saves the received data to fault-tolerant storage. With the [write ahead logs
 enabled](#deploying-applications) and reliable receivers, there is zero data loss and
 exactly-once semantics.

-The following table summarizes the semantics under failures.
+The following table summarizes the semantics under failures:

 <table class="table">
 <tr>
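
As a short illustration of the file-based semantics discussed in this hunk, a directory on a fault-tolerant file system can be used as the input source; the HDFS path is a placeholder, and `ssc` is an existing `StreamingContext`.

```scala
// Files atomically moved into this directory are picked up as new batches.
// Because the data already lives in HDFS, lost partitions can simply be
// recomputed, which is what gives the exactly-once behavior described above.
val lines = ssc.textFileStream("hdfs://namenode:8020/streaming/input")
lines.count().print()
```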
@@ -1994,5 +1995,5 @@ package and renamed for better clarity.

 * More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming)
 and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming)
-and [Python] ({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
+and [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
 * [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) and [video](http://youtu.be/g171ndOHgJ0) describing Spark Streaming.
