@@ -51,8 +51,8 @@ different languages.
51
51
** Note:** * Python API has been introduced in Spark 1.2. It has all the DStream transformations
52
52
and almost all the output operations available in Scala and Java interfaces.
53
53
However, it has only support for basic sources like text files and text data over sockets.
54
- API for creating more sources like Kafka, and Flume will be available in future.
55
- Further information about available features in Python API are mentioned throughout this
54
+ APIs for additional sources, like Kafka and Flume, will be available in the future.
55
+ Further information about available features in the Python API is mentioned throughout this
56
56
document; look out for the tag* "** Note on Python API** ".
57
57
58
58
***************************************************************************************************
@@ -622,7 +622,7 @@ as well as, to run the receiver(s).
622
622
a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will
623
623
be used to run the receiver, leaving no thread for processing the received data. Hence, when
624
624
running locally, always use "local[ * n* ] " as the master URL where * n* > number of receivers to run
625
- (see [ Spark Properties] (configuration.html#spark-properties.html for information on how to set
625
+ (see [ Spark Properties] ( configuration.html#spark-properties.html ) for information on how to set
626
626
the master).
627
627
628
628
- Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming
@@ -667,7 +667,7 @@ methods for creating DStreams from files and Akka actors as input sources.
667
667
Guide] ( streaming-custom-receivers.html#implementing-and-using-a-custom-actor-based-receiver ) for
668
668
more details.
669
669
670
- * Note on Python API:** Since actors are available only in the Java and Scala
670
+ ** Note on Python API:** Since actors are available only in the Java and Scala
671
671
libraries, ` actorStream ` is not available in the Python API.
672
672
673
673
- ** Queue of RDDs as a Stream:** For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using ` streamingContext.queueStream(queueOfRDDs) ` . Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
@@ -676,7 +676,7 @@ For more details on streams from sockets, files, and actors,
676
676
see the API documentations of the relevant functions in
677
677
[ StreamingContext] ( api/scala/index.html#org.apache.spark.streaming.StreamingContext ) for
678
678
Scala, [ JavaStreamingContext] ( api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html )
679
- for Java, and [ StreamingContext] .
679
+ for Java, and [ StreamingContext] ( api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext ) for Python .
680
680
681
681
### Advanced Sources
682
682
{:.no_toc}
@@ -1506,7 +1506,7 @@ sliding interval of a DStream is good setting to try.
1506
1506
***
1507
1507
1508
1508
## Deploying Applications
1509
- This section discussed the steps to deploy a Spark Streaming applications .
1509
+ This section discusses the steps to deploy a Spark Streaming application .
1510
1510
1511
1511
### Requirements
1512
1512
{:.no_toc}
@@ -1559,7 +1559,7 @@ To run a Spark Streaming applications, you need to have the following.
1559
1559
feature of write ahead logs. If enabled, all the data received from a receiver gets written into
1560
1560
a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
1561
1561
recovery, thus allowing zero data loss guarantees which is discussed in detail in the
1562
- [ Fault-tolerant Semantics] ( #fault-tolerant -semantics ) section. Enable this by setting the
1562
+ [ Fault-tolerance Semantics] ( #fault-tolerance -semantics ) section. Enable this by setting the
1563
1563
[ configuration parameter] ( configuration.html#spark-streaming )
1564
1564
` spark.streaming.receiver.writeAheadLogs.enable ` to ` true ` .
1565
1565
@@ -1605,7 +1605,7 @@ receivers are active, number of records received, receiver error, etc.)
1605
1605
and completed batches (batch processing times, queueing delays, etc.). This can be used to
1606
1606
monitor the progress of the streaming application.
1607
1607
1608
- The following two metrics in web UI are particularly important -
1608
+ The following two metrics in web UI are particularly important:
1609
1609
1610
1610
- * Processing Time* - The time to process each batch of data.
1611
1611
- * Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
@@ -1698,12 +1698,12 @@ before further processing.
1698
1698
{:.no_toc}
1699
1699
Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the
1700
1700
computation is not high enough. For example, for distributed reduce operations like ` reduceByKey `
1701
- and ` reduceByKeyAndWindow ` , the default number of parallel tasks is decided by the [ config property ]
1702
- (configuration.html#spark-properties) ` spark.default.parallelism ` . You can pass the level of
1703
- parallelism as an argument (see [ ` PairDStreamFunctions ` ]
1704
- (api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
1705
- documentation), or set the [ config property ] ( configuration.html# spark-properties )
1706
- ` spark.default.parallelism ` to change the default.
1701
+ and ` reduceByKeyAndWindow ` , the default number of parallel tasks is controlled by
1702
+ the ` spark.default.parallelism ` [ configuration property ] ( configuration.html#spark-properties ) . You
1703
+ can pass the level of parallelism as an argument (see
1704
+ [ ` PairDStreamFunctions ` ] ( api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions )
1705
+ documentation), or set the ` spark.default.parallelism `
1706
+ [ configuration property ] ( configuration.html#spark-properties ) to change the default.
1707
1707
1708
1708
### Data Serialization
1709
1709
{:.no_toc}
@@ -1799,72 +1799,73 @@ consistent batch processing times.
1799
1799
***************************************************************************************************
1800
1800
1801
1801
# Fault-tolerance Semantics
1802
- In this section, we will discuss the behavior of Spark Streaming application in the event
1803
- of a node failure . To understand this, let us remember the basic fault-tolerance semantics of
1802
+ In this section, we will discuss the behavior of Spark Streaming applications in the event
1803
+ of node failures . To understand this, let us remember the basic fault-tolerance semantics of
1804
1804
Spark's RDDs.
1805
1805
1806
1806
1 . An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
1807
1807
remembers the lineage of deterministic operations that were used on a fault-tolerant input
1808
1808
dataset to create it.
1809
1809
1 . If any partition of an RDD is lost due to a worker node failure, then that partition can be
1810
1810
re-computed from the original fault-tolerant dataset using the lineage of operations.
1811
- 1 . Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
1812
- will always be the same irrespective of failures in Spark cluster.
1811
+ 1 . Assuming that all of the RDD transformations are deterministic, the data in the final transformed
1812
+ RDD will always be the same irrespective of failures in the Spark cluster.
1813
1813
1814
1814
Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
1815
- all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
1815
+ all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
1816
1816
the case for Spark Streaming as the data in most cases is received over the network (except when
1817
- ` fileStream ` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
1817
+ ` fileStream ` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
1818
1818
the received data is replicated among multiple Spark executors in worker nodes in the cluster
1819
1819
(default replication factor is 2). This leads to two kinds of data in the
1820
- system that needs to recovered in the event of a failure.
1820
+ system that needs to recovered in the event of failures:
1821
1821
1822
1822
1 . * Data received and replicated* - This data survives failure of a single worker node as a copy
1823
1823
of it exists on one of the nodes.
1824
1824
1 . * Data received but buffered for replication* - Since this is not replicated,
1825
1825
the only way to recover that data is to get it again from the source.
1826
1826
1827
- Furthermore, there are two kinds of failures that we should be concerned about.
1827
+ Furthermore, there are two kinds of failures that we should be concerned about:
1828
1828
1829
- 1 . * Failure of a Worker Node* - Any of the workers in the cluster can fail,
1830
- and all in-memory data on that node will be lost. If there are any receiver running on that
1831
- node, all buffered data will be lost.
1829
+ 1 . * Failure of a Worker Node* - Any of the worker nodes running executors can fail,
1830
+ and all in-memory data on those nodes will be lost. If any receivers were running on failed
1831
+ nodes, then their buffered data will be lost.
1832
1832
1 . * Failure of the Driver Node* - If the driver node running the Spark Streaming application
1833
- fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
1833
+ fails, then obviously the SparkContext is lost, and all executors with their in-memory
1834
1834
data are lost.
1835
1835
1836
1836
With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
1837
1837
1838
1838
## Semantics with files as input source
1839
1839
{:.no_toc}
1840
- In this case, since all the input data is already present in a fault-tolerant files system like
1840
+ If all of the input data is already present in a fault-tolerant files system like
1841
1841
HDFS, Spark Streaming can always recover from any failure and process all the data. This gives
1842
1842
* exactly-once* semantics, that all the data will be processed exactly once no matter what fails.
1843
1843
1844
1844
## Semantics with input sources based on receivers
1845
1845
{:.no_toc}
1846
- Here we will first discuss the semantics in the context of different types of failures. As we
1847
- discussed [ earlier] ( #receiver-reliability ) , there are two kinds of receivers.
1846
+ For input sources based on receivers, the fault-tolerance semantics depend on both the failure
1847
+ scenario and the type of receiver.
1848
+ As we discussed [ earlier] ( #receiver-reliability ) , there are two types of receivers:
1848
1849
1849
1850
1 . * Reliable Receiver* - These receivers acknowledge reliable sources only after ensuring that
1850
1851
the received data has been replicated. If such a receiver fails,
1851
1852
the buffered (unreplicated) data does not get acknowledged to the source. If the receiver is
1852
- restarted, the source would resend the data, and so no data will be lost due to the failure.
1853
+ restarted, the source will resend the data, and therefore no data will be lost due to the failure.
1853
1854
1 . * Unreliable Receiver* - Such receivers can lose data when they fail due to worker
1854
1855
or driver failures.
1855
1856
1856
1857
Depending on what type of receivers are used we achieve the following semantics.
1857
1858
If a worker node fails, then there is no data loss with reliable receivers. With unreliable
1858
1859
receivers, data received but not replicated can get lost. If the driver node fails,
1859
- then besides these losses, all the past data that were received and replicated in memory will be
1860
+ then besides these losses, all the past data that was received and replicated in memory will be
1860
1861
lost. This will affect the results of the stateful transformations.
1861
1862
1862
- To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of write
1863
- ahead logs, that saves the received data to a fault-tolerant storage. With the [ write ahead logs
1863
+ To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of _ write
1864
+ ahead logs _ which saves the received data to fault-tolerant storage. With the [ write ahead logs
1864
1865
enabled] ( #deploying-applications ) and reliable receivers, there is zero data loss and
1865
1866
exactly-once semantics.
1866
1867
1867
- The following table summarizes the semantics under failures.
1868
+ The following table summarizes the semantics under failures:
1868
1869
1869
1870
<table class =" table " >
1870
1871
<tr >
@@ -1994,5 +1995,5 @@ package and renamed for better clarity.
1994
1995
1995
1996
* More examples in [ Scala] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming )
1996
1997
and [ Java] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming )
1997
- and [ Python] ({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
1998
+ and [ Python] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming )
1998
1999
* [ Paper] ( http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf ) and [ video] ( http://youtu.be/g171ndOHgJ0 ) describing Spark Streaming.
0 commit comments