@@ -51,8 +51,9 @@ different languages.
51
51
** Note:** Python API for Spark Streaming has been introduced in Spark 1.2. It has all the DStream
52
52
transformations and almost all the output operations available in Scala and Java interfaces.
53
53
However, it has only support for basic sources like text files and text data over sockets.
54
- API for creating more sources like Kafka, and Flume will be available in future.
55
- Further information about available features in Python API are mentioned throughout this
54
+ <<<<<<< HEAD
55
+ APIs for additional sources, like Kafka and Flume, will be available in the future.
56
+ Further information about available features in the Python API are mentioned throughout this
56
57
document; look out for the tag
57
58
<span class =" badge " style =" background-color : grey " >Python API</span >.
58
59
@@ -622,7 +623,7 @@ as well as, to run the receiver(s).
622
623
a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will
623
624
be used to run the receiver, leaving no thread for processing the received data. Hence, when
624
625
running locally, always use "local[ * n* ] " as the master URL where * n* > number of receivers to run
625
- (see [ Spark Properties] (configuration.html#spark-properties.html for information on how to set
626
+ (see [ Spark Properties] ( configuration.html#spark-properties.html ) for information on how to set
626
627
the master).
627
628
628
629
- Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming
@@ -676,7 +677,7 @@ For more details on streams from sockets, files, and actors,
676
677
see the API documentations of the relevant functions in
677
678
[ StreamingContext] ( api/scala/index.html#org.apache.spark.streaming.StreamingContext ) for
678
679
Scala, [ JavaStreamingContext] ( api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html )
679
- for Java, and [ StreamingContext] .
680
+ for Java, and [ StreamingContext] ( api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext ) for Python .
680
681
681
682
### Advanced Sources
682
683
{:.no_toc}
@@ -1517,7 +1518,7 @@ sliding interval of a DStream is good setting to try.
1517
1518
***
1518
1519
1519
1520
## Deploying Applications
1520
- This section discussed the steps to deploy a Spark Streaming applications .
1521
+ This section discusses the steps to deploy a Spark Streaming application .
1521
1522
1522
1523
### Requirements
1523
1524
{:.no_toc}
@@ -1571,7 +1572,7 @@ To run a Spark Streaming applications, you need to have the following.
1571
1572
feature of write ahead logs. If enabled, all the data received from a receiver gets written into
1572
1573
a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
1573
1574
recovery, thus allowing zero data loss guarantees which is discussed in detail in the
1574
- [ Fault-tolerant Semantics] ( #fault-tolerant -semantics ) section. Enable this by setting the
1575
+ [ Fault-tolerance Semantics] ( #fault-tolerance -semantics ) section. Enable this by setting the
1575
1576
[ configuration parameter] ( configuration.html#spark-streaming )
1576
1577
` spark.streaming.receiver.writeAheadLogs.enable ` to ` true ` .
1577
1578
@@ -1617,7 +1618,7 @@ receivers are active, number of records received, receiver error, etc.)
1617
1618
and completed batches (batch processing times, queueing delays, etc.). This can be used to
1618
1619
monitor the progress of the streaming application.
1619
1620
1620
- The following two metrics in web UI are particularly important -
1621
+ The following two metrics in web UI are particularly important:
1621
1622
1622
1623
- * Processing Time* - The time to process each batch of data.
1623
1624
- * Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
@@ -1710,12 +1711,12 @@ before further processing.
1710
1711
{:.no_toc}
1711
1712
Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the
1712
1713
computation is not high enough. For example, for distributed reduce operations like ` reduceByKey `
1713
- and ` reduceByKeyAndWindow ` , the default number of parallel tasks is decided by the
1714
- [ config property] ( configuration.html#spark-properties ) ` spark.default.parallelism ` .
1715
- You can pass the level of parallelism as an argument (see
1714
+ and ` reduceByKeyAndWindow ` , the default number of parallel tasks is controlled by
1715
+ the ` spark.default.parallelism ` [ configuration property] ( configuration.html#spark-properties ) . You
1716
+ can pass the level of parallelism as an argument (see
1716
1717
[ ` PairDStreamFunctions ` ] ( api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions )
1717
- documentation), or set the [ config property ] ( configuration.html# spark-properties )
1718
- ` spark.default.parallelism ` to change the default.
1718
+ documentation), or set the ` spark.default.parallelism `
1719
+ [ configuration property ] ( configuration.html#spark-properties ) to change the default.
1719
1720
1720
1721
### Data Serialization
1721
1722
{:.no_toc}
@@ -1811,72 +1812,73 @@ consistent batch processing times.
1811
1812
***************************************************************************************************
1812
1813
1813
1814
# Fault-tolerance Semantics
1814
- In this section, we will discuss the behavior of Spark Streaming application in the event
1815
- of a node failure . To understand this, let us remember the basic fault-tolerance semantics of
1815
+ In this section, we will discuss the behavior of Spark Streaming applications in the event
1816
+ of node failures . To understand this, let us remember the basic fault-tolerance semantics of
1816
1817
Spark's RDDs.
1817
1818
1818
1819
1 . An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
1819
1820
remembers the lineage of deterministic operations that were used on a fault-tolerant input
1820
1821
dataset to create it.
1821
1822
1 . If any partition of an RDD is lost due to a worker node failure, then that partition can be
1822
1823
re-computed from the original fault-tolerant dataset using the lineage of operations.
1823
- 1 . Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
1824
- will always be the same irrespective of failures in Spark cluster.
1824
+ 1 . Assuming that all of the RDD transformations are deterministic, the data in the final transformed
1825
+ RDD will always be the same irrespective of failures in the Spark cluster.
1825
1826
1826
1827
Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
1827
- all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
1828
+ all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
1828
1829
the case for Spark Streaming as the data in most cases is received over the network (except when
1829
- ` fileStream ` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
1830
+ ` fileStream ` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
1830
1831
the received data is replicated among multiple Spark executors in worker nodes in the cluster
1831
1832
(default replication factor is 2). This leads to two kinds of data in the
1832
- system that needs to recovered in the event of a failure.
1833
+ system that needs to recovered in the event of failures:
1833
1834
1834
1835
1 . * Data received and replicated* - This data survives failure of a single worker node as a copy
1835
1836
of it exists on one of the nodes.
1836
1837
1 . * Data received but buffered for replication* - Since this is not replicated,
1837
1838
the only way to recover that data is to get it again from the source.
1838
1839
1839
- Furthermore, there are two kinds of failures that we should be concerned about.
1840
+ Furthermore, there are two kinds of failures that we should be concerned about:
1840
1841
1841
- 1 . * Failure of a Worker Node* - Any of the workers in the cluster can fail,
1842
- and all in-memory data on that node will be lost. If there are any receiver running on that
1843
- node, all buffered data will be lost.
1842
+ 1 . * Failure of a Worker Node* - Any of the worker nodes running executors can fail,
1843
+ and all in-memory data on those nodes will be lost. If any receivers were running on failed
1844
+ nodes, then their buffered data will be lost.
1844
1845
1 . * Failure of the Driver Node* - If the driver node running the Spark Streaming application
1845
- fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
1846
+ fails, then obviously the SparkContext is lost, and all executors with their in-memory
1846
1847
data are lost.
1847
1848
1848
1849
With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
1849
1850
1850
1851
## Semantics with files as input source
1851
1852
{:.no_toc}
1852
- In this case, since all the input data is already present in a fault-tolerant files system like
1853
+ If all of the input data is already present in a fault-tolerant files system like
1853
1854
HDFS, Spark Streaming can always recover from any failure and process all the data. This gives
1854
1855
* exactly-once* semantics, that all the data will be processed exactly once no matter what fails.
1855
1856
1856
1857
## Semantics with input sources based on receivers
1857
1858
{:.no_toc}
1858
- Here we will first discuss the semantics in the context of different types of failures. As we
1859
- discussed [ earlier] ( #receiver-reliability ) , there are two kinds of receivers.
1859
+ For input sources based on receivers, the fault-tolerance semantics depend on both the failure
1860
+ scenario and the type of receiver.
1861
+ As we discussed [ earlier] ( #receiver-reliability ) , there are two types of receivers:
1860
1862
1861
1863
1 . * Reliable Receiver* - These receivers acknowledge reliable sources only after ensuring that
1862
1864
the received data has been replicated. If such a receiver fails,
1863
1865
the buffered (unreplicated) data does not get acknowledged to the source. If the receiver is
1864
- restarted, the source would resend the data, and so no data will be lost due to the failure.
1866
+ restarted, the source will resend the data, and therefore no data will be lost due to the failure.
1865
1867
1 . * Unreliable Receiver* - Such receivers can lose data when they fail due to worker
1866
1868
or driver failures.
1867
1869
1868
1870
Depending on what type of receivers are used we achieve the following semantics.
1869
1871
If a worker node fails, then there is no data loss with reliable receivers. With unreliable
1870
1872
receivers, data received but not replicated can get lost. If the driver node fails,
1871
- then besides these losses, all the past data that were received and replicated in memory will be
1873
+ then besides these losses, all the past data that was received and replicated in memory will be
1872
1874
lost. This will affect the results of the stateful transformations.
1873
1875
1874
- To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of write
1875
- ahead logs, that saves the received data to a fault-tolerant storage. With the [ write ahead logs
1876
+ To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of _ write
1877
+ ahead logs _ which saves the received data to fault-tolerant storage. With the [ write ahead logs
1876
1878
enabled] ( #deploying-applications ) and reliable receivers, there is zero data loss and
1877
1879
exactly-once semantics.
1878
1880
1879
- The following table summarizes the semantics under failures.
1881
+ The following table summarizes the semantics under failures:
1880
1882
1881
1883
<table class =" table " >
1882
1884
<tr >
@@ -2006,5 +2008,5 @@ package and renamed for better clarity.
2006
2008
2007
2009
* More examples in [ Scala] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming )
2008
2010
and [ Java] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming )
2009
- and [ Python] ({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
2011
+ and [ Python] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming )
2010
2012
* [ Paper] ( http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf ) and [ video] ( http://youtu.be/g171ndOHgJ0 ) describing Spark Streaming.
0 commit comments