@@ -51,8 +51,8 @@ different languages.
**Note:** *Python API has been introduced in Spark 1.2. It has all the DStream transformations
and almost all the output operations available in Scala and Java interfaces.
However, it has only support for basic sources like text files and text data over sockets.
- API for creating more sources like Kafka, and Flume will be available in future.
- Further information about available features in Python API are mentioned throughout this
+ APIs for additional sources, like Kafka and Flume, will be available in the future.
+ Further information about available features in the Python API is mentioned throughout this
document; look out for the tag* "**Note on Python API**".

***************************************************************************************************
@@ -1506,7 +1506,7 @@ sliding interval of a DStream is good setting to try.
***

## Deploying Applications
- This section discussed the steps to deploy a Spark Streaming applications .
+ This section discusses the steps to deploy a Spark Streaming application.

### Requirements
{:.no_toc}
@@ -1605,7 +1605,7 @@ receivers are active, number of records received, receiver error, etc.)
and completed batches (batch processing times, queueing delays, etc.). This can be used to
monitor the progress of the streaming application.

- The following two metrics in web UI are particularly important -
+ The following two metrics in the web UI are particularly important:

- *Processing Time* - The time to process each batch of data.
- *Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
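The relationship between these two metrics can be sketched with a small simulation (plain Python, not Spark code; all names are illustrative): as long as each batch is processed within the batch interval, the scheduling delay stays near zero, but once processing time exceeds the interval, the delay grows without bound.

```python
# Toy simulation of scheduling delay: batches arrive every `batch_interval`
# seconds and are processed one at a time, each taking the corresponding
# entry of `processing_times` seconds.

def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waited in the queue before processing started."""
    delays = []
    free_at = 0.0  # time at which processing of the previous batch finishes
    for i, proc in enumerate(processing_times):
        arrival = i * batch_interval
        start = max(arrival, free_at)
        delays.append(start - arrival)
        free_at = start + proc
    return delays

# Stable: each batch finishes within the 2s interval, so delay stays 0.
print(scheduling_delays(2.0, [1.5, 1.5, 1.5, 1.5]))  # [0.0, 0.0, 0.0, 0.0]

# Unstable: 3s of processing per 2s interval, so delay grows by 1s per batch.
print(scheduling_delays(2.0, [3.0, 3.0, 3.0, 3.0]))  # [0.0, 1.0, 2.0, 3.0]
```

This is why a steadily increasing scheduling delay is the key warning sign that the system cannot keep up with the data rate.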
@@ -1799,38 +1799,38 @@ consistent batch processing times.

***************************************************************************************************

# Fault-tolerance Semantics
- In this section, we will discuss the behavior of Spark Streaming application in the event
- of a node failure. To understand this, let us remember the basic fault-tolerance semantics of
+ In this section, we will discuss the behavior of Spark Streaming applications in the event
+ of node failures. To understand this, let us remember the basic fault-tolerance semantics of
Spark's RDDs.

1. An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
remembers the lineage of deterministic operations that were used on a fault-tolerant input
dataset to create it.
1. If any partition of an RDD is lost due to a worker node failure, then that partition can be
re-computed from the original fault-tolerant dataset using the lineage of operations.
- 1. Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
- will always be the same irrespective of failures in Spark cluster.
+ 1. Assuming that all of the RDD transformations are deterministic, the data in the final transformed
+ RDD will always be the same irrespective of failures in the Spark cluster.
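The three properties above can be illustrated with a toy model (plain Python, not the real RDD API; class and method names are made up for the sketch): an object that remembers its fault-tolerant source plus the ordered list of deterministic transformations, so a lost partition can be re-computed and is guaranteed to come out identical.

```python
# Toy model of lineage-based recovery: each "RDD" remembers its fault-tolerant
# source partitions and the deterministic ops applied to them, so any lost
# partition can be re-computed identically from the source.

class ToyRDD:
    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions   # fault-tolerant input dataset
        self.lineage = lineage            # deterministic transformations, in order

    def map(self, f):
        # Transformations don't mutate; they return a new ToyRDD with longer lineage.
        return ToyRDD(self.source, self.lineage + (f,))

    def compute_partition(self, i):
        data = list(self.source[i])
        for f in self.lineage:
            data = [f(x) for x in data]
        return data

rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
first = rdd.compute_partition(1)      # initial computation
recovered = rdd.compute_partition(1)  # re-computation after "losing" the partition
assert first == recovered == [31, 41]
```

Because the source is fault-tolerant and every step is deterministic, re-computation after a failure yields exactly the same data as the original computation.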

Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
- all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
+ all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
the case for Spark Streaming as the data in most cases is received over the network (except when
- `fileStream` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
+ `fileStream` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
the received data is replicated among multiple Spark executors in worker nodes in the cluster
(default replication factor is 2). This leads to two kinds of data in the
- system that needs to recovered in the event of a failure.
+ system that need to be recovered in the event of failures:

1. *Data received and replicated* - This data survives failure of a single worker node as a copy
of it exists on one of the nodes.
1. *Data received but buffered for replication* - Since this is not replicated,
the only way to recover that data is to get it again from the source.
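The distinction between the two kinds of data can be sketched as a toy model (plain Python, not Spark code; node and block names are invented for illustration): a block replicated to two nodes survives the loss of either one, while a block still buffered on only the receiver's node does not.

```python
# Toy model of received data in a 3-node cluster: "replicated" blocks exist on
# two nodes (replication factor 2); "buffered" blocks exist only on the node
# whose receiver got them and has not yet replicated them.

nodes = {
    "worker-1": {"replicated": {"block-A"}, "buffered": {"block-C"}},
    "worker-2": {"replicated": {"block-A", "block-B"}, "buffered": set()},
    "worker-3": {"replicated": {"block-B"}, "buffered": set()},
}

def surviving_blocks(nodes, failed):
    """Blocks still present somewhere in the cluster after `failed` is lost."""
    alive = set()
    for name, store in nodes.items():
        if name != failed:
            alive |= store["replicated"] | store["buffered"]
    return alive

# Losing worker-1: block-A survives via its copy on worker-2, but the
# buffered block-C is gone and must be fetched again from the source.
print(sorted(surviving_blocks(nodes, "worker-1")))  # ['block-A', 'block-B']
```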

- Furthermore, there are two kinds of failures that we should be concerned about.
+ Furthermore, there are two kinds of failures that we should be concerned about:

- 1. *Failure of a Worker Node* - Any of the workers in the cluster can fail,
- and all in-memory data on that node will be lost. If there are any receiver running on that
- node, all buffered data will be lost.
+ 1. *Failure of a Worker Node* - Any of the nodes in the cluster can fail,
+ and all in-memory data on those nodes will be lost. If any receivers were running on failed
+ nodes, then their buffered data will be lost.
1. *Failure of the Driver Node* - If the driver node running the Spark Streaming application
- fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
+ fails, then obviously the SparkContext is lost, and all executors with their in-memory
data are lost.

With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.