Commit f015397 (parent 3019f3a)

Minor grammar / pluralization fixes.


docs/streaming-programming-guide.md

Lines changed: 16 additions & 16 deletions
@@ -51,8 +51,8 @@ different languages.
 **Note:** *Python API has been introduced in Spark 1.2. It has all the DStream transformations
 and almost all the output operations available in Scala and Java interfaces.
 However, it has only support for basic sources like text files and text data over sockets.
-API for creating more sources like Kafka, and Flume will be available in future.
-Further information about available features in Python API are mentioned throughout this
+APIs for additional sources, like Kafka and Flume, will be available in the future.
+Further information about available features in the Python API is mentioned throughout this
 document; look out for the tag* "**Note on Python API**".

 ***************************************************************************************************
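For context, a minimal sketch (not part of this commit) of the kind of basic source the note describes: text data over a socket, via the `pyspark.streaming` API introduced in Spark 1.2. The host and port are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "PythonStreamingExample")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# socketTextStream is one of the basic sources the Python API supports;
# "localhost"/9999 are placeholder values for illustration only.
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```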
@@ -1506,7 +1506,7 @@ sliding interval of a DStream is good setting to try.
 ***

 ## Deploying Applications
-This section discussed the steps to deploy a Spark Streaming applications.
+This section discusses the steps to deploy a Spark Streaming application.

 ### Requirements
 {:.no_toc}
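As a hedged aside (not part of this commit), one requirement the deploying section relates to is configuring a checkpoint directory on a fault-tolerant file system so state can be recovered after a restart. A minimal sketch; the HDFS URI is a placeholder:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DeployedStreamingApp")
ssc = StreamingContext(sc, 1)

# Checkpoint to a fault-tolerant store (placeholder path) so the
# application can recover from driver failures when redeployed.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/streaming-app")
```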
@@ -1605,7 +1605,7 @@ receivers are active, number of records received, receiver error, etc.)
 and completed batches (batch processing times, queueing delays, etc.). This can be used to
 monitor the progress of the streaming application.

-The following two metrics in web UI are particularly important -
+The following two metrics in web UI are particularly important:

 - *Processing Time* - The time to process each batch of data.
 - *Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
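A rough illustration (not from the commit) of the stability condition these two metrics capture: if per-batch processing time stays below the batch interval, scheduling delay stays near zero; otherwise the backlog grows without bound. The numbers below are example values.

```python
batch_interval_s = 1.0    # batch duration set on the StreamingContext
processing_time_s = 0.8   # observed per-batch processing time (example)

# Each batch adds this much to the queue when processing can't keep up.
backlog_growth_per_batch = max(0.0, processing_time_s - batch_interval_s)
if backlog_growth_per_batch == 0.0:
    print("stable: scheduling delay stays bounded")
else:
    print("unstable: scheduling delay keeps increasing")
```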
@@ -1799,38 +1799,38 @@ consistent batch processing times.
 ***************************************************************************************************

 # Fault-tolerance Semantics
-In this section, we will discuss the behavior of Spark Streaming application in the event
-of a node failure. To understand this, let us remember the basic fault-tolerance semantics of
+In this section, we will discuss the behavior of Spark Streaming applications in the event
+of node failures. To understand this, let us remember the basic fault-tolerance semantics of
 Spark's RDDs.

 1. An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
 remembers the lineage of deterministic operations that were used on a fault-tolerant input
 dataset to create it.
 1. If any partition of an RDD is lost due to a worker node failure, then that partition can be
 re-computed from the original fault-tolerant dataset using the lineage of operations.
-1. Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
-will always be the same irrespective of failures in Spark cluster.
+1. Assuming that all of the RDD transformations are deterministic, the data in the final transformed
+RDD will always be the same irrespective of failures in the Spark cluster.

 Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
-all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
+all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
 the case for Spark Streaming as the data in most cases is received over the network (except when
-`fileStream` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
+`fileStream` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
 the received data is replicated among multiple Spark executors in worker nodes in the cluster
 (default replication factor is 2). This leads to two kinds of data in the
-system that needs to recovered in the event of a failure.
+system that needs to recovered in the event of failures:

 1. *Data received and replicated* - This data survives failure of a single worker node as a copy
 of it exists on one of the nodes.
 1. *Data received but buffered for replication* - Since this is not replicated,
 the only way to recover that data is to get it again from the source.

-Furthermore, there are two kinds of failures that we should be concerned about.
+Furthermore, there are two kinds of failures that we should be concerned about:

-1. *Failure of a Worker Node* - Any of the workers in the cluster can fail,
-and all in-memory data on that node will be lost. If there are any receiver running on that
-node, all buffered data will be lost.
+1. *Failure of a Worker Node* - Any of the nodes in the cluster can fail,
+and all in-memory data on those nodes will be lost. If any receivers were running on failed
+nodes, then their buffered data will be lost.
 1. *Failure of the Driver Node* - If the driver node running the Spark Streaming application
-fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
+fails, then obviously the SparkContext is lost, and all executors with their in-memory
 data are lost.

 With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
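For illustration (not part of this commit), a sketch of how a receiver-based input stream relates to the replication discussed above: a storage level with a `_2` suffix requests two replicas of received data across executors, matching the default replication factor of 2. The host, port, and the choice of `MEMORY_AND_DISK_2` are placeholder assumptions.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "ReplicationExample")
ssc = StreamingContext(sc, 1)

# Received data is stored with two replicas ("_2"), so it survives the
# failure of a single worker node, as described above.
lines = ssc.socketTextStream("localhost", 9999,
                             storageLevel=StorageLevel.MEMORY_AND_DISK_2)
```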
