
Commit 8849a96

andrewor14 authored and tdas committed
Fix typos in Spark Streaming programming guide
Author: Andrew Or <andrewor14@gmail.com>

Closes #536 from andrewor14/streaming-typos and squashes the following commits:

a05faa6 [Andrew Or] Fix broken link and wording
bc2e4bc [Andrew Or] Merge github.com:apache/incubator-spark into streaming-typos
d5515b4 [Andrew Or] TD's comments
767ef12 [Andrew Or] Fix broken links
8f4c731 [Andrew Or] Fix typos in programming guide
1 parent fbd66a5 commit 8849a96

1 file changed: +13 −14 lines

docs/streaming-programming-guide.md

@@ -428,9 +428,9 @@ KafkaUtils.createStream(javaStreamingContext, kafkaParams, ...);
 </div>
 </div>
 
-For more details on these additional sources, see the corresponding [API documentation]
-(#where-to-go-from-here). Furthermore, you can also implement your own custom receiver
-for your sources. See the [Custom Receiver Guide](streaming-custom-receivers.html).
+For more details on these additional sources, see the corresponding [API documentation](#where-to-go-from-here).
+Furthermore, you can also implement your own custom receiver for your sources. See the
+[Custom Receiver Guide](streaming-custom-receivers.html).
 
 ## Operations
 There are two kinds of DStream operations - _transformations_ and _output operations_. Similar to
@@ -520,9 +520,8 @@ The last two transformations are worth highlighting again.
 
 <h4>UpdateStateByKey Operation</h4>
 
-The `updateStateByKey` operation allows
-you to main arbitrary stateful computation, where you want to maintain some state data and
-continuously update it with new information. To use this, you will have to do two steps.
+The `updateStateByKey` operation allows you to maintain arbitrary state while continuously updating
+it with new information. To use this, you will have to do two steps.
 
 1. Define the state - The state can be of arbitrary data type.
 1. Define the state update function - Specify with a function how to update the state using the
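The two steps that this hunk describes can be sketched in plain Python. This is an illustrative simulation of the `updateStateByKey` semantics only, not the Spark API; the names `update_state_by_key` and `running_count` are hypothetical:

```python
from typing import Callable, Dict, List, Optional, Tuple

def update_state_by_key(
    state: Dict[str, int],
    batch: List[Tuple[str, int]],
    update_func: Callable[[List[int], Optional[int]], Optional[int]],
) -> Dict[str, int]:
    """For every key in this batch or in the previous state, combine the
    batch's new values with the previous state using update_func."""
    new_state: Dict[str, int] = {}
    for key in set(state) | {k for k, _ in batch}:
        new_values = [v for k, v in batch if k == key]
        updated = update_func(new_values, state.get(key))
        if updated is not None:  # returning None drops the key's state
            new_state[key] = updated
    return new_state

# Step 1: the state is a running count per key (any data type would do).
# Step 2: the update function merges new values into the previous state.
def running_count(new_values: List[int], prev: Optional[int]) -> int:
    return (prev or 0) + sum(new_values)

state: Dict[str, int] = {}
state = update_state_by_key(state, [("a", 1), ("b", 1), ("a", 1)], running_count)
state = update_state_by_key(state, [("a", 1)], running_count)
print(sorted(state.items()))  # [('a', 3), ('b', 1)]
```

In Spark itself the update function is applied per key across the cluster; this sketch only shows the per-batch merge contract.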
@@ -925,7 +924,7 @@ exception saying so.
 ## Monitoring
 Besides Spark's in-built [monitoring capabilities](monitoring.html),
 the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
 which allows you to get statistics of batch processing times, queueing delays,
 and total end-to-end delays. Note that this is still an experimental API and it is likely to be
 improved upon (i.e., more information reported) in the future.
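The listener pattern this hunk refers to can be illustrated with a small stand-in: a callback the driver would invoke after each batch, accumulating processing-time and end-to-end-delay statistics. The class and method names here are hypothetical; only the shape mirrors the StreamingListener interface:

```python
from typing import List

class BatchStatsListener:
    """Hypothetical listener-style callback: collects per-batch stats
    (processing time, total delay) as batches complete."""

    def __init__(self) -> None:
        self.processing_times: List[float] = []
        self.total_delays: List[float] = []

    def on_batch_completed(self, submitted_at: float, started_at: float,
                           finished_at: float) -> None:
        self.processing_times.append(finished_at - started_at)
        # total end-to-end delay = queueing delay + processing time
        self.total_delays.append(finished_at - submitted_at)

listener = BatchStatsListener()
listener.on_batch_completed(submitted_at=0.0, started_at=0.5, finished_at=1.5)
listener.on_batch_completed(submitted_at=2.0, started_at=2.5, finished_at=3.0)
print(listener.processing_times)  # [1.0, 0.5]
```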
@@ -1000,11 +999,11 @@ Since all data is modeled as RDDs with their lineage of deterministic operations
 for output operations.
 
 ## Failure of the Driver Node
-To allows a streaming application to operate 24/7, Spark Streaming allows a streaming computation
+For a streaming application to operate 24/7, Spark Streaming allows a streaming computation
 to be resumed even after the failure of the driver node. Spark Streaming periodically writes the
 metadata information of the DStreams setup through the `StreamingContext` to a
 HDFS directory (can be any Hadoop-compatible filesystem). This periodic
-*checkpointing* can be enabled by setting a the checkpoint
+*checkpointing* can be enabled by setting the checkpoint
 directory using `ssc.checkpoint(<checkpoint directory>)` as described
 [earlier](#rdd-checkpointing). On failure of the driver node,
 the lost `StreamingContext` can be recovered from this information, and restarted.
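The recovery pattern this hunk describes (periodically checkpoint setup metadata; on restart, rebuild from the checkpoint if one exists) can be sketched in plain Python. This only mimics the shape of `StreamingContext.getOrCreate`; the functions and the metadata layout are hypothetical, not Spark's on-disk format:

```python
import json
import os
import tempfile

def checkpoint(directory: str, metadata: dict) -> None:
    """Write the context's setup metadata into the checkpoint directory."""
    with open(os.path.join(directory, "metadata.json"), "w") as f:
        json.dump(metadata, f)

def get_or_create(directory: str, create_func) -> dict:
    """If a checkpoint exists, recover the setup from it (driver restart);
    otherwise build it fresh and checkpoint it (first launch)."""
    path = os.path.join(directory, "metadata.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    metadata = create_func()
    checkpoint(directory, metadata)
    return metadata

ckpt_dir = tempfile.mkdtemp()
first = get_or_create(ckpt_dir, lambda: {"batch_interval_s": 1, "dstreams": ["lines"]})
# Simulated driver restart: create_func is NOT called again; the old setup wins.
recovered = get_or_create(ckpt_dir, lambda: {"should": "not run"})
print(recovered == first)  # True
```

This is also why, as the guide notes, stale checkpoints must be deleted when recompiled code is launched: recovery always prefers the checkpointed setup over the freshly defined one.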
@@ -1105,8 +1104,8 @@ classes. So, if you are using `getOrCreate`, then make sure that the checkpoint
 explicitly deleted every time recompiled code needs to be launched.
 
 This failure recovery can be done automatically using Spark's
-[standalone cluster mode](spark-standalone.html), which allows any Spark
-application's driver to be as well as, ensures automatic restart of the driver on failure (see
+[standalone cluster mode](spark-standalone.html), which allows the driver of any Spark application
+to be launched within the cluster and be restarted on failure (see
 [supervise mode](spark-standalone.html#launching-applications-inside-the-cluster)). This can be
 tested locally by launching the above example using the supervise mode in a
 local standalone cluster and killing the java process running the driver (will be shown as
@@ -1123,7 +1122,7 @@ There are two different failure behaviors based on which input sources are used.
 1. _Using HDFS files as input source_ - Since the data is reliably stored on HDFS, all data can
 re-computed and therefore no data will be lost due to any failure.
 1. _Using any input source that receives data through a network_ - The received input data is
-replicated in memory to multiple nodes. Since, all the data in the Spark worker's memory is lost
+replicated in memory to multiple nodes. Since all the data in the Spark worker's memory is lost
 when the Spark driver fails, the past input data will not be accessible and driver recovers.
 Hence, if stateful and window-based operations are used
 (like `updateStateByKey`, `window`, `countByValueAndWindow`, etc.), then the intermediate state
@@ -1133,11 +1132,11 @@ In future releases, we will support full recoverability for all input sources. N
 non-stateful transformations like `map`, `count`, and `reduceByKey`, with _all_ input streams,
 the system, upon restarting, will continue to receive and process new data.
 
-To better understand the behavior of the system under driver failure with a HDFS source, lets
+To better understand the behavior of the system under driver failure with a HDFS source, let's
 consider what will happen with a file input stream. Specifically, in the case of the file input
 stream, it will correctly identify new files that were created while the driver was down and
 process them in the same way as it would have if the driver had not failed. To explain further
-in the case of file input stream, we shall use an example. Lets say, files are being generated
+in the case of file input stream, we shall use an example. Let's say, files are being generated
 every second, and a Spark Streaming program reads every new file and output the number of lines
 in the file. This is what the sequence of outputs would be with and without a driver failure.
 
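The file-input-stream behavior this hunk describes can be simulated with a toy "driver" that, on each batch, picks up files it has not yet seen and outputs their line counts, so files created while the driver was down are still processed after a restart. Everything here (function name, file names) is a hypothetical illustration, not Spark code:

```python
from typing import Dict, List, Set, Tuple

def process_new_files(seen: Set[str], files: Dict[str, str]) -> List[Tuple[str, int]]:
    """One batch: identify files not yet processed and output their line counts."""
    counts: List[Tuple[str, int]] = []
    for name in sorted(files):
        if name not in seen:
            seen.add(name)
            counts.append((name, len(files[name].splitlines())))
    return counts

files = {"t1.txt": "a\nb\n", "t2.txt": "a\n"}
seen: Set[str] = set()
print(process_new_files(seen, files))  # [('t1.txt', 2), ('t2.txt', 1)]

# "Driver" goes down while two more files are generated...
files.update({"t3.txt": "a\nb\nc\n", "t4.txt": "x\n"})

# ...and after restart the backlog is correctly identified and processed:
print(process_new_files(seen, files))  # [('t3.txt', 3), ('t4.txt', 1)]
```

The key property, matching the guide's text, is that the post-restart batch produces the same per-file outputs as it would have without the failure.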